Netdev List
 help / color / mirror / Atom feed
* Re: DDoS attack causing bad effect on conntrack searches
From: Patrick McHardy @ 2010-06-01 10:18 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Changli Gao, hawk, Jesper Dangaard Brouer, paulmck,
	Linux Kernel Network Hackers, Netfilter Developers
In-Reply-To: <1275368732.2478.88.camel@edumazet-laptop>

Eric Dumazet wrote:
> Le mardi 01 juin 2010 à 08:28 +0800, Changli Gao a écrit :
>> On Tue, Jun 1, 2010 at 5:21 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>> I had a look at current conntrack and found the 'unconfirmed' list was
>>> maybe a candidate for a potential blackhole.
>>>
>>> That is, if a reader happens to hit an entry that is moved from regular
>>> hash table slot 'hash' to unconfirmed list,
>> Sorry, but I can't find where we do this things. unconfirmed list is
>> used to track the unconfirmed cts, whose corresponding skbs are still
>> in path from the first to the last netfilter hooks. As soon as the
>> skbs end their travel in netfilter, the corresponding cts will be
>> confirmed(moving ct from unconfirmed list to regular hash table).
>>
> 
> So netfilter is a monolithic thing.
> 
> When a packet begins its travel into netfilter, you guarantee that no
> other packet can also begin its travel and find an unconfirmed
> conntrack ?

Correct, the unconfirmed list exists only for cleanup.

> I wonder why we use atomic ops then to track the confirmed bit :)

Good question, that looks unnecessary :)

>> unconfirmed list should be small, as networking receiving is in BH.
> 
> So according to you, netfilter/ct runs only in input path ?
> 
> So I assume a packet is handled by CPU X, creates a new conntrack
> (possibly early droping an old entry that was previously in a standard
> hash chain), inserted in unconfirmed list. _You_ guarantee another CPU
> Y, handling another packet, possibly sent by a hacker reading your
> netdev mails, cannot find the conntrack that was early dropped ?
> 
>> How about implementing unconfirmed list as a per cpu variable?
> 
> I first implemented such a patch to reduce cache line contention, then I
> asked to myself : What is exactly an unconfirmed conntrack ? Can their
> number be unbounded ? If yes, we have a problem, even on a two cpus
> machine. Using two lists instead of one wont solve the fundamental
> problem.

If a new conntrack is created in PRE_ROUTING or LOCAL_OUT, it will be
added to the unconfirmed list and moved to the hash as soon as the
packet passes POST_ROUTING. This means the number of unconfirmed entries
created by the network is bound by the number of CPUs due to BH
processing. The number created by locally generated packets is unbound
in case of preemptible kernels however.

> The real question is, why do we need this unconfirmed 'list' in the
> first place. Is it really a private per cpu thing ? Can you prove this,
> in respect of lockless lookups, and things like NFQUEUE ? 

Its used for cleaning up conntracks not in the hash table yet on
module unload (or manual flush). It is supposed to be write-only
during regular operation.

> Each conntrack object has two list anchors. One for IP_CT_DIR_ORIGINAL,
> one for IP_CT_DIR_REPLY.
> 
> Unconfirmed list use the first anchor. This means another cpu can
> definitely find an unconfirmed item in a regular hash chain, since we
> dont respect an RCU grace period before re-using an object.
> 
> If memory was not a problem, we probably would use a third anchor to
> avoid this, or regular RCU instead of SLAB_DESTROY_BY_RCU variant.

So I guess we should check the CONFIRMED bit when searching
in the hash table.
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [Patch] fix packet loss and massive ping spikes with PPP multi-link
From: Richard Hartmann @ 2010-06-01 10:20 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Ben McKeegan, netdev, linux-ppp, Alan Cox, Alexander E. Patrakov,
	linux-kernel
In-Reply-To: <20100529021624.GA2538@brick.ozlabs.ibm.com>

On Sat, May 29, 2010 at 04:16, Paul Mackerras <paulus@samba.org> wrote:


> I'll
> ack it and hopefully DaveM will pick it up.

As 2.6.35-rc1 is out, does this mean that we are looking at 2.6.36 at the
earliest? Or could this make it in as it is

a) a bug fix
b) tested in the field
c) a small change


Thanks,
Richard

^ permalink raw reply

* Re: DDoS attack causing bad effect on conntrack searches
From: Eric Dumazet @ 2010-06-01 10:31 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: Changli Gao, hawk, Jesper Dangaard Brouer, paulmck,
	Linux Kernel Network Hackers, Netfilter Developers
In-Reply-To: <4C04DE73.6050605@trash.net>

Le mardi 01 juin 2010 à 12:18 +0200, Patrick McHardy a écrit :

> If a new conntrack is created in PRE_ROUTING or LOCAL_OUT, it will be
> added to the unconfirmed list and moved to the hash as soon as the
> packet passes POST_ROUTING. This means the number of unconfirmed entries
> created by the network is bound by the number of CPUs due to BH
> processing. The number created by locally generated packets is unbound
> in case of preemptible kernels however.
> 

OK, we should have a percpu list then.

BTW, I notice nf_conntrack_untracked is incorrectly annotated
'__read_mostly'.

It can be written very often :(

Should'nt we special case it and let be really const ?




^ permalink raw reply

* Re: DDoS attack causing bad effect on conntrack searches
From: Patrick McHardy @ 2010-06-01 10:41 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Changli Gao, hawk, Jesper Dangaard Brouer, paulmck,
	Linux Kernel Network Hackers, Netfilter Developers
In-Reply-To: <1275388310.2738.2.camel@edumazet-laptop>

Eric Dumazet wrote:
> Le mardi 01 juin 2010 à 12:18 +0200, Patrick McHardy a écrit :
> 
>> If a new conntrack is created in PRE_ROUTING or LOCAL_OUT, it will be
>> added to the unconfirmed list and moved to the hash as soon as the
>> packet passes POST_ROUTING. This means the number of unconfirmed entries
>> created by the network is bound by the number of CPUs due to BH
>> processing. The number created by locally generated packets is unbound
>> in case of preemptible kernels however.
>>
> 
> OK, we should have a percpu list then.

Yes, that makes sense.

> BTW, I notice nf_conntrack_untracked is incorrectly annotated
> '__read_mostly'.
> 
> It can be written very often :(
> 
> Should'nt we special case it and let be really const ?

That would need quite a bit of special-casing to avoid touching
the reference counts. So far this is completely hidden, so I'd
say it just shouldn't be marked __read_mostly. Alternatively we
can make "untracked" a nfctinfo state.
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH 3/3] vhost: apply cpumask and cgroup to vhost workers
From: Tejun Heo @ 2010-06-01 10:56 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Oleg Nesterov, Sridhar Samudrala, netdev, lkml,
	kvm@vger.kernel.org, Andrew Morton, Dmitri Vorobiev, Jiri Kosina,
	Thomas Gleixner, Ingo Molnar, Andi Kleen
In-Reply-To: <20100601101703.GB9178@redhat.com>

Hello,

On 06/01/2010 12:17 PM, Michael S. Tsirkin wrote:
> Something that I wanted to figure out - what happens if the
> CPU mask limits us to a certain CPU that subsequently goes offline?

The thread gets unbound during the last steps of cpu offlining.

> Will e.g. flush block forever or until that CPU comes back?
> Also, does singlethreaded workqueue behave in the same way?

So, things will proceed as usual although the thread will lose its
affinity.  Singlethread wqs don't bind their workers (and they
shouldn't! :-).  MT ones explicitly manage workers according to cpu
up/down events.

Thanks.

-- 
tejun

^ permalink raw reply

* Re: [Patch] fix packet loss and massive ping spikes with PPP  multi-link
From: Ben McKeegan @ 2010-06-01 11:18 UTC (permalink / raw)
  To: Richard Hartmann
  Cc: Paul Mackerras, netdev, linux-ppp, Alan Cox,
	Alexander E. Patrakov, linux-kernel
In-Reply-To: <AANLkTil7fk_ikrSPiZnqcJSLGCrzcCrCwbeuUMY5Yu5v@mail.gmail.com>

Richard Hartmann wrote:

> As 2.6.35-rc1 is out, does this mean that we are looking at 2.6.36 at the
> earliest? Or could this make it in as it is
> 
> a) a bug fix

This isn't really a bug fix.  Its a behavioural change to work around 
poor quality/mismatched underlying PPP channels.

Regards,
Ben.

^ permalink raw reply

* Re: [Patch] fix packet loss and massive ping spikes with PPP multi-link
From: Richard Hartmann @ 2010-06-01 11:28 UTC (permalink / raw)
  To: Ben McKeegan
  Cc: Paul Mackerras, netdev, linux-ppp, Alan Cox,
	Alexander E. Patrakov, linux-kernel
In-Reply-To: <4C04ECA3.1080905@netservers.co.uk>

On Tue, Jun 1, 2010 at 13:18, Ben McKeegan <ben@netservers.co.uk> wrote:

> This isn't really a bug fix.  Its a behavioural change to work around poor
> quality/mismatched underlying PPP channels.

Maybe not a bug in the Linux kernel itself, but certainly in the real world
that exists around Linux. Similar to how a change to a device driver that
is needed to work around broken hardware is a bug fix, imo.


RIchard

^ permalink raw reply

* [PATCH 1/2] mac8390: propagate error code from request_irq
From: Finn Thain @ 2010-06-01 12:18 UTC (permalink / raw)
  To: David Miller; +Cc: geert, joe, p_gortmaker, netdev, linux-kernel, linux-m68k
In-Reply-To: <20100531.060225.68146363.davem@davemloft.net>


Use the request_irq() error code as the return value for mac8390_open(). 
EAGAIN doesn't make sense for Nubus slot IRQs. Only this driver can claim 
this IRQ (until the NIC is removed, which means everything is powered 
down).

Signed-off-by: Finn Thain <fthain@telegraphics.com.au>

Index: linux-2.6.34/drivers/net/mac8390.c
===================================================================
--- linux-2.6.34.orig/drivers/net/mac8390.c	2010-06-01 20:20:29.000000000 +1000
+++ linux-2.6.34/drivers/net/mac8390.c	2010-06-01 20:22:23.000000000 +1000
@@ -643,12 +643,13 @@ static int __init mac8390_initdev(struct
 
 static int mac8390_open(struct net_device *dev)
 {
+	int err;
+
 	__ei_open(dev);
-	if (request_irq(dev->irq, __ei_interrupt, 0, "8390 Ethernet", dev)) {
-		pr_info("%s: unable to get IRQ %d.\n", dev->name, dev->irq);
-		return -EAGAIN;
-	}
-	return 0;
+	err = request_irq(dev->irq, __ei_interrupt, 0, "8390 Ethernet", dev);
+	if (err)
+		pr_info("%s: unable to get IRQ %d\n", dev->name, dev->irq);
+	return err;
 }
 
 static int mac8390_close(struct net_device *dev)

^ permalink raw reply

* [PATCH 2/2] mac8390: raise error logging priority
From: Finn Thain @ 2010-06-01 12:18 UTC (permalink / raw)
  To: David Miller; +Cc: geert, joe, p_gortmaker, netdev, linux-kernel, linux-m68k
In-Reply-To: <20100531.060225.68146363.davem@davemloft.net>


Log error conditions using KERN_ERR priority.

Signed-off-by: Finn Thain <fthain@telegraphics.com.au>

---

This patch doesn't address the KERN_INFO messages conditional on ei_debug. 
I don't know how those can be improved to the satisfaction of all 
interested parties.

These two patches should be applied in sequence.

Index: linux-2.6.34/drivers/net/mac8390.c
===================================================================
--- linux-2.6.34.orig/drivers/net/mac8390.c	2010-06-01 20:22:23.000000000 +1000
+++ linux-2.6.34/drivers/net/mac8390.c	2010-06-01 20:22:25.000000000 +1000
@@ -556,7 +556,7 @@ static int __init mac8390_initdev(struct
 	case MAC8390_APPLE:
 		switch (mac8390_testio(dev->mem_start)) {
 		case ACCESS_UNKNOWN:
-			pr_info("Don't know how to access card memory!\n");
+			pr_err("Don't know how to access card memory!\n");
 			return -ENODEV;
 			break;
 
@@ -648,7 +648,7 @@ static int mac8390_open(struct net_devic
 	__ei_open(dev);
 	err = request_irq(dev->irq, __ei_interrupt, 0, "8390 Ethernet", dev);
 	if (err)
-		pr_info("%s: unable to get IRQ %d\n", dev->name, dev->irq);
+		pr_err("%s: unable to get IRQ %d\n", dev->name, dev->irq);
 	return err;
 }
 

^ permalink raw reply

* Re: [PATCH] cls_u32: use skb_copy_bits() to dereference data safely
From: jamal @ 2010-06-01 12:34 UTC (permalink / raw)
  To: Changli Gao; +Cc: David S. Miller, netdev
In-Reply-To: <1275272665-19047-1-git-send-email-xiaosuo@gmail.com>

Hi Changli,

On Mon, 2010-05-31 at 10:24 +0800, Changli Gao wrote:
> use skb_copy_bits() to dereference data safely
> 
> the original skb->data dereference isn't safe, as there isn't any skb->len or
> skb_is_nonlinear() check. 

I dont see any safety issue in current code in this respect. Do you have
a specific scenario where this would be unsafe? We inspect in 32 bit
chunks walking the packet data and stop when there is no more packet
data.

> skb_copy_bits() is used instead in this patch. And
> when the skb isn't long enough, we terminate the function u32_classify()
> immediately with -1.
> 

That could be a very interesting optimization - but i see it as _very
hard_ to do with current u32 given it has branching and different
branches would have different lengths etc. You'd have to essentially do
some math in user space as to what min length would suffice given
a specified filter and pass that to the kernel. 


cheers,
jamal



^ permalink raw reply

* Re: [Regression] [2.6.35-rc1] ssb_is_sprom_available
From: John W. Linville @ 2010-06-01 13:34 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Maciej Rutecki, linux-kernel@vger.kernel.org, linux-pci,
	linux-usb, Rafael J. Wysocki, netdev, Michael Buesch
In-Reply-To: <20100531205300.GB32708@parisc-linux.org>

On Mon, May 31, 2010 at 02:53:00PM -0600, Matthew Wilcox wrote:
> On Mon, May 31, 2010 at 09:55:20PM +0200, Maciej Rutecki wrote:
> > Last known good: 2.6.34
> > Failing kernel: 2.6.35-rc1
> > 
> > subsystem: PCI, USB(?)
> > 
> > Kernel dies during booting on message "ssb_is_sprom_available", see picture:
> > http://www.unixy.pl/maciek/download/kernel/2.6.35-rc1/gumis/DSC_0011.JPG
> 
> Um, looks like it's something to do with the Sonics Silicon Backplane,
> not PCI, nor USB.

Fix is on its way...

commit da1fdb02d9200ff28b6f3a380d21930335fe5429
Author: Christoph Fritz <chf.fritz@googlemail.com>
Date:   Fri May 28 10:45:59 2010 +0200

    ssb: fix NULL ptr deref when pcihost_wrapper is used
    
    Ethernet driver b44 does register ssb by it's pcihost_wrapper
    and doesn't set ssb_chipcommon. A check on this value
    introduced with commit d53cdbb94a52a920d5420ed64d986c3523a56743
    and ea2db495f92ad2cf3301623e60cb95b4062bc484 triggers:
    
    BUG: unable to handle kernel NULL pointer dereference at 00000010
    IP: [<c1266c36>] ssb_is_sprom_available+0x16/0x30
    
    Signed-off-by: Christoph Fritz <chf.fritz@googlemail.com>
    Signed-off-by: John W. Linville <linville@tuxdriver.com>

-- 
John W. Linville		Someday the world will need a hero, and you
linville@tuxdriver.com			might be all we have.  Be ready.

^ permalink raw reply

* Re: Kernel panic with the b44 driver in 2.6.35-rc1
From: John W. Linville @ 2010-06-01 13:33 UTC (permalink / raw)
  To: news.gmane.org; +Cc: netdev
In-Reply-To: <4C0414F3.8030409@tvcablenet.be>

On Mon, May 31, 2010 at 09:58:43PM +0200, news.gmane.org wrote:

> I have a kernel panic at startup with the b44 driver with the 2.6.35-rc1
> kernel. I have submitted a bug report (see
> https://bugzilla.kernel.org/show_bug.cgi?id=16074) but until know, I
> din't get any answer yet. Does anybody knows what's happening. I have
> tried several git-bisect run but I didn't find any conclusive result.
> The first bad-commit was different each time and if I revert it, the
> problem still occurs anyway.

This patch is already on its way to Linus -- it should fix the problem
you are seeing:

commit da1fdb02d9200ff28b6f3a380d21930335fe5429
Author: Christoph Fritz <chf.fritz@googlemail.com>
Date:   Fri May 28 10:45:59 2010 +0200

    ssb: fix NULL ptr deref when pcihost_wrapper is used
    
    Ethernet driver b44 does register ssb by it's pcihost_wrapper
    and doesn't set ssb_chipcommon. A check on this value
    introduced with commit d53cdbb94a52a920d5420ed64d986c3523a56743
    and ea2db495f92ad2cf3301623e60cb95b4062bc484 triggers:
    
    BUG: unable to handle kernel NULL pointer dereference at 00000010
    IP: [<c1266c36>] ssb_is_sprom_available+0x16/0x30
    
    Signed-off-by: Christoph Fritz <chf.fritz@googlemail.com>
    Signed-off-by: John W. Linville <linville@tuxdriver.com>

-- 
John W. Linville		Someday the world will need a hero, and you
linville@tuxdriver.com			might be all we have.  Be ready.

^ permalink raw reply

* Re: [PATCH 1/3] vhost: replace vhost_workqueue with per-vhost kthread
From: Paul E. McKenney @ 2010-06-01 14:13 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Tejun Heo, Oleg Nesterov, Sridhar Samudrala, netdev, lkml,
	kvm@vger.kernel.org, Andrew Morton, Dmitri Vorobiev, Jiri Kosina,
	Thomas Gleixner, Ingo Molnar, Andi Kleen
In-Reply-To: <20100531152221.GB2987@redhat.com>

On Mon, May 31, 2010 at 06:22:21PM +0300, Michael S. Tsirkin wrote:
> On Sun, May 30, 2010 at 10:24:01PM +0200, Tejun Heo wrote:
> > Replace vhost_workqueue with per-vhost kthread.  Other than callback
> > argument change from struct work_struct * to struct vhost_poll *,
> > there's no visible change to vhost_poll_*() interface.
> 
> I would prefer a substructure vhost_work, even just to make
> the code easier to review and compare to workqueue.c.

Either way this plays out, the rcu_dereference_check() calls will need
to be adjusted to reflect the change.

							Thanx, Paul

> > This conversion is to make each vhost use a dedicated kthread so that
> > resource control via cgroup can be applied.
> > 
> > Partially based on Sridhar Samudrala's patch.
> > 
> > Cc: Michael S. Tsirkin <mst@redhat.com>
> > Cc: Sridhar Samudrala <samudrala.sridhar@gmail.com>
> > ---
> > Okay, here is three patch series to convert vhost to use per-vhost
> > kthread, add cgroup_attach_task_current_cg() and apply it to the vhost
> > kthreads.  The conversion is mostly straight forward although flush is
> > slightly tricky.
> > 
> > The problem is that I have no idea how to test this.
> 
> It's a 3 step process:
> 
> 1. 
> Install qemu-kvm under fc13, or build recent one from source,
> get it from here:
> git://git.kernel.org/pub/scm/virt/kvm/qemu-kvm.git
> 
> 2. install guest under it:
> qemu-img create -f qcow2 disk.qcow2 100G
> Now get some image (e.g. fedora 13 in fc13.iso)
> and install guest:
> qemu-kvm -enable-kvm -m 1G -cdrom fc13.iso -drive file=disk.qcow2
> 
> 
> 3. set up networking. I usually simply do host to guest 
> on a special subnet for testing purposes:
> 
> Set up a bridge named mstbr0:
> 
> ifconfig mstbr0 down
> brctl delbr mstbr0
> brctl addbr mstbr0
> brctl setfd mstbr0 0
> ifconfig mstbr0 11.0.0.1
> 
> cat > ifup << EOF
> #!/bin/sh -x
> /sbin/ifconfig msttap0 0.0.0.0 up
> brctl addif mstbr0 msttap0
> EOF
> 
> 
> qemu-kvm -enable-kvm -m 1G -cdrom fc13.iso -drive file=disk.qcow2
>  -net nic,model=virtio,netdev=foo -netdev
> tap,id=foo,ifname=msttap0,script=/home/mst/ifup,downscript=no,vhost=on
> 
> after you set up the guest, log into it and
> ifconfig eth0 11.0.0.2
> 
> You should now be able to ping guest to host and back.
> Use something like netperf to stress the connection.
> Close qemu with kill -9 and unload module to test flushing code.
> 
> 
> 
> > Index: work/drivers/vhost/vhost.c
> > ===================================================================
> > --- work.orig/drivers/vhost/vhost.c
> > +++ work/drivers/vhost/vhost.c
> 
> ...
> 
> > @@ -125,10 +139,50 @@ static void vhost_vq_reset(struct vhost_
> >  	vq->log_ctx = NULL;
> >  }
> > 
> > +static int vhost_poller(void *data)
> > +{
> > +	struct vhost_dev *dev = data;
> > +	struct vhost_poll *poll;
> > +
> > +repeat:
> > +	set_current_state(TASK_INTERRUPTIBLE);	/* mb paired w/ kthread_stop */
> > +
> > +	if (kthread_should_stop()) {
> > +		__set_current_state(TASK_RUNNING);
> > +		return 0;
> > +	}
> > +
> > +	poll = NULL;
> > +	spin_lock(&dev->poller_lock);
> > +	if (!list_empty(&dev->poll_list)) {
> > +		poll = list_first_entry(&dev->poll_list,
> > +					struct vhost_poll, node);
> > +		list_del_init(&poll->node);
> > +	}
> > +	spin_unlock(&dev->poller_lock);
> > +
> > +	if (poll) {
> > +		__set_current_state(TASK_RUNNING);
> > +		poll->fn(poll);
> > +		smp_wmb();	/* paired with rmb in vhost_poll_flush() */
> > +		poll->done_seq = poll->queue_seq;
> > +		wake_up_all(&poll->done);
> 
> 
> This seems to add wakeups on data path, which uses spinlocks etc.
> OTOH workqueue.c adds a special barrier
> entry which only does a wakeup when needed.
> Right?
> 
> > +	} else
> > +		schedule();
> > +
> > +	goto repeat;
> > +}
> > +
> >  long vhost_dev_init(struct vhost_dev *dev,
> >  		    struct vhost_virtqueue *vqs, int nvqs)
> >  {
> > +	struct task_struct *poller;
> >  	int i;
> > +
> > +	poller = kthread_create(vhost_poller, dev, "vhost-%d", current->pid);
> > +	if (IS_ERR(poller))
> > +		return PTR_ERR(poller);
> > +
> >  	dev->vqs = vqs;
> >  	dev->nvqs = nvqs;
> >  	mutex_init(&dev->mutex);
> > @@ -136,6 +190,9 @@ long vhost_dev_init(struct vhost_dev *de
> >  	dev->log_file = NULL;
> >  	dev->memory = NULL;
> >  	dev->mm = NULL;
> > +	spin_lock_init(&dev->poller_lock);
> > +	INIT_LIST_HEAD(&dev->poll_list);
> > +	dev->poller = poller;
> > 
> >  	for (i = 0; i < dev->nvqs; ++i) {
> >  		dev->vqs[i].dev = dev;
> > @@ -143,8 +200,7 @@ long vhost_dev_init(struct vhost_dev *de
> >  		vhost_vq_reset(dev, dev->vqs + i);
> >  		if (dev->vqs[i].handle_kick)
> >  			vhost_poll_init(&dev->vqs[i].poll,
> > -					dev->vqs[i].handle_kick,
> > -					POLLIN);
> > +					dev->vqs[i].handle_kick, POLLIN, dev);
> >  	}
> >  	return 0;
> >  }
> > @@ -217,6 +273,8 @@ void vhost_dev_cleanup(struct vhost_dev
> >  	if (dev->mm)
> >  		mmput(dev->mm);
> >  	dev->mm = NULL;
> > +
> > +	kthread_stop(dev->poller);
> >  }
> > 
> >  static int log_access_ok(void __user *log_base, u64 addr, unsigned long sz)
> > @@ -1113,16 +1171,3 @@ void vhost_disable_notify(struct vhost_v
> >  		vq_err(vq, "Failed to enable notification at %p: %d\n",
> >  		       &vq->used->flags, r);
> >  }
> > -
> > -int vhost_init(void)
> > -{
> > -	vhost_workqueue = create_singlethread_workqueue("vhost");
> > -	if (!vhost_workqueue)
> > -		return -ENOMEM;
> > -	return 0;
> > -}
> > -
> > -void vhost_cleanup(void)
> > -{
> > -	destroy_workqueue(vhost_workqueue);
> 
> I note that destroy_workqueue does a flush, kthread_stop
> doesn't. Right? Sure we don't need to check nothing is in one of
> the lists? Maybe add a BUG_ON?
> 
> > -}
> > Index: work/drivers/vhost/vhost.h
> > ===================================================================
> > --- work.orig/drivers/vhost/vhost.h
> > +++ work/drivers/vhost/vhost.h
> > @@ -5,7 +5,6 @@
> >  #include <linux/vhost.h>
> >  #include <linux/mm.h>
> >  #include <linux/mutex.h>
> > -#include <linux/workqueue.h>
> >  #include <linux/poll.h>
> >  #include <linux/file.h>
> >  #include <linux/skbuff.h>
> > @@ -20,19 +19,26 @@ enum {
> >  	VHOST_NET_MAX_SG = MAX_SKB_FRAGS + 2,
> >  };
> > 
> > +struct vhost_poll;
> > +typedef void (*vhost_poll_fn_t)(struct vhost_poll *poll);
> > +
> >  /* Poll a file (eventfd or socket) */
> >  /* Note: there's nothing vhost specific about this structure. */
> >  struct vhost_poll {
> > +	vhost_poll_fn_t		  fn;
> >  	poll_table                table;
> >  	wait_queue_head_t        *wqh;
> >  	wait_queue_t              wait;
> > -	/* struct which will handle all actual work. */
> > -	struct work_struct        work;
> > +	struct list_head	  node;
> > +	wait_queue_head_t	  done;
> >  	unsigned long		  mask;
> > +	struct vhost_dev	 *dev;
> > +	int			  queue_seq;
> > +	int			  done_seq;
> >  };
> > 
> > -void vhost_poll_init(struct vhost_poll *poll, work_func_t func,
> > -		     unsigned long mask);
> > +void vhost_poll_init(struct vhost_poll *poll, vhost_poll_fn_t fn,
> > +		     unsigned long mask, struct vhost_dev *dev);
> >  void vhost_poll_start(struct vhost_poll *poll, struct file *file);
> >  void vhost_poll_stop(struct vhost_poll *poll);
> >  void vhost_poll_flush(struct vhost_poll *poll);
> > @@ -63,7 +69,7 @@ struct vhost_virtqueue {
> >  	struct vhost_poll poll;
> > 
> >  	/* The routine to call when the Guest pings us, or timeout. */
> > -	work_func_t handle_kick;
> > +	vhost_poll_fn_t handle_kick;
> > 
> >  	/* Last available index we saw. */
> >  	u16 last_avail_idx;
> > @@ -86,11 +92,11 @@ struct vhost_virtqueue {
> >  	struct iovec hdr[VHOST_NET_MAX_SG];
> >  	size_t hdr_size;
> >  	/* We use a kind of RCU to access private pointer.
> > -	 * All readers access it from workqueue, which makes it possible to
> > -	 * flush the workqueue instead of synchronize_rcu. Therefore readers do
> > +	 * All readers access it from poller, which makes it possible to
> > +	 * flush the vhost_poll instead of synchronize_rcu. Therefore readers do
> >  	 * not need to call rcu_read_lock/rcu_read_unlock: the beginning of
> > -	 * work item execution acts instead of rcu_read_lock() and the end of
> > -	 * work item execution acts instead of rcu_read_lock().
> > +	 * vhost_poll execution acts instead of rcu_read_lock() and the end of
> > +	 * vhost_poll execution acts instead of rcu_read_lock().
> >  	 * Writers use virtqueue mutex. */
> >  	void *private_data;
> >  	/* Log write descriptors */
> > @@ -110,6 +116,9 @@ struct vhost_dev {
> >  	int nvqs;
> >  	struct file *log_file;
> >  	struct eventfd_ctx *log_ctx;
> > +	spinlock_t poller_lock;
> > +	struct list_head poll_list;
> > +	struct task_struct *poller;
> >  };
> > 
> >  long vhost_dev_init(struct vhost_dev *, struct vhost_virtqueue *vqs, int nvqs);
> > @@ -136,9 +145,6 @@ bool vhost_enable_notify(struct vhost_vi
> >  int vhost_log_write(struct vhost_virtqueue *vq, struct vhost_log *log,
> >  		    unsigned int log_num, u64 len);
> > 
> > -int vhost_init(void);
> > -void vhost_cleanup(void);
> > -
> >  #define vq_err(vq, fmt, ...) do {                                  \
> >  		pr_debug(pr_fmt(fmt), ##__VA_ARGS__);       \
> >  		if ((vq)->error_ctx)                               \
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply

* Re: Fore 200 firmware
From: Chas Williams (CONTRACTOR) @ 2010-06-01 14:37 UTC (permalink / raw)
  To: Meelis Roos; +Cc: netdev
In-Reply-To: <alpine.SOC.1.00.1005311058180.2805@math.ut.ee>

In message <alpine.SOC.1.00.1005311058180.2805@math.ut.ee>,Meelis Roos writes:
>So the current state is that drivers/atm contains the C dump of fore 200 
>firmware in drivers/atm/fore200e_pca_fw.c but this firmware is not built 
>and not installed by make install. It's also not present in 
>linux-firware tree/packages so the fore driver can not load its firmware 
>from anywhere. I guess this is not the intended outcome.

this firmware is in the linux-atm source tree now.  the 2.5.1 release
should have the firmware.

http://sourceforge.net/projects/linux-atm/files/

^ permalink raw reply

* Re: [PATCH] greth: Fix build after OF device conversions.
From: Grant Likely @ 2010-06-01 14:46 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20100601.001059.241461551.davem@davemloft.net>

On Tue, Jun 1, 2010 at 1:10 AM, David Miller <davem@davemloft.net> wrote:
>
> Signed-off-by: David S. Miller <davem@davemloft.net>
> ---
>
> I've pushed this to net-2.6

Thanks David.

g.

>
>  drivers/net/greth.c |   11 +++++------
>  1 files changed, 5 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/net/greth.c b/drivers/net/greth.c
> index f37a4c1..3a029d0 100644
> --- a/drivers/net/greth.c
> +++ b/drivers/net/greth.c
> @@ -1607,14 +1607,13 @@ static struct of_device_id greth_of_match[] = {
>  MODULE_DEVICE_TABLE(of, greth_of_match);
>
>  static struct of_platform_driver greth_of_driver = {
> -       .name = "grlib-greth",
> -       .match_table = greth_of_match,
> +       .driver = {
> +               .name = "grlib-greth",
> +               .owner = THIS_MODULE,
> +               .of_match_table = greth_of_match,
> +       },
>        .probe = greth_of_probe,
>        .remove = __devexit_p(greth_of_remove),
> -       .driver = {
> -                  .owner = THIS_MODULE,
> -                  .name = "grlib-greth",
> -                  },
>  };
>
>  static int __init greth_init(void)
> --
> 1.7.0.4
>
>



-- 
Grant Likely, B.Sc., P.Eng.
Secret Lab Technologies Ltd.

^ permalink raw reply

* Re: Fore 200 firmware
From: Meelis Roos @ 2010-06-01 14:52 UTC (permalink / raw)
  To: chas3; +Cc: netdev
In-Reply-To: <201006011437.o51Ebp6f014332@thirdoffive.cmf.nrl.navy.mil>

> >So the current state is that drivers/atm contains the C dump of fore 200 
> >firmware in drivers/atm/fore200e_pca_fw.c but this firmware is not built 
> >and not installed by make install. It's also not present in 
> >linux-firware tree/packages so the fore driver can not load its firmware 
> >from anywhere. I guess this is not the intended outcome.
> 
> this firmware is in the linux-atm source tree now.  the 2.5.1 release
> should have the firmware.

2.5.1 does not seem to contain the firmware.

Even if it will be there in the next release, we should still do 
something to distribute it to the user so that any modern distro has the 
firmware available (in linux-firmware or via similar means) so the user 
does not have to redo the detective work I am doing. Packaging it along 
with linux-atm packages (atm-tools in Debian) does not seem correct, so 
do we need to make a special firmware package out of it or just add to 
some other firmware repo that is packaged already?

What's the licence of the firmware anyway - unknown likely?

And we can probably remove drivers/atm/fore200e_pca_fw.c if we 
distribute the firmware out of kernel tree.

-- 
Meelis Roos (mroos@linux.ee)

^ permalink raw reply

* net: replace hooks in __netif_receive_skb (v4)
From: Stephen Hemminger @ 2010-06-01 15:28 UTC (permalink / raw)
  To: Jiri Pirko, davem; +Cc: netdev, kaber, eric.dumazet
In-Reply-To: <20100528073345.GD2823@psychotron.redhat.com>


From: Jiri Pirko <jpirko@redhat.com>

What this patch does is it removes two receive frame hooks (for bridge and for
macvlan) from __netif_receive_skb. These are replaced them with a single
hook for both. It only supports one hook per device because it makes no
sense to do bridging and macvlan on the same device.

Then a network driver (of virtual netdev like macvlan or bridge) can register
an rx_handler for needed net device.

Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
---
v3->v4 
       - only one hook per device.
v2->v3
	- used GPL-ed exports
v1->v2
	- writers are locked by rtnl_lock (removed unnecessary spinlock)
	- using call_rcu in unregister
	- nicer macvlan_port_create cleanup
	- struct rx_hanler is made const in funtion parameters

 drivers/net/macvlan.c      |   19 ++++---
 include/linux/if_bridge.h  |    2 
 include/linux/if_macvlan.h |    4 -
 include/linux/netdevice.h  |    8 +++
 net/bridge/br.c            |    2 
 net/bridge/br_if.c         |    8 +++
 net/bridge/br_input.c      |   12 +++-
 net/bridge/br_private.h    |    3 -
 net/core/dev.c             |  119 ++++++++++++++++++++-------------------------
 9 files changed, 94 insertions(+), 83 deletions(-)

--- a/drivers/net/macvlan.c	2010-05-28 08:41:38.248169422 -0700
+++ b/drivers/net/macvlan.c	2010-06-01 08:16:52.307206412 -0700
@@ -145,15 +145,16 @@ static void macvlan_broadcast(struct sk_
 }
 
 /* called under rcu_read_lock() from netif_receive_skb */
-static struct sk_buff *macvlan_handle_frame(struct macvlan_port *port,
-					    struct sk_buff *skb)
+static struct sk_buff *macvlan_handle_frame(struct sk_buff *skb)
 {
+	struct macvlan_port *port;
 	const struct ethhdr *eth = eth_hdr(skb);
 	const struct macvlan_dev *vlan;
 	const struct macvlan_dev *src;
 	struct net_device *dev;
 	unsigned int len;
 
+	port = rcu_dereference(skb->dev->macvlan_port);
 	if (is_multicast_ether_addr(eth->h_dest)) {
 		src = macvlan_hash_lookup(port, eth->h_source);
 		if (!src)
@@ -515,6 +516,7 @@ static int macvlan_port_create(struct ne
 {
 	struct macvlan_port *port;
 	unsigned int i;
+	int err;
 
 	if (dev->type != ARPHRD_ETHER || dev->flags & IFF_LOOPBACK)
 		return -EINVAL;
@@ -528,13 +530,21 @@ static int macvlan_port_create(struct ne
 	for (i = 0; i < MACVLAN_HASH_SIZE; i++)
 		INIT_HLIST_HEAD(&port->vlan_hash[i]);
 	rcu_assign_pointer(dev->macvlan_port, port);
-	return 0;
+
+	err = netdev_rx_handler_register(dev, macvlan_handle_frame);
+	if (err) {
+		dev->macvlan_port = NULL;
+		kfree(port);
+	}
+
+	return err;
 }
 
 static void macvlan_port_destroy(struct net_device *dev)
 {
 	struct macvlan_port *port = dev->macvlan_port;
 
+	netdev_rx_handler_unregister(dev);
 	rcu_assign_pointer(dev->macvlan_port, NULL);
 	synchronize_rcu();
 	kfree(port);
@@ -767,14 +777,12 @@ static int __init macvlan_init_module(vo
 	int err;
 
 	register_netdevice_notifier(&macvlan_notifier_block);
-	macvlan_handle_frame_hook = macvlan_handle_frame;
 
 	err = macvlan_link_register(&macvlan_link_ops);
 	if (err < 0)
 		goto err1;
 	return 0;
 err1:
-	macvlan_handle_frame_hook = NULL;
 	unregister_netdevice_notifier(&macvlan_notifier_block);
 	return err;
 }
@@ -782,7 +790,6 @@ err1:
 static void __exit macvlan_cleanup_module(void)
 {
 	rtnl_link_unregister(&macvlan_link_ops);
-	macvlan_handle_frame_hook = NULL;
 	unregister_netdevice_notifier(&macvlan_notifier_block);
 }
 
--- a/include/linux/if_bridge.h	2010-05-28 08:35:11.628190230 -0700
+++ b/include/linux/if_bridge.h	2010-06-01 08:02:39.859736930 -0700
@@ -102,8 +102,6 @@ struct __fdb_entry {
 #include <linux/netdevice.h>
 
 extern void brioctl_set(int (*ioctl_hook)(struct net *, unsigned int, void __user *));
-extern struct sk_buff *(*br_handle_frame_hook)(struct net_bridge_port *p,
-					       struct sk_buff *skb);
 extern int (*br_should_route_hook)(struct sk_buff *skb);
 
 #endif
--- a/include/linux/if_macvlan.h	2010-05-28 08:35:11.628190230 -0700
+++ b/include/linux/if_macvlan.h	2010-06-01 08:02:39.869738723 -0700
@@ -84,8 +84,4 @@ extern int macvlan_link_register(struct 
 extern netdev_tx_t macvlan_start_xmit(struct sk_buff *skb,
 				      struct net_device *dev);
 
-
-extern struct sk_buff *(*macvlan_handle_frame_hook)(struct macvlan_port *,
-						    struct sk_buff *);
-
 #endif /* _LINUX_IF_MACVLAN_H */
--- a/include/linux/netdevice.h	2010-05-28 08:41:38.628169375 -0700
+++ b/include/linux/netdevice.h	2010-06-01 08:12:59.680035275 -0700
@@ -254,6 +254,7 @@ struct netdev_hw_addr_list {
 #define netdev_for_each_mc_addr(ha, dev) \
 	netdev_hw_addr_list_for_each(ha, &(dev)->mc)
 
+
 struct hh_cache {
 	struct hh_cache *hh_next;	/* Next entry			     */
 	atomic_t	hh_refcnt;	/* number of users                   */
@@ -381,6 +382,8 @@ enum gro_result {
 };
 typedef enum gro_result gro_result_t;
 
+typedef struct sk_buff *rx_callback_func_t(struct sk_buff *skb);
+
 extern void __napi_schedule(struct napi_struct *n);
 
 static inline int napi_disable_pending(struct napi_struct *n)
@@ -957,6 +960,7 @@ struct net_device {
 #endif
 
 	struct netdev_queue	rx_queue;
+	rx_callback_func_t	*rx_handler;
 
 	struct netdev_queue	*_tx ____cacheline_aligned_in_smp;
 
@@ -1693,6 +1697,10 @@ static inline void napi_free_frags(struc
 	napi->skb = NULL;
 }
 
+extern int netdev_rx_handler_register(struct net_device *dev,
+				      rx_callback_func_t *hook);
+extern void netdev_rx_handler_unregister(struct net_device *dev);
+
 extern void		netif_nit_deliver(struct sk_buff *skb);
 extern int		dev_valid_name(const char *name);
 extern int		dev_ioctl(struct net *net, unsigned int cmd, void __user *);
--- a/net/bridge/br.c	2010-05-28 08:35:11.858189945 -0700
+++ b/net/bridge/br.c	2010-06-01 08:02:39.889741050 -0700
@@ -63,7 +63,6 @@ static int __init br_init(void)
 		goto err_out4;
 
 	brioctl_set(br_ioctl_deviceless_stub);
-	br_handle_frame_hook = br_handle_frame;
 
 #if defined(CONFIG_ATM_LANE) || defined(CONFIG_ATM_LANE_MODULE)
 	br_fdb_test_addr_hook = br_fdb_test_addr;
@@ -100,7 +99,6 @@ static void __exit br_deinit(void)
 	br_fdb_test_addr_hook = NULL;
 #endif
 
-	br_handle_frame_hook = NULL;
 	br_fdb_fini();
 }
 
--- a/net/bridge/br_if.c	2010-05-28 08:35:11.858189945 -0700
+++ b/net/bridge/br_if.c	2010-06-01 08:14:30.521680943 -0700
@@ -147,6 +147,7 @@ static void del_nbp(struct net_bridge_po
 
 	list_del_rcu(&p->list);
 
+	netdev_rx_handler_unregister(dev);
 	rcu_assign_pointer(dev->br_port, NULL);
 
 	br_multicast_del_port(p);
@@ -429,6 +430,11 @@ int br_add_if(struct net_bridge *br, str
 		goto err2;
 
 	rcu_assign_pointer(dev->br_port, p);
+
+	err = netdev_rx_handler_register(dev, br_handle_frame);
+	if (err)
+		goto err3;
+
 	dev_disable_lro(dev);
 
 	list_add_rcu(&p->list, &br->port_list);
@@ -451,6 +457,8 @@ int br_add_if(struct net_bridge *br, str
 	br_netpoll_enable(br, dev);
 
 	return 0;
+err3:
+	dev->br_port = NULL;
 err2:
 	br_fdb_delete_by_port(br, p, 1);
 err1:
--- a/net/bridge/br_input.c	2010-05-28 08:35:11.858189945 -0700
+++ b/net/bridge/br_input.c	2010-06-01 08:02:39.909740447 -0700
@@ -131,15 +131,19 @@ static inline int is_link_local(const un
 }
 
 /*
- * Called via br_handle_frame_hook.
  * Return NULL if skb is handled
- * note: already called with rcu_read_lock (preempt_disabled)
+ * note: already called with rcu_read_lock (preempt_disabled) from
+ * netif_receive_skb
  */
-struct sk_buff *br_handle_frame(struct net_bridge_port *p, struct sk_buff *skb)
+struct sk_buff *br_handle_frame(struct sk_buff *skb)
 {
+	struct net_bridge_port *p;
 	const unsigned char *dest = eth_hdr(skb)->h_dest;
 	int (*rhook)(struct sk_buff *skb);
 
+	if (skb->pkt_type == PACKET_LOOPBACK)
+		return skb;
+
 	if (!is_valid_ether_addr(eth_hdr(skb)->h_source))
 		goto drop;
 
@@ -147,6 +151,8 @@ struct sk_buff *br_handle_frame(struct n
 	if (!skb)
 		return NULL;
 
+	p = rcu_dereference(skb->dev->br_port);
+
 	if (unlikely(is_link_local(dest))) {
 		/* Pause frames shouldn't be passed up by driver anyway */
 		if (skb->protocol == htons(ETH_P_PAUSE))
--- a/net/bridge/br_private.h	2010-05-28 08:35:11.858189945 -0700
+++ b/net/bridge/br_private.h	2010-06-01 08:02:39.919725220 -0700
@@ -331,8 +331,7 @@ extern void br_features_recompute(struct
 
 /* br_input.c */
 extern int br_handle_frame_finish(struct sk_buff *skb);
-extern struct sk_buff *br_handle_frame(struct net_bridge_port *p,
-				       struct sk_buff *skb);
+extern struct sk_buff *br_handle_frame(struct sk_buff *skb);
 
 /* br_ioctl.c */
 extern int br_dev_ioctl(struct net_device *dev, struct ifreq *rq, int cmd);
--- a/net/core/dev.c	2010-05-28 08:41:38.678169590 -0700
+++ b/net/core/dev.c	2010-06-01 08:16:38.563051750 -0700
@@ -2581,70 +2581,14 @@ static inline int deliver_skb(struct sk_
 	return pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
 }
 
-#if defined(CONFIG_BRIDGE) || defined (CONFIG_BRIDGE_MODULE)
-
-#if defined(CONFIG_ATM_LANE) || defined(CONFIG_ATM_LANE_MODULE)
+#if (defined(CONFIG_BRIDGE) || defined(CONFIG_BRIDGE_MODULE)) && \
+    (defined(CONFIG_ATM_LANE) || defined(CONFIG_ATM_LANE_MODULE))
 /* This hook is defined here for ATM LANE */
 int (*br_fdb_test_addr_hook)(struct net_device *dev,
 			     unsigned char *addr) __read_mostly;
 EXPORT_SYMBOL_GPL(br_fdb_test_addr_hook);
 #endif
 
-/*
- * If bridge module is loaded call bridging hook.
- *  returns NULL if packet was consumed.
- */
-struct sk_buff *(*br_handle_frame_hook)(struct net_bridge_port *p,
-					struct sk_buff *skb) __read_mostly;
-EXPORT_SYMBOL_GPL(br_handle_frame_hook);
-
-static inline struct sk_buff *handle_bridge(struct sk_buff *skb,
-					    struct packet_type **pt_prev, int *ret,
-					    struct net_device *orig_dev)
-{
-	struct net_bridge_port *port;
-
-	if (skb->pkt_type == PACKET_LOOPBACK ||
-	    (port = rcu_dereference(skb->dev->br_port)) == NULL)
-		return skb;
-
-	if (*pt_prev) {
-		*ret = deliver_skb(skb, *pt_prev, orig_dev);
-		*pt_prev = NULL;
-	}
-
-	return br_handle_frame_hook(port, skb);
-}
-#else
-#define handle_bridge(skb, pt_prev, ret, orig_dev)	(skb)
-#endif
-
-#if defined(CONFIG_MACVLAN) || defined(CONFIG_MACVLAN_MODULE)
-struct sk_buff *(*macvlan_handle_frame_hook)(struct macvlan_port *p,
-					     struct sk_buff *skb) __read_mostly;
-EXPORT_SYMBOL_GPL(macvlan_handle_frame_hook);
-
-static inline struct sk_buff *handle_macvlan(struct sk_buff *skb,
-					     struct packet_type **pt_prev,
-					     int *ret,
-					     struct net_device *orig_dev)
-{
-	struct macvlan_port *port;
-
-	port = rcu_dereference(skb->dev->macvlan_port);
-	if (!port)
-		return skb;
-
-	if (*pt_prev) {
-		*ret = deliver_skb(skb, *pt_prev, orig_dev);
-		*pt_prev = NULL;
-	}
-	return macvlan_handle_frame_hook(port, skb);
-}
-#else
-#define handle_macvlan(skb, pt_prev, ret, orig_dev)	(skb)
-#endif
-
 #ifdef CONFIG_NET_CLS_ACT
 /* TODO: Maybe we should just force sch_ingress to be compiled in
  * when CONFIG_NET_CLS_ACT is? otherwise some useless instructions
@@ -2740,6 +2684,48 @@ void netif_nit_deliver(struct sk_buff *s
 	rcu_read_unlock();
 }
 
+/**
+ *	netdev_rx_handler_register - register receive handler
+ *	@dev: device to register a handler for
+ *	@rh: receive handler to register
+ *
+ *	Register a receive hander for a device. This handler will then be
+ *	called from __netif_receive_skb. A negative errno code is returned
+ *	on a failure.
+ *
+ *	The caller must hold the rtnl_mutex.
+ */
+int netdev_rx_handler_register(struct net_device *dev,
+			       rx_callback_func_t *hook)
+{
+	ASSERT_RTNL();
+
+	if (dev->rx_handler)
+		return -EBUSY;
+
+	rcu_assign_pointer(dev->rx_handler, hook);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(netdev_rx_handler_register);
+
+/**
+ *	netdev_rx_handler_unregister - unregister receive handler
+ *	@dev: device to unregister a handler from
+ *	@rh: receive handler to unregister
+ *
+ *	Unregister a receive hander from a device.
+ *
+ *	The caller must hold the rtnl_mutex.
+ */
+void netdev_rx_handler_unregister(struct net_device *dev)
+{
+
+	ASSERT_RTNL();
+	dev->rx_handler = NULL;
+}
+EXPORT_SYMBOL_GPL(netdev_rx_handler_unregister);
+
 static inline void skb_bond_set_mac_by_master(struct sk_buff *skb,
 					      struct net_device *master)
 {
@@ -2792,6 +2778,7 @@ EXPORT_SYMBOL(__skb_bond_should_drop);
 static int __netif_receive_skb(struct sk_buff *skb)
 {
 	struct packet_type *ptype, *pt_prev;
+	rx_callback_func_t *rh;
 	struct net_device *orig_dev;
 	struct net_device *master;
 	struct net_device *null_or_orig;
@@ -2855,12 +2842,16 @@ static int __netif_receive_skb(struct sk
 ncls:
 #endif
 
-	skb = handle_bridge(skb, &pt_prev, &ret, orig_dev);
-	if (!skb)
-		goto out;
-	skb = handle_macvlan(skb, &pt_prev, &ret, orig_dev);
-	if (!skb)
-		goto out;
+	/* Handle special case of bridge or macvlan */
+	if ((rh = rcu_dereference(skb->dev->rx_handler)) != NULL) {
+		if (pt_prev) {
+			ret = deliver_skb(skb, pt_prev, orig_dev);
+			pt_prev = NULL;
+		}
+		skb = rh(skb);
+		if (!skb)
+			goto out;
+	}
 
 	/*
 	 * Make sure frames received on VLAN interfaces stacked on

^ permalink raw reply

* Re: net: replace hooks in __netif_receive_skb (v4)
From: Jiri Pirko @ 2010-06-01 15:41 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: davem, netdev, kaber, eric.dumazet
In-Reply-To: <20100601082805.1c84b16d@nehalam>

Actually, I'm not happy about this (reducing to only one hook) and for two
reasons:

1) I think it's a good behaviour to "mask" one handler by another in case device
is for example used for macvlan and then added to bridge. Because when it's
again removed from the bridge, the original functionality is restored. And also,
this would be consistent with the current behaviour.

2) I would imagine a situation, when multiple handers are needed in cascade.
Actually I'm working on a virtual device draft which uses two handlers, although
in corner situation.

Regards,
	Jirka

Tue, Jun 01, 2010 at 05:28:05PM CEST, shemminger@vyatta.com wrote:
>
>From: Jiri Pirko <jpirko@redhat.com>
>
>What this patch does is it removes two receive frame hooks (for bridge and for
>macvlan) from __netif_receive_skb. These are replaced them with a single
>hook for both. It only supports one hook per device because it makes no
>sense to do bridging and macvlan on the same device.
>
>Then a network driver (of virtual netdev like macvlan or bridge) can register
>an rx_handler for needed net device.
>
>Signed-off-by: Jiri Pirko <jpirko@redhat.com>
>Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
>---
>v3->v4 
>       - only one hook per device.
>v2->v3
>	- used GPL-ed exports
>v1->v2
>	- writers are locked by rtnl_lock (removed unnecessary spinlock)
>	- using call_rcu in unregister
>	- nicer macvlan_port_create cleanup
>	- struct rx_hanler is made const in funtion parameters
>
> drivers/net/macvlan.c      |   19 ++++---
> include/linux/if_bridge.h  |    2 
> include/linux/if_macvlan.h |    4 -
> include/linux/netdevice.h  |    8 +++
> net/bridge/br.c            |    2 
> net/bridge/br_if.c         |    8 +++
> net/bridge/br_input.c      |   12 +++-
> net/bridge/br_private.h    |    3 -
> net/core/dev.c             |  119 ++++++++++++++++++++-------------------------
> 9 files changed, 94 insertions(+), 83 deletions(-)
>
>--- a/drivers/net/macvlan.c	2010-05-28 08:41:38.248169422 -0700
>+++ b/drivers/net/macvlan.c	2010-06-01 08:16:52.307206412 -0700
>@@ -145,15 +145,16 @@ static void macvlan_broadcast(struct sk_
> }
> 
> /* called under rcu_read_lock() from netif_receive_skb */
>-static struct sk_buff *macvlan_handle_frame(struct macvlan_port *port,
>-					    struct sk_buff *skb)
>+static struct sk_buff *macvlan_handle_frame(struct sk_buff *skb)
> {
>+	struct macvlan_port *port;
> 	const struct ethhdr *eth = eth_hdr(skb);
> 	const struct macvlan_dev *vlan;
> 	const struct macvlan_dev *src;
> 	struct net_device *dev;
> 	unsigned int len;
> 
>+	port = rcu_dereference(skb->dev->macvlan_port);
> 	if (is_multicast_ether_addr(eth->h_dest)) {
> 		src = macvlan_hash_lookup(port, eth->h_source);
> 		if (!src)
>@@ -515,6 +516,7 @@ static int macvlan_port_create(struct ne
> {
> 	struct macvlan_port *port;
> 	unsigned int i;
>+	int err;
> 
> 	if (dev->type != ARPHRD_ETHER || dev->flags & IFF_LOOPBACK)
> 		return -EINVAL;
>@@ -528,13 +530,21 @@ static int macvlan_port_create(struct ne
> 	for (i = 0; i < MACVLAN_HASH_SIZE; i++)
> 		INIT_HLIST_HEAD(&port->vlan_hash[i]);
> 	rcu_assign_pointer(dev->macvlan_port, port);
>-	return 0;
>+
>+	err = netdev_rx_handler_register(dev, macvlan_handle_frame);
>+	if (err) {
>+		dev->macvlan_port = NULL;
>+		kfree(port);
>+	}
>+
>+	return err;
> }
> 
> static void macvlan_port_destroy(struct net_device *dev)
> {
> 	struct macvlan_port *port = dev->macvlan_port;
> 
>+	netdev_rx_handler_unregister(dev);
> 	rcu_assign_pointer(dev->macvlan_port, NULL);
> 	synchronize_rcu();
> 	kfree(port);
>@@ -767,14 +777,12 @@ static int __init macvlan_init_module(vo
> 	int err;
> 
> 	register_netdevice_notifier(&macvlan_notifier_block);
>-	macvlan_handle_frame_hook = macvlan_handle_frame;
> 
> 	err = macvlan_link_register(&macvlan_link_ops);
> 	if (err < 0)
> 		goto err1;
> 	return 0;
> err1:
>-	macvlan_handle_frame_hook = NULL;
> 	unregister_netdevice_notifier(&macvlan_notifier_block);
> 	return err;
> }
>@@ -782,7 +790,6 @@ err1:
> static void __exit macvlan_cleanup_module(void)
> {
> 	rtnl_link_unregister(&macvlan_link_ops);
>-	macvlan_handle_frame_hook = NULL;
> 	unregister_netdevice_notifier(&macvlan_notifier_block);
> }
> 
>--- a/include/linux/if_bridge.h	2010-05-28 08:35:11.628190230 -0700
>+++ b/include/linux/if_bridge.h	2010-06-01 08:02:39.859736930 -0700
>@@ -102,8 +102,6 @@ struct __fdb_entry {
> #include <linux/netdevice.h>
> 
> extern void brioctl_set(int (*ioctl_hook)(struct net *, unsigned int, void __user *));
>-extern struct sk_buff *(*br_handle_frame_hook)(struct net_bridge_port *p,
>-					       struct sk_buff *skb);
> extern int (*br_should_route_hook)(struct sk_buff *skb);
> 
> #endif
>--- a/include/linux/if_macvlan.h	2010-05-28 08:35:11.628190230 -0700
>+++ b/include/linux/if_macvlan.h	2010-06-01 08:02:39.869738723 -0700
>@@ -84,8 +84,4 @@ extern int macvlan_link_register(struct 
> extern netdev_tx_t macvlan_start_xmit(struct sk_buff *skb,
> 				      struct net_device *dev);
> 
>-
>-extern struct sk_buff *(*macvlan_handle_frame_hook)(struct macvlan_port *,
>-						    struct sk_buff *);
>-
> #endif /* _LINUX_IF_MACVLAN_H */
>--- a/include/linux/netdevice.h	2010-05-28 08:41:38.628169375 -0700
>+++ b/include/linux/netdevice.h	2010-06-01 08:12:59.680035275 -0700
>@@ -254,6 +254,7 @@ struct netdev_hw_addr_list {
> #define netdev_for_each_mc_addr(ha, dev) \
> 	netdev_hw_addr_list_for_each(ha, &(dev)->mc)
> 
>+
> struct hh_cache {
> 	struct hh_cache *hh_next;	/* Next entry			     */
> 	atomic_t	hh_refcnt;	/* number of users                   */
>@@ -381,6 +382,8 @@ enum gro_result {
> };
> typedef enum gro_result gro_result_t;
> 
>+typedef struct sk_buff *rx_callback_func_t(struct sk_buff *skb);
>+
> extern void __napi_schedule(struct napi_struct *n);
> 
> static inline int napi_disable_pending(struct napi_struct *n)
>@@ -957,6 +960,7 @@ struct net_device {
> #endif
> 
> 	struct netdev_queue	rx_queue;
>+	rx_callback_func_t	*rx_handler;
> 
> 	struct netdev_queue	*_tx ____cacheline_aligned_in_smp;
> 
>@@ -1693,6 +1697,10 @@ static inline void napi_free_frags(struc
> 	napi->skb = NULL;
> }
> 
>+extern int netdev_rx_handler_register(struct net_device *dev,
>+				      rx_callback_func_t *hook);
>+extern void netdev_rx_handler_unregister(struct net_device *dev);
>+
> extern void		netif_nit_deliver(struct sk_buff *skb);
> extern int		dev_valid_name(const char *name);
> extern int		dev_ioctl(struct net *net, unsigned int cmd, void __user *);
>--- a/net/bridge/br.c	2010-05-28 08:35:11.858189945 -0700
>+++ b/net/bridge/br.c	2010-06-01 08:02:39.889741050 -0700
>@@ -63,7 +63,6 @@ static int __init br_init(void)
> 		goto err_out4;
> 
> 	brioctl_set(br_ioctl_deviceless_stub);
>-	br_handle_frame_hook = br_handle_frame;
> 
> #if defined(CONFIG_ATM_LANE) || defined(CONFIG_ATM_LANE_MODULE)
> 	br_fdb_test_addr_hook = br_fdb_test_addr;
>@@ -100,7 +99,6 @@ static void __exit br_deinit(void)
> 	br_fdb_test_addr_hook = NULL;
> #endif
> 
>-	br_handle_frame_hook = NULL;
> 	br_fdb_fini();
> }
> 
>--- a/net/bridge/br_if.c	2010-05-28 08:35:11.858189945 -0700
>+++ b/net/bridge/br_if.c	2010-06-01 08:14:30.521680943 -0700
>@@ -147,6 +147,7 @@ static void del_nbp(struct net_bridge_po
> 
> 	list_del_rcu(&p->list);
> 
>+	netdev_rx_handler_unregister(dev);
> 	rcu_assign_pointer(dev->br_port, NULL);
> 
> 	br_multicast_del_port(p);
>@@ -429,6 +430,11 @@ int br_add_if(struct net_bridge *br, str
> 		goto err2;
> 
> 	rcu_assign_pointer(dev->br_port, p);
>+
>+	err = netdev_rx_handler_register(dev, br_handle_frame);
>+	if (err)
>+		goto err3;
>+
> 	dev_disable_lro(dev);
> 
> 	list_add_rcu(&p->list, &br->port_list);
>@@ -451,6 +457,8 @@ int br_add_if(struct net_bridge *br, str
> 	br_netpoll_enable(br, dev);
> 
> 	return 0;
>+err3:
>+	dev->br_port = NULL;
> err2:
> 	br_fdb_delete_by_port(br, p, 1);
> err1:
>--- a/net/bridge/br_input.c	2010-05-28 08:35:11.858189945 -0700
>+++ b/net/bridge/br_input.c	2010-06-01 08:02:39.909740447 -0700
>@@ -131,15 +131,19 @@ static inline int is_link_local(const un
> }
> 
> /*
>- * Called via br_handle_frame_hook.
>  * Return NULL if skb is handled
>- * note: already called with rcu_read_lock (preempt_disabled)
>+ * note: already called with rcu_read_lock (preempt_disabled) from
>+ * netif_receive_skb
>  */
>-struct sk_buff *br_handle_frame(struct net_bridge_port *p, struct sk_buff *skb)
>+struct sk_buff *br_handle_frame(struct sk_buff *skb)
> {
>+	struct net_bridge_port *p;
> 	const unsigned char *dest = eth_hdr(skb)->h_dest;
> 	int (*rhook)(struct sk_buff *skb);
> 
>+	if (skb->pkt_type == PACKET_LOOPBACK)
>+		return skb;
>+
> 	if (!is_valid_ether_addr(eth_hdr(skb)->h_source))
> 		goto drop;
> 
>@@ -147,6 +151,8 @@ struct sk_buff *br_handle_frame(struct n
> 	if (!skb)
> 		return NULL;
> 
>+	p = rcu_dereference(skb->dev->br_port);
>+
> 	if (unlikely(is_link_local(dest))) {
> 		/* Pause frames shouldn't be passed up by driver anyway */
> 		if (skb->protocol == htons(ETH_P_PAUSE))
>--- a/net/bridge/br_private.h	2010-05-28 08:35:11.858189945 -0700
>+++ b/net/bridge/br_private.h	2010-06-01 08:02:39.919725220 -0700
>@@ -331,8 +331,7 @@ extern void br_features_recompute(struct
> 
> /* br_input.c */
> extern int br_handle_frame_finish(struct sk_buff *skb);
>-extern struct sk_buff *br_handle_frame(struct net_bridge_port *p,
>-				       struct sk_buff *skb);
>+extern struct sk_buff *br_handle_frame(struct sk_buff *skb);
> 
> /* br_ioctl.c */
> extern int br_dev_ioctl(struct net_device *dev, struct ifreq *rq, int cmd);
>--- a/net/core/dev.c	2010-05-28 08:41:38.678169590 -0700
>+++ b/net/core/dev.c	2010-06-01 08:16:38.563051750 -0700
>@@ -2581,70 +2581,14 @@ static inline int deliver_skb(struct sk_
> 	return pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
> }
> 
>-#if defined(CONFIG_BRIDGE) || defined (CONFIG_BRIDGE_MODULE)
>-
>-#if defined(CONFIG_ATM_LANE) || defined(CONFIG_ATM_LANE_MODULE)
>+#if (defined(CONFIG_BRIDGE) || defined(CONFIG_BRIDGE_MODULE)) && \
>+    (defined(CONFIG_ATM_LANE) || defined(CONFIG_ATM_LANE_MODULE))
> /* This hook is defined here for ATM LANE */
> int (*br_fdb_test_addr_hook)(struct net_device *dev,
> 			     unsigned char *addr) __read_mostly;
> EXPORT_SYMBOL_GPL(br_fdb_test_addr_hook);
> #endif
> 
>-/*
>- * If bridge module is loaded call bridging hook.
>- *  returns NULL if packet was consumed.
>- */
>-struct sk_buff *(*br_handle_frame_hook)(struct net_bridge_port *p,
>-					struct sk_buff *skb) __read_mostly;
>-EXPORT_SYMBOL_GPL(br_handle_frame_hook);
>-
>-static inline struct sk_buff *handle_bridge(struct sk_buff *skb,
>-					    struct packet_type **pt_prev, int *ret,
>-					    struct net_device *orig_dev)
>-{
>-	struct net_bridge_port *port;
>-
>-	if (skb->pkt_type == PACKET_LOOPBACK ||
>-	    (port = rcu_dereference(skb->dev->br_port)) == NULL)
>-		return skb;
>-
>-	if (*pt_prev) {
>-		*ret = deliver_skb(skb, *pt_prev, orig_dev);
>-		*pt_prev = NULL;
>-	}
>-
>-	return br_handle_frame_hook(port, skb);
>-}
>-#else
>-#define handle_bridge(skb, pt_prev, ret, orig_dev)	(skb)
>-#endif
>-
>-#if defined(CONFIG_MACVLAN) || defined(CONFIG_MACVLAN_MODULE)
>-struct sk_buff *(*macvlan_handle_frame_hook)(struct macvlan_port *p,
>-					     struct sk_buff *skb) __read_mostly;
>-EXPORT_SYMBOL_GPL(macvlan_handle_frame_hook);
>-
>-static inline struct sk_buff *handle_macvlan(struct sk_buff *skb,
>-					     struct packet_type **pt_prev,
>-					     int *ret,
>-					     struct net_device *orig_dev)
>-{
>-	struct macvlan_port *port;
>-
>-	port = rcu_dereference(skb->dev->macvlan_port);
>-	if (!port)
>-		return skb;
>-
>-	if (*pt_prev) {
>-		*ret = deliver_skb(skb, *pt_prev, orig_dev);
>-		*pt_prev = NULL;
>-	}
>-	return macvlan_handle_frame_hook(port, skb);
>-}
>-#else
>-#define handle_macvlan(skb, pt_prev, ret, orig_dev)	(skb)
>-#endif
>-
> #ifdef CONFIG_NET_CLS_ACT
> /* TODO: Maybe we should just force sch_ingress to be compiled in
>  * when CONFIG_NET_CLS_ACT is? otherwise some useless instructions
>@@ -2740,6 +2684,48 @@ void netif_nit_deliver(struct sk_buff *s
> 	rcu_read_unlock();
> }
> 
>+/**
>+ *	netdev_rx_handler_register - register receive handler
>+ *	@dev: device to register a handler for
>+ *	@rh: receive handler to register
>+ *
>+ *	Register a receive hander for a device. This handler will then be
>+ *	called from __netif_receive_skb. A negative errno code is returned
>+ *	on a failure.
>+ *
>+ *	The caller must hold the rtnl_mutex.
>+ */
>+int netdev_rx_handler_register(struct net_device *dev,
>+			       rx_callback_func_t *hook)
>+{
>+	ASSERT_RTNL();
>+
>+	if (dev->rx_handler)
>+		return -EBUSY;
>+
>+	rcu_assign_pointer(dev->rx_handler, hook);
>+
>+	return 0;
>+}
>+EXPORT_SYMBOL_GPL(netdev_rx_handler_register);
>+
>+/**
>+ *	netdev_rx_handler_unregister - unregister receive handler
>+ *	@dev: device to unregister a handler from
>+ *	@rh: receive handler to unregister
>+ *
>+ *	Unregister a receive hander from a device.
>+ *
>+ *	The caller must hold the rtnl_mutex.
>+ */
>+void netdev_rx_handler_unregister(struct net_device *dev)
>+{
>+
>+	ASSERT_RTNL();
>+	dev->rx_handler = NULL;
>+}
>+EXPORT_SYMBOL_GPL(netdev_rx_handler_unregister);
>+
> static inline void skb_bond_set_mac_by_master(struct sk_buff *skb,
> 					      struct net_device *master)
> {
>@@ -2792,6 +2778,7 @@ EXPORT_SYMBOL(__skb_bond_should_drop);
> static int __netif_receive_skb(struct sk_buff *skb)
> {
> 	struct packet_type *ptype, *pt_prev;
>+	rx_callback_func_t *rh;
> 	struct net_device *orig_dev;
> 	struct net_device *master;
> 	struct net_device *null_or_orig;
>@@ -2855,12 +2842,16 @@ static int __netif_receive_skb(struct sk
> ncls:
> #endif
> 
>-	skb = handle_bridge(skb, &pt_prev, &ret, orig_dev);
>-	if (!skb)
>-		goto out;
>-	skb = handle_macvlan(skb, &pt_prev, &ret, orig_dev);
>-	if (!skb)
>-		goto out;
>+	/* Handle special case of bridge or macvlan */
>+	if ((rh = rcu_dereference(skb->dev->rx_handler)) != NULL) {
>+		if (pt_prev) {
>+			ret = deliver_skb(skb, pt_prev, orig_dev);
>+			pt_prev = NULL;
>+		}
>+		skb = rh(skb);
>+		if (!skb)
>+			goto out;
>+	}
> 
> 	/*
> 	 * Make sure frames received on VLAN interfaces stacked on

^ permalink raw reply

* Re: Fore 200 firmware
From: Chas Williams (CONTRACTOR) @ 2010-06-01 15:49 UTC (permalink / raw)
  To: Meelis Roos; +Cc: netdev
In-Reply-To: <alpine.SOC.1.00.1006011747570.22519@math.ut.ee>

In message <alpine.SOC.1.00.1006011747570.22519@math.ut.ee>,Meelis Roos writes:
>2.5.1 does not seem to contain the firmware.

oops -- i didnt commit it to the repo.

>does not have to redo the detective work I am doing. Packaging it along 
>with linux-atm packages (atm-tools in Debian) does not seem correct, so 
>do we need to make a special firmware package out of it or just add to 
>some other firmware repo that is packaged already?

i dont care which happens really.  i would to think it is part of
atm-tools since the fore200e driver really isnt useful with the
atm tools.

>What's the licence of the firmware anyway - unknown likely?

it says the firmware is gpl.  look at fore200e_firmware_copyright
in one of the old kernel distros.

>And we can probably remove drivers/atm/fore200e_pca_fw.c if we 
>distribute the firmware out of kernel tree.

i dont think it is in the current net-next-2.6 tree?

^ permalink raw reply

* [PATCH] ppp: eliminate shadowed variable name
From: Stephen Hemminger @ 2010-06-01 16:05 UTC (permalink / raw)
  To: Paul Mackerras, David Miller; +Cc: netdev

Sparse complains about shadowed declaration of skb. So use other
name.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>


--- a/drivers/net/ppp_generic.c	2010-06-01 08:49:58.859739636 -0700
+++ b/drivers/net/ppp_generic.c	2010-06-01 08:50:41.184870657 -0700
@@ -1927,9 +1927,9 @@ ppp_receive_mp_frame(struct ppp *ppp, st
 	/* If the queue is getting long, don't wait any longer for packets
 	   before the start of the queue. */
 	if (skb_queue_len(&ppp->mrq) >= PPP_MP_MAX_QLEN) {
-		struct sk_buff *skb = skb_peek(&ppp->mrq);
-		if (seq_before(ppp->minseq, skb->sequence))
-			ppp->minseq = skb->sequence;
+		struct sk_buff *mskb = skb_peek(&ppp->mrq);
+		if (seq_before(ppp->minseq, mskb->sequence))
+			ppp->minseq = mskb->sequence;
 	}
 
 	/* Pull completed packets off the queue and receive them. */

^ permalink raw reply

* Re: Fore 200 firmware
From: Meelis Roos @ 2010-06-01 16:09 UTC (permalink / raw)
  To: chas3; +Cc: netdev
In-Reply-To: <201006011549.o51FnZFk028507@thirdoffive.cmf.nrl.navy.mil>

> >does not have to redo the detective work I am doing. Packaging it along 
> >with linux-atm packages (atm-tools in Debian) does not seem correct, so 
> >do we need to make a special firmware package out of it or just add to 
> >some other firmware repo that is packaged already?
> 
> i dont care which happens really.  i would to think it is part of
> atm-tools since the fore200e driver really isnt useful with the
> atm tools.

OK, so we should push the distro maintainers of it to create a suggested 
firmware package or just include the fore firmware unconditionally - 
once it is in released linux-atm.

> >What's the licence of the firmware anyway - unknown likely?
> 
> it says the firmware is gpl.  look at fore200e_firmware_copyright
> in one of the old kernel distros.

Well, there is no source to fulfill the GPL... but there also should be 
no problem redistributing it.

> >And we can probably remove drivers/atm/fore200e_pca_fw.c if we 
> >distribute the firmware out of kernel tree.
> 
> i dont think it is in the current net-next-2.6 tree?

Oops, you are right - I have updated my tree over several years and the 
old files were just not removed from disk. But theay are gone from the 
git, yes.

Thnak yo for your work on ATM - I intend to get a demo ATM network 
running here in my computer museum.

-- 
Meelis Roos (mroos@linux.ee)

^ permalink raw reply

* Re: net: replace hooks in __netif_receive_skb (v4)
From: Stephen Hemminger @ 2010-06-01 16:10 UTC (permalink / raw)
  To: Jiri Pirko; +Cc: davem, netdev, kaber, eric.dumazet
In-Reply-To: <20100601154106.GA4929@psychotron.redhat.com>

On Tue, 1 Jun 2010 17:41:07 +0200
Jiri Pirko <jpirko@redhat.com> wrote:

> Actually, I'm not happy about this (reducing to only one hook) and for two
> reasons:
> 
> 1) I think it's a good behaviour to "mask" one handler by another in case device
> is for example used for macvlan and then added to bridge. Because when it's
> again removed from the bridge, the original functionality is restored. And also,
> this would be consistent with the current behaviour.
> 
> 2) I would imagine a situation, when multiple handers are needed in cascade.
> Actually I'm working on a virtual device draft which uses two handlers, although
> in corner situation.
> 
> Regards,
> 	Jirka

I don't like it because:

1) Adding macvlan to bridge makes no sense. The bridge is already in promicious mode.

2) I don't like to see functionality added when it is not needed.

3) The extra overhead of traversing list causes more cache activity in extreme
   hot path.

Wait and add the multiple handlers when your code is included.

-- 
http://www.extremeprogramming.org/rules/early.html

^ permalink raw reply

* RE: replace hooks in __netif_receive_skb (v4)
From: Fischer, Anna @ 2010-06-01 16:13 UTC (permalink / raw)
  To: Stephen Hemminger, Jiri Pirko, davem@davemloft.net
  Cc: netdev@vger.kernel.org, kaber@trash.net, eric.dumazet@gmail.com
In-Reply-To: <20100601082805.1c84b16d@nehalam>

> Subject: net: replace hooks in __netif_receive_skb (v4)
> 
> 
> From: Jiri Pirko <jpirko@redhat.com>
> 
> What this patch does is it removes two receive frame hooks (for bridge
> and for
> macvlan) from __netif_receive_skb. These are replaced them with a
> single
> hook for both. It only supports one hook per device because it makes no
> sense to do bridging and macvlan on the same device.
> 
> Then a network driver (of virtual netdev like macvlan or bridge) can
> register
> an rx_handler for needed net device.


I think the idea of this is really good, and it has been long required to get rid of the bridging hook and the "hack" for macvlan to get into the network stack. 

However, I wonder, if this is to be used as a generic interface, then why the restriction of only having a single hook per device? Yes, it makes sense to do this for the bridge/macvlan combination, but in general there could be other cases where you would want to allow multiple receivers per device. 


> @@ -2792,6 +2778,7 @@ EXPORT_SYMBOL(__skb_bond_should_drop);
>  static int __netif_receive_skb(struct sk_buff *skb)
>  {
>  	struct packet_type *ptype, *pt_prev;
> +	rx_callback_func_t *rh;
>  	struct net_device *orig_dev;
>  	struct net_device *master;
>  	struct net_device *null_or_orig;
> @@ -2855,12 +2842,16 @@ static int __netif_receive_skb(struct sk
>  ncls:
>  #endif
> 
> -	skb = handle_bridge(skb, &pt_prev, &ret, orig_dev);
> -	if (!skb)
> -		goto out;
> -	skb = handle_macvlan(skb, &pt_prev, &ret, orig_dev);
> -	if (!skb)
> -		goto out;
> +	/* Handle special case of bridge or macvlan */
> +	if ((rh = rcu_dereference(skb->dev->rx_handler)) != NULL) {
> +		if (pt_prev) {
> +			ret = deliver_skb(skb, pt_prev, orig_dev);

What happens with 'ret' here? It is completely ignored, isn't it? Because you assume that there is only a single receiver per device? I think it would be much better to have return codes that indicate whether a packet has been consumed by a receive handler, or whether it is supposed to be processed further. The same actually applies for the packet handlers that are processed a bit further down in netif_receive_skb() - the return code is ignored and so a handler has no control over further processing of the packet.

^ permalink raw reply

* [RFC nf-next-2.6] conntrack: per cpu nf_conntrack_untracked
From: Eric Dumazet @ 2010-06-01 16:20 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: Netfilter Developers, netdev
In-Reply-To: <4C04E3E2.7020209@trash.net>

Le mardi 01 juin 2010 à 12:41 +0200, Patrick McHardy a écrit :

> > BTW, I notice nf_conntrack_untracked is incorrectly annotated
> > '__read_mostly'.
> > 
> > It can be written very often :(
> > 
> > Should'nt we special case it and let be really const ?
> 
> That would need quite a bit of special-casing to avoid touching
> the reference counts. So far this is completely hidden, so I'd
> say it just shouldn't be marked __read_mostly. Alternatively we
> can make "untracked" a nfctinfo state.

I tried this suggestion, (a new IP_CT_UNTRACKED ctinfo), over a per_cpu
untracked ct, but its a bit hard.

For example, I cannot find a way to change ctnetlink_conntrack_event() :

	if (ct == &nf_conntrack_untracked)
		return 0;

Maybe this piece of code is not necessary, we should not come here
anyway, or it means several packets could store events for this (shared)
ct ?

Obviously, an IPS_UNTRACKED bit would be much easier to implement.
Would it be acceptable ?


Preliminary patch with IP_CT_UNTRACKED, probably not working at all...

 include/linux/netfilter/nf_conntrack_common.h  |    3 +
 include/net/netfilter/nf_conntrack.h           |   11 +++--
 include/net/netfilter/nf_conntrack_core.h      |    2 
 net/ipv4/netfilter/nf_nat_core.c               |    4 +
 net/ipv4/netfilter/nf_nat_standalone.c         |    2 
 net/ipv6/netfilter/nf_conntrack_proto_icmpv6.c |    4 -
 net/netfilter/nf_conntrack_core.c              |   32 +++++++++------
 net/netfilter/nf_conntrack_netlink.c           |    6 +-
 net/netfilter/xt_CT.c                          |   13 +++---
 net/netfilter/xt_NOTRACK.c                     |    4 -
 net/netfilter/xt_TEE.c                         |    8 +--
 net/netfilter/xt_cluster.c                     |    2 
 net/netfilter/xt_conntrack.c                   |    2 
 net/netfilter/xt_socket.c                      |    2 
 14 files changed, 58 insertions(+), 37 deletions(-)

diff --git a/include/linux/netfilter/nf_conntrack_common.h b/include/linux/netfilter/nf_conntrack_common.h
index 14e6d32..5f7c947 100644
--- a/include/linux/netfilter/nf_conntrack_common.h
+++ b/include/linux/netfilter/nf_conntrack_common.h
@@ -15,6 +15,9 @@ enum ip_conntrack_info {
            IP_CT_DIR_ORIGINAL); may be a retransmission. */
 	IP_CT_NEW,
 
+	/* Untracked */
+	IP_CT_UNTRACKED,
+
 	/* >= this indicates reply direction */
 	IP_CT_IS_REPLY,
 
diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h
index bde095f..884ade9 100644
--- a/include/net/netfilter/nf_conntrack.h
+++ b/include/net/netfilter/nf_conntrack.h
@@ -175,7 +175,7 @@ static inline struct nf_conn *
 nf_ct_get(const struct sk_buff *skb, enum ip_conntrack_info *ctinfo)
 {
 	*ctinfo = skb->nfctinfo;
-	return (struct nf_conn *)skb->nfct;
+	return container_of(skb->nfct, struct nf_conn, ct_general);
 }
 
 /* decrement reference count on a conntrack */
@@ -261,7 +261,7 @@ extern s16 (*nf_ct_nat_offset)(const struct nf_conn *ct,
 			       u32 seq);
 
 /* Fake conntrack entry for untracked connections */
-extern struct nf_conn nf_conntrack_untracked;
+DECLARE_PER_CPU(struct nf_conn, pcpu_nf_conntrack_untracked);
 
 /* Iterate over all conntracks: if iter returns true, it's deleted. */
 extern void
@@ -291,7 +291,12 @@ static inline int nf_ct_is_dying(struct nf_conn *ct)
 
 static inline int nf_ct_is_untracked(const struct sk_buff *skb)
 {
-	return (skb->nfct == &nf_conntrack_untracked.ct_general);
+	return (skb->nfctinfo == IP_CT_UNTRACKED);
+}
+
+static inline int nf_ct_is_tracked(const struct sk_buff *skb)
+{
+	return (skb->nfctinfo != IP_CT_UNTRACKED);
 }
 
 extern int nf_conntrack_set_hashsize(const char *val, struct kernel_param *kp);
diff --git a/include/net/netfilter/nf_conntrack_core.h b/include/net/netfilter/nf_conntrack_core.h
index 3d7524f..8dd05ea 100644
--- a/include/net/netfilter/nf_conntrack_core.h
+++ b/include/net/netfilter/nf_conntrack_core.h
@@ -60,7 +60,7 @@ static inline int nf_conntrack_confirm(struct sk_buff *skb)
 	struct nf_conn *ct = (struct nf_conn *)skb->nfct;
 	int ret = NF_ACCEPT;
 
-	if (ct && ct != &nf_conntrack_untracked) {
+	if (ct && nf_ct_is_tracked(skb)) {
 		if (!nf_ct_is_confirmed(ct))
 			ret = __nf_conntrack_confirm(skb);
 		if (likely(ret == NF_ACCEPT))
diff --git a/net/ipv4/netfilter/nf_nat_core.c b/net/ipv4/netfilter/nf_nat_core.c
index 4f8bddb..a797999 100644
--- a/net/ipv4/netfilter/nf_nat_core.c
+++ b/net/ipv4/netfilter/nf_nat_core.c
@@ -719,6 +719,7 @@ static int __init nf_nat_init(void)
 {
 	size_t i;
 	int ret;
+	int cpu;
 
 	need_ipv4_conntrack();
 
@@ -742,7 +743,8 @@ static int __init nf_nat_init(void)
 	spin_unlock_bh(&nf_nat_lock);
 
 	/* Initialize fake conntrack so that NAT will skip it */
-	nf_conntrack_untracked.status |= IPS_NAT_DONE_MASK;
+	for_each_possible_cpu(cpu)
+		per_cpu(pcpu_nf_conntrack_untracked,cpu).status |= IPS_NAT_DONE_MASK;
 
 	l3proto = nf_ct_l3proto_find_get((u_int16_t)AF_INET);
 
diff --git a/net/ipv4/netfilter/nf_nat_standalone.c b/net/ipv4/netfilter/nf_nat_standalone.c
index beb2581..17af2bb 100644
--- a/net/ipv4/netfilter/nf_nat_standalone.c
+++ b/net/ipv4/netfilter/nf_nat_standalone.c
@@ -98,7 +98,7 @@ nf_nat_fn(unsigned int hooknum,
 		return NF_ACCEPT;
 
 	/* Don't try to NAT if this packet is not conntracked */
-	if (ct == &nf_conntrack_untracked)
+	if (ctinfo == IP_CT_UNTRACKED)
 		return NF_ACCEPT;
 
 	nat = nfct_nat(ct);
diff --git a/net/ipv6/netfilter/nf_conntrack_proto_icmpv6.c b/net/ipv6/netfilter/nf_conntrack_proto_icmpv6.c
index 9be8177..b67029c 100644
--- a/net/ipv6/netfilter/nf_conntrack_proto_icmpv6.c
+++ b/net/ipv6/netfilter/nf_conntrack_proto_icmpv6.c
@@ -208,8 +208,8 @@ icmpv6_error(struct net *net, struct nf_conn *tmpl,
 	type = icmp6h->icmp6_type - 130;
 	if (type >= 0 && type < sizeof(noct_valid_new) &&
 	    noct_valid_new[type]) {
-		skb->nfct = &nf_conntrack_untracked.ct_general;
-		skb->nfctinfo = IP_CT_NEW;
+		skb->nfct = &__get_cpu_var(pcpu_nf_conntrack_untracked).ct_general;
+		skb->nfctinfo = IP_CT_UNTRACKED;
 		nf_conntrack_get(skb->nfct);
 		return NF_ACCEPT;
 	}
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index eeeb8bc..eea5df1 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -62,8 +62,8 @@ EXPORT_SYMBOL_GPL(nf_conntrack_htable_size);
 unsigned int nf_conntrack_max __read_mostly;
 EXPORT_SYMBOL_GPL(nf_conntrack_max);
 
-struct nf_conn nf_conntrack_untracked __read_mostly;
-EXPORT_SYMBOL_GPL(nf_conntrack_untracked);
+DEFINE_PER_CPU(struct nf_conn, pcpu_nf_conntrack_untracked);
+EXPORT_PER_CPU_SYMBOL(pcpu_nf_conntrack_untracked);
 
 static int nf_conntrack_hash_rnd_initted;
 static unsigned int nf_conntrack_hash_rnd;
@@ -1185,10 +1185,16 @@ static void nf_ct_release_dying_list(struct net *net)
 
 static void nf_conntrack_cleanup_init_net(void)
 {
-	/* wait until all references to nf_conntrack_untracked are dropped */
-	while (atomic_read(&nf_conntrack_untracked.ct_general.use) > 1)
+	int cpu, use;
+	for (;;) {
+		use = 0;
+		for_each_possible_cpu(cpu)
+			use += atomic_read(&per_cpu(pcpu_nf_conntrack_untracked, cpu).ct_general.use) - 1;
+		/* wait until all references to nf_conntrack_untracked are dropped */
+		if (!use)
+			break;
 		schedule();
-
+	}
 	nf_conntrack_helper_fini();
 	nf_conntrack_proto_fini();
 #ifdef CONFIG_NF_CONNTRACK_ZONES
@@ -1325,6 +1331,7 @@ static int nf_conntrack_init_init_net(void)
 {
 	int max_factor = 8;
 	int ret;
+	int cpu;
 
 	/* Idea from tcp.c: use 1/16384 of memory.  On i386: 32MB
 	 * machine has 512 buckets. >= 1GB machines have 16384 buckets. */
@@ -1362,14 +1369,15 @@ static int nf_conntrack_init_init_net(void)
 	if (ret < 0)
 		goto err_extend;
 #endif
-	/* Set up fake conntrack: to never be deleted, not in any hashes */
-#ifdef CONFIG_NET_NS
-	nf_conntrack_untracked.ct_net = &init_net;
-#endif
-	atomic_set(&nf_conntrack_untracked.ct_general.use, 1);
-	/*  - and look it like as a confirmed connection */
-	set_bit(IPS_CONFIRMED_BIT, &nf_conntrack_untracked.status);
+	/* Set up fake conntracks: to never be deleted, not in any hashes */
+	for_each_possible_cpu(cpu) {
+		struct nf_conn *ct = &per_cpu(pcpu_nf_conntrack_untracked, cpu);
 
+		write_pnet(&ct->ct_net, &init_net);
+		atomic_set(&ct->ct_general.use, 1);
+	/*  - and look it like as a confirmed connection */
+		__set_bit(IPS_CONFIRMED_BIT, &ct->status);
+	}
 	return 0;
 
 #ifdef CONFIG_NF_CONNTRACK_ZONES
diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c
index c42ff6a..ac21514 100644
--- a/net/netfilter/nf_conntrack_netlink.c
+++ b/net/netfilter/nf_conntrack_netlink.c
@@ -479,9 +479,9 @@ ctnetlink_conntrack_event(unsigned int events, struct nf_ct_event *item)
 	unsigned int flags = 0, group;
 	int err;
 
-	/* ignore our fake conntrack entry */
-	if (ct == &nf_conntrack_untracked)
-		return 0;
+//	/* ignore our fake conntrack entry */
+//	if (ct == &nf_conntrack_untracked)
+//		return 0;
 
 	if (events & (1 << IPCT_DESTROY)) {
 		type = IPCTNL_MSG_CT_DELETE;
diff --git a/net/netfilter/xt_CT.c b/net/netfilter/xt_CT.c
index 562bf32..5723f9a 100644
--- a/net/netfilter/xt_CT.c
+++ b/net/netfilter/xt_CT.c
@@ -29,9 +29,13 @@ static unsigned int xt_ct_target(struct sk_buff *skb,
 	if (skb->nfct != NULL)
 		return XT_CONTINUE;
 
+	skb->nfctinfo = IP_CT_NEW;
+	if (info->flags & XT_CT_NOTRACK) {
+		ct = &__get_cpu_var(pcpu_nf_conntrack_untracked);
+		skb->nfctinfo = IP_CT_UNTRACKED;
+	}
 	atomic_inc(&ct->ct_general.use);
 	skb->nfct = &ct->ct_general;
-	skb->nfctinfo = IP_CT_NEW;
 
 	return XT_CONTINUE;
 }
@@ -67,8 +71,7 @@ static int xt_ct_tg_check(const struct xt_tgchk_param *par)
 		return -EINVAL;
 
 	if (info->flags & XT_CT_NOTRACK) {
-		ct = &nf_conntrack_untracked;
-		atomic_inc(&ct->ct_general.use);
+		ct = &__get_cpu_var(pcpu_nf_conntrack_untracked);
 		goto out;
 	}
 
@@ -132,14 +135,14 @@ static void xt_ct_tg_destroy(const struct xt_tgdtor_param *par)
 	struct nf_conn *ct = info->ct;
 	struct nf_conn_help *help;
 
-	if (ct != &nf_conntrack_untracked) {
+	if (!(info->flags & XT_CT_NOTRACK)) {
 		help = nfct_help(ct);
 		if (help)
 			module_put(help->helper->me);
 
 		nf_ct_l3proto_module_put(par->family);
+		nf_ct_put(info->ct);
 	}
-	nf_ct_put(info->ct);
 }
 
 static struct xt_target xt_ct_tg __read_mostly = {
diff --git a/net/netfilter/xt_NOTRACK.c b/net/netfilter/xt_NOTRACK.c
index 512b912..9547b58 100644
--- a/net/netfilter/xt_NOTRACK.c
+++ b/net/netfilter/xt_NOTRACK.c
@@ -23,8 +23,8 @@ notrack_tg(struct sk_buff *skb, const struct xt_action_param *par)
 	   If there is a real ct entry correspondig to this packet,
 	   it'll hang aroun till timing out. We don't deal with it
 	   for performance reasons. JK */
-	skb->nfct = &nf_conntrack_untracked.ct_general;
-	skb->nfctinfo = IP_CT_NEW;
+	skb->nfct = &__get_cpu_var(pcpu_nf_conntrack_untracked).ct_general;
+	skb->nfctinfo = IP_CT_UNTRACKED;
 	nf_conntrack_get(skb->nfct);
 
 	return XT_CONTINUE;
diff --git a/net/netfilter/xt_TEE.c b/net/netfilter/xt_TEE.c
index 859d9fd..b8e46b3 100644
--- a/net/netfilter/xt_TEE.c
+++ b/net/netfilter/xt_TEE.c
@@ -104,8 +104,8 @@ tee_tg4(struct sk_buff *skb, const struct xt_action_param *par)
 #ifdef WITH_CONNTRACK
 	/* Avoid counting cloned packets towards the original connection. */
 	nf_conntrack_put(skb->nfct);
-	skb->nfct     = &nf_conntrack_untracked.ct_general;
-	skb->nfctinfo = IP_CT_NEW;
+	skb->nfct     = &__get_cpu_var(pcpu_nf_conntrack_untracked).ct_general;
+	skb->nfctinfo = IP_CT_UNTRACKED;
 	nf_conntrack_get(skb->nfct);
 #endif
 	/*
@@ -177,8 +177,8 @@ tee_tg6(struct sk_buff *skb, const struct xt_action_param *par)
 
 #ifdef WITH_CONNTRACK
 	nf_conntrack_put(skb->nfct);
-	skb->nfct     = &nf_conntrack_untracked.ct_general;
-	skb->nfctinfo = IP_CT_NEW;
+	skb->nfct     = &__get_cpu_var(pcpu_nf_conntrack_untracked).ct_general;
+	skb->nfctinfo = IP_CT_UNTRACKED;
 	nf_conntrack_get(skb->nfct);
 #endif
 	if (par->hooknum == NF_INET_PRE_ROUTING ||
diff --git a/net/netfilter/xt_cluster.c b/net/netfilter/xt_cluster.c
index 30b95a1..b26f94d 100644
--- a/net/netfilter/xt_cluster.c
+++ b/net/netfilter/xt_cluster.c
@@ -120,7 +120,7 @@ xt_cluster_mt(const struct sk_buff *skb, struct xt_action_param *par)
 	if (ct == NULL)
 		return false;
 
-	if (ct == &nf_conntrack_untracked)
+	if (nf_ct_is_untracked(skb))
 		return false;
 
 	if (ct->master)
diff --git a/net/netfilter/xt_conntrack.c b/net/netfilter/xt_conntrack.c
index 39681f1..95bcfbb 100644
--- a/net/netfilter/xt_conntrack.c
+++ b/net/netfilter/xt_conntrack.c
@@ -123,7 +123,7 @@ conntrack_mt(const struct sk_buff *skb, struct xt_action_param *par,
 
 	ct = nf_ct_get(skb, &ctinfo);
 
-	if (ct == &nf_conntrack_untracked)
+	if (nf_ct_is_untracked(skb))
 		statebit = XT_CONNTRACK_STATE_UNTRACKED;
 	else if (ct != NULL)
 		statebit = XT_CONNTRACK_STATE_BIT(ctinfo);
diff --git a/net/netfilter/xt_socket.c b/net/netfilter/xt_socket.c
index 3d54c23..1f760b5 100644
--- a/net/netfilter/xt_socket.c
+++ b/net/netfilter/xt_socket.c
@@ -127,7 +127,7 @@ socket_match(const struct sk_buff *skb, struct xt_action_param *par,
 	 * reply packet of an established SNAT-ted connection. */
 
 	ct = nf_ct_get(skb, &ctinfo);
-	if (ct && (ct != &nf_conntrack_untracked) &&
+	if (ct && nf_ct_is_tracked(skb) &&
 	    ((iph->protocol != IPPROTO_ICMP &&
 	      ctinfo == IP_CT_IS_REPLY + IP_CT_ESTABLISHED) ||
 	     (iph->protocol == IPPROTO_ICMP &&



^ permalink raw reply related

* [PATCH] Refactor update of IPv6 flow destination address for srcrt (RH) option
From: Arnaud Ebalard @ 2010-06-01 16:20 UTC (permalink / raw)
  To: David S. Miller
  Cc: YOSHIFUJI Hideaki / 吉藤英明, netdev

Hi,

While playing with associated code, I noticed there are more than a
dozen occurrences of the following in the IPv6 stack:

    if (opt && opt->srcrt) {
            struct rt0_hdr *rt0 = (struct rt0_hdr *) opt->srcrt;
            ipv6_addr_copy(&final, &fl.fl6_dst);
            ipv6_addr_copy(&fl.fl6_dst, rt0->addr);
            final_p = &final;
    }

Maybe this can be replaced with a helper (not sure about the name
though).


Signed-off-by: Arnaud Ebalard <arno@natisbad.org>
---
 include/net/ipv6.h               |   13 +++++++++++++
 net/dccp/ipv6.c                  |   24 +++---------------------
 net/ipv6/af_inet6.c              |    7 +------
 net/ipv6/datagram.c              |   16 +++-------------
 net/ipv6/inet6_connection_sock.c |    7 +------
 net/ipv6/raw.c                   |    8 +-------
 net/ipv6/syncookies.c            |    7 +------
 net/ipv6/tcp_ipv6.c              |   21 +++------------------
 net/ipv6/udp.c                   |    9 ++-------
 9 files changed, 28 insertions(+), 84 deletions(-)

diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index 2600b69..9edfa1c 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -290,6 +290,19 @@ static inline void ipv6_addr_copy(struct in6_addr *a1, const struct in6_addr *a2
 	memcpy(a1, a2, sizeof(struct in6_addr));
 }
 
+static inline struct in6_addr *srcrt_dst_flow_update(struct in6_addr *final,
+						     struct in6_addr *fl6dst,
+						     const struct ipv6_txoptions *opt)
+{
+	if (opt && opt->srcrt) {
+		const struct rt0_hdr *rt0 = (struct rt0_hdr *)opt->srcrt;
+		ipv6_addr_copy(final, fl6dst);
+		ipv6_addr_copy(fl6dst, rt0->addr);
+		return final;
+	}
+	return NULL;
+}
+
 static inline void ipv6_addr_prefix(struct in6_addr *pfx, 
 				    const struct in6_addr *addr,
 				    int plen)
diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
index 0916988..fc435ba 100644
--- a/net/dccp/ipv6.c
+++ b/net/dccp/ipv6.c
@@ -265,13 +265,7 @@ static int dccp_v6_send_response(struct sock *sk, struct request_sock *req,
 
 	opt = np->opt;
 
-	if (opt != NULL && opt->srcrt != NULL) {
-		const struct rt0_hdr *rt0 = (struct rt0_hdr *)opt->srcrt;
-
-		ipv6_addr_copy(&final, &fl.fl6_dst);
-		ipv6_addr_copy(&fl.fl6_dst, rt0->addr);
-		final_p = &final;
-	}
+	final_p = srcrt_dst_flow_update(&final, &fl.fl6_dst, opt);
 
 	err = ip6_dst_lookup(sk, &dst, &fl);
 	if (err)
@@ -551,13 +545,7 @@ static struct sock *dccp_v6_request_recv_sock(struct sock *sk,
 		memset(&fl, 0, sizeof(fl));
 		fl.proto = IPPROTO_DCCP;
 		ipv6_addr_copy(&fl.fl6_dst, &ireq6->rmt_addr);
-		if (opt != NULL && opt->srcrt != NULL) {
-			const struct rt0_hdr *rt0 = (struct rt0_hdr *)opt->srcrt;
-
-			ipv6_addr_copy(&final, &fl.fl6_dst);
-			ipv6_addr_copy(&fl.fl6_dst, rt0->addr);
-			final_p = &final;
-		}
+		final_p = srcrt_dst_flow_update(&final, &fl.fl6_dst, opt);
 		ipv6_addr_copy(&fl.fl6_src, &ireq6->loc_addr);
 		fl.oif = sk->sk_bound_dev_if;
 		fl.fl_ip_dport = inet_rsk(req)->rmt_port;
@@ -988,13 +976,7 @@ static int dccp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
 	fl.fl_ip_sport = inet->inet_sport;
 	security_sk_classify_flow(sk, &fl);
 
-	if (np->opt != NULL && np->opt->srcrt != NULL) {
-		const struct rt0_hdr *rt0 = (struct rt0_hdr *)np->opt->srcrt;
-
-		ipv6_addr_copy(&final, &fl.fl6_dst);
-		ipv6_addr_copy(&fl.fl6_dst, rt0->addr);
-		final_p = &final;
-	}
+	final_p = srcrt_dst_flow_update(&final, &fl.fl6_dst, np->opt);
 
 	err = ip6_dst_lookup(sk, &dst, &fl);
 	if (err)
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index e733942..d6e6491 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -665,12 +665,7 @@ int inet6_sk_rebuild_header(struct sock *sk)
 		fl.fl_ip_sport = inet->inet_sport;
 		security_sk_classify_flow(sk, &fl);
 
-		if (np->opt && np->opt->srcrt) {
-			struct rt0_hdr *rt0 = (struct rt0_hdr *) np->opt->srcrt;
-			ipv6_addr_copy(&final, &fl.fl6_dst);
-			ipv6_addr_copy(&fl.fl6_dst, rt0->addr);
-			final_p = &final;
-		}
+		final_p = srcrt_dst_flow_update(&final, &fl.fl6_dst, np->opt);
 
 		err = ip6_dst_lookup(sk, &dst, &fl);
 		if (err) {
diff --git a/net/ipv6/datagram.c b/net/ipv6/datagram.c
index 7126846..2fcba06 100644
--- a/net/ipv6/datagram.c
+++ b/net/ipv6/datagram.c
@@ -42,6 +42,7 @@ int ip6_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 	struct dst_entry	*dst;
 	struct flowi		fl;
 	struct ip6_flowlabel	*flowlabel = NULL;
+	struct ipv6_txoptions   *opt;
 	int			addr_type;
 	int			err;
 
@@ -155,19 +156,8 @@ ipv4_connected:
 
 	security_sk_classify_flow(sk, &fl);
 
-	if (flowlabel) {
-		if (flowlabel->opt && flowlabel->opt->srcrt) {
-			struct rt0_hdr *rt0 = (struct rt0_hdr *) flowlabel->opt->srcrt;
-			ipv6_addr_copy(&final, &fl.fl6_dst);
-			ipv6_addr_copy(&fl.fl6_dst, rt0->addr);
-			final_p = &final;
-		}
-	} else if (np->opt && np->opt->srcrt) {
-		struct rt0_hdr *rt0 = (struct rt0_hdr *)np->opt->srcrt;
-		ipv6_addr_copy(&final, &fl.fl6_dst);
-		ipv6_addr_copy(&fl.fl6_dst, rt0->addr);
-		final_p = &final;
-	}
+	opt = flowlabel ? flowlabel->opt : np->opt;
+	final_p = srcrt_dst_flow_update(&final, &fl.fl6_dst, opt);
 
 	err = ip6_dst_lookup(sk, &dst, &fl);
 	if (err)
diff --git a/net/ipv6/inet6_connection_sock.c b/net/ipv6/inet6_connection_sock.c
index 0c5e3c3..505c8de 100644
--- a/net/ipv6/inet6_connection_sock.c
+++ b/net/ipv6/inet6_connection_sock.c
@@ -199,12 +199,7 @@ int inet6_csk_xmit(struct sk_buff *skb)
 	fl.fl_ip_dport = inet->inet_dport;
 	security_sk_classify_flow(sk, &fl);
 
-	if (np->opt && np->opt->srcrt) {
-		struct rt0_hdr *rt0 = (struct rt0_hdr *)np->opt->srcrt;
-		ipv6_addr_copy(&final, &fl.fl6_dst);
-		ipv6_addr_copy(&fl.fl6_dst, rt0->addr);
-		final_p = &final;
-	}
+	final_p = srcrt_dst_flow_update(&final, &fl.fl6_dst, np->opt);
 
 	dst = __inet6_csk_dst_check(sk, np->dst_cookie);
 
diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
index 4a4dcbe..91211bd 100644
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -847,13 +847,7 @@ static int rawv6_sendmsg(struct kiocb *iocb, struct sock *sk,
 	if (ipv6_addr_any(&fl.fl6_src) && !ipv6_addr_any(&np->saddr))
 		ipv6_addr_copy(&fl.fl6_src, &np->saddr);
 
-	/* merge ip6_build_xmit from ip6_output */
-	if (opt && opt->srcrt) {
-		struct rt0_hdr *rt0 = (struct rt0_hdr *) opt->srcrt;
-		ipv6_addr_copy(&final, &fl.fl6_dst);
-		ipv6_addr_copy(&fl.fl6_dst, rt0->addr);
-		final_p = &final;
-	}
+	final_p = srcrt_dst_flow_update(&final, &fl.fl6_dst, opt);
 
 	if (!fl.oif && ipv6_addr_is_multicast(&fl.fl6_dst))
 		fl.oif = np->mcast_oif;
diff --git a/net/ipv6/syncookies.c b/net/ipv6/syncookies.c
index 34d1f06..7d27bc7 100644
--- a/net/ipv6/syncookies.c
+++ b/net/ipv6/syncookies.c
@@ -245,12 +245,7 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
 		memset(&fl, 0, sizeof(fl));
 		fl.proto = IPPROTO_TCP;
 		ipv6_addr_copy(&fl.fl6_dst, &ireq6->rmt_addr);
-		if (np->opt && np->opt->srcrt) {
-			struct rt0_hdr *rt0 = (struct rt0_hdr *) np->opt->srcrt;
-			ipv6_addr_copy(&final, &fl.fl6_dst);
-			ipv6_addr_copy(&fl.fl6_dst, rt0->addr);
-			final_p = &final;
-		}
+		final_p = srcrt_dst_flow_update(&final, &fl.fl6_dst, np->opt);
 		ipv6_addr_copy(&fl.fl6_src, &ireq6->loc_addr);
 		fl.oif = sk->sk_bound_dev_if;
 		fl.mark = sk->sk_mark;
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 2b7c3a1..2dcb536 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -250,12 +250,7 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
 	fl.fl_ip_dport = usin->sin6_port;
 	fl.fl_ip_sport = inet->inet_sport;
 
-	if (np->opt && np->opt->srcrt) {
-		struct rt0_hdr *rt0 = (struct rt0_hdr *)np->opt->srcrt;
-		ipv6_addr_copy(&final, &fl.fl6_dst);
-		ipv6_addr_copy(&fl.fl6_dst, rt0->addr);
-		final_p = &final;
-	}
+	final_p = srcrt_dst_flow_update(&final, &fl.fl6_dst, np->opt);
 
 	security_sk_classify_flow(sk, &fl);
 
@@ -494,12 +489,7 @@ static int tcp_v6_send_synack(struct sock *sk, struct request_sock *req,
 	security_req_classify_flow(req, &fl);
 
 	opt = np->opt;
-	if (opt && opt->srcrt) {
-		struct rt0_hdr *rt0 = (struct rt0_hdr *) opt->srcrt;
-		ipv6_addr_copy(&final, &fl.fl6_dst);
-		ipv6_addr_copy(&fl.fl6_dst, rt0->addr);
-		final_p = &final;
-	}
+	final_p = srcrt_dst_flow_update(&final, &fl.fl6_dst, opt);
 
 	err = ip6_dst_lookup(sk, &dst, &fl);
 	if (err)
@@ -1398,12 +1388,7 @@ static struct sock * tcp_v6_syn_recv_sock(struct sock *sk, struct sk_buff *skb,
 		memset(&fl, 0, sizeof(fl));
 		fl.proto = IPPROTO_TCP;
 		ipv6_addr_copy(&fl.fl6_dst, &treq->rmt_addr);
-		if (opt && opt->srcrt) {
-			struct rt0_hdr *rt0 = (struct rt0_hdr *) opt->srcrt;
-			ipv6_addr_copy(&final, &fl.fl6_dst);
-			ipv6_addr_copy(&fl.fl6_dst, rt0->addr);
-			final_p = &final;
-		}
+		final_p = srcrt_dst_flow_update(&final, &fl.fl6_dst, opt);
 		ipv6_addr_copy(&fl.fl6_src, &treq->loc_addr);
 		fl.oif = sk->sk_bound_dev_if;
 		fl.mark = sk->sk_mark;
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 87be586..a46ad07 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -1097,14 +1097,9 @@ do_udp_sendmsg:
 		ipv6_addr_copy(&fl.fl6_src, &np->saddr);
 	fl.fl_ip_sport = inet->inet_sport;
 
-	/* merge ip6_build_xmit from ip6_output */
-	if (opt && opt->srcrt) {
-		struct rt0_hdr *rt0 = (struct rt0_hdr *) opt->srcrt;
-		ipv6_addr_copy(&final, &fl.fl6_dst);
-		ipv6_addr_copy(&fl.fl6_dst, rt0->addr);
-		final_p = &final;
+	final_p = srcrt_dst_flow_update(&final, &fl.fl6_dst, opt);
+	if (final_p)
 		connected = 0;
-	}
 
 	if (!fl.oif && ipv6_addr_is_multicast(&fl.fl6_dst)) {
 		fl.oif = np->mcast_oif;
-- 
1.7.1



^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox