Netdev List
* Re: [PATCH] drivers/net/tun.c: test for CAP_NET_ADMIN before attaching
From: David Miller @ 2011-07-29 14:36 UTC (permalink / raw)
  To: petar; +Cc: netdev
In-Reply-To: <20110729142808.GA25695@pintail.smokva.net>

From: Petar Bogdanovic <petar@smokva.net>
Date: Fri, 29 Jul 2011 16:28:08 +0200

> The following change will test for CAP_NET_ADMIN before attaching, even
> if both tun->owner and tun->group equal -1.  The latter scenario can be
> reproduced with ip(8) from iproute2, when using `ip tuntap' without the
> `user' and `group' option.

Please read linux/Documentation/SubmittingPatches, for example you
need to provide a proper "Signed-off-by: " tag at the end of your
commit message.

^ permalink raw reply

* Re: PROBLEM: BUG (NULL ptr dereference in ipv4_dst_check)
From: Eric Dumazet @ 2011-07-29 14:36 UTC (permalink / raw)
  To: synapse; +Cc: netdev
In-Reply-To: <4E32C302.8050304@hippy.csoma.elte.hu>

On Friday, 29 July 2011 at 16:26 +0200, synapse wrote:
> On 07/29/11 15:33, Eric Dumazet wrote:
> > On Friday, 29 July 2011 at 15:18 +0200, synapse wrote:
> >> Hello guys,
> >>
> >> I have a problem that I hope you can help me resolve. This is my first
> >> real bug report, so please be
> >> patient :)
> >>
> >> ### Description:
> >> 3.0.0-rc4 routinely locks up with BUG: unable to handle kernel NULL
> >> pointer dereference at 000000000000002c
> >> I have an intel sr2600 machine with a 10Gbit interface, it periodically
> >> locks up after a few days.
> >> It serves a lot of traffic. The trace is at the end of the mail.
> >> ###
> >>
> >> ### My efforts:
> >> I've traced the error back from atomic_dec_and_test() to:
> >>
> >> ipv4_dst_check()
> >> check_peer_redir()
> >> neigh_release()
> >> atomic_dec_and_test()
> >>
> >> The parameter to atomic_dec_and_test() is NULL (&neigh->refcnt in
> >> neigh_release), so atomic_dec_and_test()
> >> at /arch/x86/include/asm/atomic.h dies at offset 0xffffffff8140f56f.
> >>
> >> ffffffff8140f560:       48 8b 15 19 47 2f 00    mov
> >> 0x2f4719(%rip),%rdx        # 0xffffffff81703c80
> >> ffffffff8140f567:       48 89 50 18             mov    %rdx,0x18(%rax)
> >> ffffffff8140f56b:       48 8b 7b 40             mov    0x40(%rbx),%rdi
> >> ffffffff8140f56f:       f0 ff 4f 2c             lock decl 0x2c(%rdi)
> >> ffffffff8140f573:       0f 94 c0                sete   %al
> >> ffffffff8140f576:       84 c0                   test   %al,%al
> >> ffffffff8140f578:       0f 85 ab 00 00 00       jne    0xffffffff8140f629
> >>
> >>   From what I've seen is that this code is responsible for pmtu related
> >> things. The refcount member of struct neighbour
> >> is NULL and the neigh pointer (struct neighbour *) in neigh_release() is
> >> not. I have no clue how this might happen,
> >> though I suspect somebody releases the data structure somehow. Note that
> >> this code is invoked when redirect_learned.a4
> >> is set and is different from rt_gateway in ipv4_dst_check().
> >>
> >> Is it possible that two packets go to two different cores for processing
> >> and one core invalidates the rt entry
> >> the other is currently working on (meaning the second will try to
> >> dereference a NULL ptr)?
> >> ###
> >>
> >>
> >> This is just my clumsy attempt at tracking this down, I'm not a kernel
> >> expert unfortunately. I'm happy to provide
> >> further info on the matter. If I'm completely on the wrong track please
> >> let me know.
> >>
> >> Thank you for any help,
> >> Gergely Kalman
> >>
> > This bug was probably already fixed.
> >
> > Please try current linux tree
> >
> >
> found no relevant things in the diffs, except for a check against 
> DST_NOCOUNT
> when calling dst_entries_add(opc, 1). Will try with the new kernel, but 
> unfortunately
> it might take days to reproduce.

Hmm, I'll take a look, but check_peer_redir() seems suspicious at first
glance.




^ permalink raw reply

* [PATCH] drivers/net/tun.c: test for CAP_NET_ADMIN before attaching
From: Petar Bogdanovic @ 2011-07-29 14:28 UTC (permalink / raw)
  To: netdev

The following change will test for CAP_NET_ADMIN before attaching, even
if both tun->owner and tun->group equal -1.  The latter scenario can be
reproduced with ip(8) from iproute2, when using `ip tuntap' without the
`user' and `group' option.
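
To make the behavioural difference concrete, here is a minimal user-space
sketch (not kernel code) of the two predicates from the hunk below. The
struct attach_ctx and both function names are made up for illustration;
the fields stand in for tun->owner, tun->group, cred->euid, the result of
in_egroup_p() and the result of capable(CAP_NET_ADMIN).

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the state consulted by tun_set_iff(). */
struct attach_ctx {
	int owner;      /* tun->owner, -1 when unset */
	int group;      /* tun->group, -1 when unset */
	int euid;       /* caller's effective uid */
	bool in_group;  /* result of in_egroup_p(tun->group) */
	bool net_admin; /* result of capable(CAP_NET_ADMIN) */
};

/* Check before the patch: true means the attach is refused with -EPERM. */
static bool old_check_denies(const struct attach_ctx *c)
{
	return ((c->owner != -1 && c->euid != c->owner) ||
		(c->group != -1 && !c->in_group)) &&
	       !c->net_admin;
}

/* Check after the patch: true means the attach is refused with -EPERM. */
static bool new_check_denies(const struct attach_ctx *c)
{
	return !((c->owner != -1 && c->euid == c->owner) ||
		 (c->group != -1 && c->in_group) ||
		 c->net_admin);
}
```

With owner and group both -1 and no CAP_NET_ADMIN, old_check_denies()
returns false and new_check_denies() returns true. Note that the two
predicates also differ when the owner matches but the group does not: the
old form refuses, the new form allows.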


diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 55f3a3e..2ac2cbc 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -988,35 +988,35 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
 	struct net_device *dev;
 	int err;
 
 	dev = __dev_get_by_name(net, ifr->ifr_name);
 	if (dev) {
 		const struct cred *cred = current_cred();
 
 		if (ifr->ifr_flags & IFF_TUN_EXCL)
 			return -EBUSY;
 		if ((ifr->ifr_flags & IFF_TUN) && dev->netdev_ops == &tun_netdev_ops)
 			tun = netdev_priv(dev);
 		else if ((ifr->ifr_flags & IFF_TAP) && dev->netdev_ops == &tap_netdev_ops)
 			tun = netdev_priv(dev);
 		else
 			return -EINVAL;
 
-		if (((tun->owner != -1 && cred->euid != tun->owner) ||
-		     (tun->group != -1 && !in_egroup_p(tun->group))) &&
-		    !capable(CAP_NET_ADMIN))
+		if (!(tun->owner != -1 && cred->euid == tun->owner ||
+		      tun->group != -1 && in_egroup_p(tun->group) ||
+		      capable(CAP_NET_ADMIN)))
 			return -EPERM;
 		err = security_tun_dev_attach(tun->socket.sk);
 		if (err < 0)
 			return err;
 
 		err = tun_attach(tun, file);
 		if (err < 0)
 			return err;
 	}
 	else {
 		char *name;
 		unsigned long flags = 0;
 
 		if (!capable(CAP_NET_ADMIN))
 			return -EPERM;
 		err = security_tun_dev_create();

^ permalink raw reply related

* Re: [GIT PULL nf-2.6] IPVS
From: Patrick McHardy @ 2011-07-29 14:26 UTC (permalink / raw)
  To: David Miller
  Cc: horms, lvs-devel, netdev, netfilter-devel, netfilter, wensong, ja,
	pablo, davej, rdunlap, huajun.li.lee, davem
In-Reply-To: <20110728.183917.885521460762207540.davem@davemloft.net>

On 29.07.2011 03:39, David Miller wrote:
> From: Simon Horman <horms@verge.net.au>
> Date: Fri, 29 Jul 2011 09:12:06 +0900
> 
>> What is the best way forward to get this both in 3.1 and 3.0 -stable?
> 
> I'll take this directly and do the -stable submission too.
> 

Sorry, I seem to have missed this, also can't find the email in my
inbox for some reason. Anyways, generally please do the stable
submissions for IPVS yourself (after the patch went upstream of
course).

^ permalink raw reply

* Re: PROBLEM: BUG (NULL ptr dereference in ipv4_dst_check)
From: synapse @ 2011-07-29 14:26 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev
In-Reply-To: <1311946421.2843.16.camel@edumazet-HP-Compaq-6005-Pro-SFF-PC>

On 07/29/11 15:33, Eric Dumazet wrote:
> On Friday, 29 July 2011 at 15:18 +0200, synapse wrote:
>> Hello guys,
>>
>> I have a problem that I hope you can help me resolve. This is my first
>> real bug report, so please be
>> patient :)
>>
>> ### Description:
>> 3.0.0-rc4 routinely locks up with BUG: unable to handle kernel NULL
>> pointer dereference at 000000000000002c
>> I have an intel sr2600 machine with a 10Gbit interface, it periodically
>> locks up after a few days.
>> It serves a lot of traffic. The trace is at the end of the mail.
>> ###
>>
>> ### My efforts:
>> I've traced the error back from atomic_dec_and_test() to:
>>
>> ipv4_dst_check()
>> check_peer_redir()
>> neigh_release()
>> atomic_dec_and_test()
>>
>> The parameter to atomic_dec_and_test() is NULL (&neigh->refcnt in
>> neigh_release), so atomic_dec_and_test()
>> at /arch/x86/include/asm/atomic.h dies at offset 0xffffffff8140f56f.
>>
>> ffffffff8140f560:       48 8b 15 19 47 2f 00    mov
>> 0x2f4719(%rip),%rdx        # 0xffffffff81703c80
>> ffffffff8140f567:       48 89 50 18             mov    %rdx,0x18(%rax)
>> ffffffff8140f56b:       48 8b 7b 40             mov    0x40(%rbx),%rdi
>> ffffffff8140f56f:       f0 ff 4f 2c             lock decl 0x2c(%rdi)
>> ffffffff8140f573:       0f 94 c0                sete   %al
>> ffffffff8140f576:       84 c0                   test   %al,%al
>> ffffffff8140f578:       0f 85 ab 00 00 00       jne    0xffffffff8140f629
>>
>>   From what I've seen is that this code is responsible for pmtu related
>> things. The refcount member of struct neighbour
>> is NULL and the neigh pointer (struct neighbour *) in neigh_release() is
>> not. I have no clue how this might happen,
>> though I suspect somebody releases the data structure somehow. Note that
>> this code is invoked when redirect_learned.a4
>> is set and is different from rt_gateway in ipv4_dst_check().
>>
>> Is it possible that two packets go to two different cores for processing
>> and one core invalidates the rt entry
>> the other is currently working on (meaning the second will try to
>> dereference a NULL ptr)?
>> ###
>>
>>
>> This is just my clumsy attempt at tracking this down, I'm not a kernel
>> expert unfortunately. I'm happy to provide
>> further info on the matter. If I'm completely on the wrong track please
>> let me know.
>>
>> Thank you for any help,
>> Gergely Kalman
>>
> This bug was probably already fixed.
>
> Please try current linux tree
>
>
found no relevant things in the diffs, except for a check against 
DST_NOCOUNT
when calling dst_entries_add(opc, 1). Will try with the new kernel, but 
unfortunately
it might take days to reproduce.

Gergely Kalman


^ permalink raw reply

* Re: [PATCH] netfilter: xt_rateest: fix xt_rateest_mt_checkentry()
From: Patrick McHardy @ 2011-07-29 14:25 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Netfilter Development Mailinglist, netdev, Jan Engelhardt
In-Reply-To: <1311911462.7845.5.camel@edumazet-laptop>

On 29.07.2011 05:51, Eric Dumazet wrote:
> commit 4a5a5c73b7cfee (slightly better error reporting) added some
> useless code in xt_rateest_mt_checkentry().
> 
> Fix this so that different error codes can really be returned.
> 

Applied, thanks Eric.

^ permalink raw reply

* Re: [Bugme-new] [Bug 39372] New: Problems with HFSC Scheduler
From: Patrick McHardy @ 2011-07-29 14:11 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Michal Soltys, Andrew Morton, netdev, bugme-daemon,
	Jamal Hadi Salim, lucas.bocchi, 631945, 00bormoj, fdelawarde
In-Reply-To: <1311948052.2843.19.camel@edumazet-HP-Compaq-6005-Pro-SFF-PC>

On 29.07.2011 16:00, Eric Dumazet wrote:
> On Friday, 29 July 2011 at 15:27 +0200, Eric Dumazet wrote:
>> On Friday, 29 July 2011 at 14:29 +0200, Michal Soltys wrote:
>>> On 11-07-15 00:14, Andrew Morton wrote:
>>>>
>>>> (switched to email.  Please respond via emailed reply-to-all, not via
>>>> the bugzilla web interface).
>>>>
>>>>
>>>> Here: WARN_ON(next_time == 0);
>>>>
>>>
>>> From the other thread on netfilter-devel:
>>>
>>>> On 11-07-22 11:58, Michal Pokrywka wrote: After bisecting 2.6.39.1 it
>>>> turned out that the bug is caused independently by two patches:
>>>>
>>>> commit b262a5da755cc6ed0cb4fba230cd9bf4037e1096 sch_sfq: fix peek()
>>>> implementation
>>>>
>>>> and
>>>>
>>>> commit 9df49f2bfe862573911a080c75a6d81113c5c81d sch_sfq: avoid giving
>>>> spurious NET_XMIT_CN signals
>>>>
>>>> Reverting these patches makes HFSC work again.
>>>>
>>>
>>> This one (upstream 8efa885406359af300d46910642b50ca82c0fe47) seems to be
>>> the culprit (does reverting only that one cure the problem?)
>>>
>>> It allows SFQ to return success on enqueuing when the packet really
>>> replaced some other packet in some other flow. This confuses the outer
>>> qdisc (in this particular case HFSC), which thinks a new packet was
>>> actually added each time such a situation happens.
>>>
>>
>> Technically speaking, _this_ packet was successfully enqueued.
>>
>> Returning NET_XMIT_CN or NET_XMIT_SUCCESS should not trigger a bug in
>> caller.
>>
>>> This in turn causes additional dequeues and ends with attempt
>>> to schedule non-existent packets, and triggers the warning.
>>>
>>
>> Then it's probably a bug in HFSC: it doesn't understand that SFQ lost a
>> packet.
>>
>> I'll take a look, thanks for the report.
>>
>>
> 
> Oh well, it seems one qdisc_tree_decrease_qlen(sch, 1) is missing
> 
> Maybe following patch would help...
> 
> 
> diff --git a/net/sched/sch_sfq.c b/net/sched/sch_sfq.c
> index 4536ee6..2a2d287 100644
> --- a/net/sched/sch_sfq.c
> +++ b/net/sched/sch_sfq.c
> @@ -410,7 +410,12 @@ sfq_enqueue(struct sk_buff *skb, struct Qdisc *sch)
>  	/* Return Congestion Notification only if we dropped a packet
>  	 * from this flow.
>  	 */
> -	return (qlen != slot->qlen) ? NET_XMIT_CN : NET_XMIT_SUCCESS;
> +	if (qlen != slot->qlen)
> +		return NET_XMIT_CN;
> +
> +	/* as we dropped a packet, better let upper stack know this */
> +	qdisc_tree_decrease_qlen(sch, 1);
> +	return NET_XMIT_SUCCESS;
>  }
>  

Yeah, that seems to be the correct fix, thanks for looking into this.

^ permalink raw reply

* Re: [Bugme-new] [Bug 39372] New: Problems with HFSC Scheduler
From: Eric Dumazet @ 2011-07-29 14:00 UTC (permalink / raw)
  To: Michal Soltys
  Cc: Andrew Morton, netdev, bugme-daemon, Jamal Hadi Salim,
	lucas.bocchi, Patrick McHardy, 631945, 00bormoj, fdelawarde
In-Reply-To: <1311946060.2843.15.camel@edumazet-HP-Compaq-6005-Pro-SFF-PC>

On Friday, 29 July 2011 at 15:27 +0200, Eric Dumazet wrote:
> On Friday, 29 July 2011 at 14:29 +0200, Michal Soltys wrote:
> > On 11-07-15 00:14, Andrew Morton wrote:
> > > 
> > > (switched to email.  Please respond via emailed reply-to-all, not via
> > > the bugzilla web interface).
> > > 
> > > 
> > > Here: WARN_ON(next_time == 0);
> > > 
> > 
> > From the other thread on netfilter-devel:
> > 
> > > On 11-07-22 11:58, Michal Pokrywka wrote: After bisecting 2.6.39.1 it
> > > turned out that the bug is caused independently by two patches:
> > > 
> > > commit b262a5da755cc6ed0cb4fba230cd9bf4037e1096 sch_sfq: fix peek()
> > > implementation
> > > 
> > > and
> > > 
> > > commit 9df49f2bfe862573911a080c75a6d81113c5c81d sch_sfq: avoid giving
> > > spurious NET_XMIT_CN signals
> > > 
> > > Reverting these patches makes HFSC work again.
> > > 
> > 
> > This one (upstream 8efa885406359af300d46910642b50ca82c0fe47) seems to be
> > the culprit (does reverting only that one cure the problem?)
> > 
> > It allows SFQ to return success on enqueuing when the packet really
> > replaced some other packet in some other flow. This confuses the outer
> > qdisc (in this particular case HFSC), which thinks a new packet was
> > actually added each time such a situation happens.
> > 
> 
> Technically speaking, _this_ packet was successfully enqueued.
> 
> Returning NET_XMIT_CN or NET_XMIT_SUCCESS should not trigger a bug in
> caller.
> 
> > This in turn causes additional dequeues and ends with attempt
> > to schedule non-existent packets, and triggers the warning.
> > 
> 
> Then it's probably a bug in HFSC: it doesn't understand that SFQ lost a
> packet.
> 
> I'll take a look, thanks for the report.
> 
> 

Oh well, it seems one qdisc_tree_decrease_qlen(sch, 1) is missing

Maybe following patch would help...


diff --git a/net/sched/sch_sfq.c b/net/sched/sch_sfq.c
index 4536ee6..2a2d287 100644
--- a/net/sched/sch_sfq.c
+++ b/net/sched/sch_sfq.c
@@ -410,7 +410,12 @@ sfq_enqueue(struct sk_buff *skb, struct Qdisc *sch)
 	/* Return Congestion Notification only if we dropped a packet
 	 * from this flow.
 	 */
-	return (qlen != slot->qlen) ? NET_XMIT_CN : NET_XMIT_SUCCESS;
+	if (qlen != slot->qlen)
+		return NET_XMIT_CN;
+
+	/* as we dropped a packet, better let upper stack know this */
+	qdisc_tree_decrease_qlen(sch, 1);
+	return NET_XMIT_SUCCESS;
 }
 
 static struct sk_buff *



^ permalink raw reply related

* Re: PROBLEM: BUG (NULL ptr dereference in ipv4_dst_check)
From: Eric Dumazet @ 2011-07-29 13:33 UTC (permalink / raw)
  To: synapse; +Cc: netdev
In-Reply-To: <4E32B33C.2020103@hippy.csoma.elte.hu>

On Friday, 29 July 2011 at 15:18 +0200, synapse wrote:
> Hello guys,
> 
> I have a problem that I hope you can help me resolve. This is my first 
> real bug report, so please be
> patient :)
> 
> ### Description:
> 3.0.0-rc4 routinely locks up with BUG: unable to handle kernel NULL 
> pointer dereference at 000000000000002c
> I have an intel sr2600 machine with a 10Gbit interface, it periodically 
> locks up after a few days.
> It serves a lot of traffic. The trace is at the end of the mail.
> ###
> 
> ### My efforts:
> I've traced the error back from atomic_dec_and_test() to:
> 
> ipv4_dst_check()
> check_peer_redir()
> neigh_release()
> atomic_dec_and_test()
> 
> The parameter to atomic_dec_and_test() is NULL (&neigh->refcnt in 
> neigh_release), so atomic_dec_and_test()
> at /arch/x86/include/asm/atomic.h dies at offset 0xffffffff8140f56f.
> 
> ffffffff8140f560:       48 8b 15 19 47 2f 00    mov    
> 0x2f4719(%rip),%rdx        # 0xffffffff81703c80
> ffffffff8140f567:       48 89 50 18             mov    %rdx,0x18(%rax)
> ffffffff8140f56b:       48 8b 7b 40             mov    0x40(%rbx),%rdi
> ffffffff8140f56f:       f0 ff 4f 2c             lock decl 0x2c(%rdi)
> ffffffff8140f573:       0f 94 c0                sete   %al
> ffffffff8140f576:       84 c0                   test   %al,%al
> ffffffff8140f578:       0f 85 ab 00 00 00       jne    0xffffffff8140f629
> 
>  From what I've seen is that this code is responsible for pmtu related 
> things. The refcount member of struct neighbour
> is NULL and the neigh pointer (struct neighbour *) in neigh_release() is 
> not. I have no clue how this might happen,
> though I suspect somebody releases the data structure somehow. Note that 
> this code is invoked when redirect_learned.a4
> is set and is different from rt_gateway in ipv4_dst_check().
> 
> Is it possible that two packets go to two different cores for processing 
> and one core invalidates the rt entry
> the other is currently working on (meaning the second will try to 
> dereference a NULL ptr)?
> ###
> 
> 
> This is just my clumsy attempt at tracking this down, I'm not a kernel 
> expert unfortunately. I'm happy to provide
> further info on the matter. If I'm completely on the wrong track please 
> let me know.
> 
> Thank you for any help,
> Gergely Kalman
> 

This bug was probably already fixed.

Please try current linux tree




^ permalink raw reply

* Re: [PATCH] sunrpc: use better NUMA affinities
From: Greg Banks @ 2011-07-29 13:30 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Christoph Hellwig, NeilBrown,
	linux-nfs@vger.kernel.org, David Miller,
	linux-kernel, netdev
In-Reply-To: <1311941491.2843.7.camel@edumazet-HP-Compaq-6005-Pro-SFF-PC>



Sent from my iPhone

On 29/07/2011, at 22:11, Eric Dumazet <eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:

> On Friday, 29 July 2011 at 21:58 +1000, Greg Banks wrote:
>
>>
>> Sure, and a whole lot of the callsites are ("..._%d", cpu), hence the
>> unfortune :(
>
> BTW, we could name nfsd threads differently :
>
> Currently, they all are named : "nfsd"
>
> If SVC_POOL_PERCPU is selected, we could name them :
> nfsd_c0 -> nfsd_cN
>
> If SVC_POOL_PERNODE is selected, we could name them :
> nfsd_n0  -> nfsd_nN
>
> That would help to check with "ps aux" which cpu/nodes are under  
> stress.
>
>

I like it!

>

Greg.

^ permalink raw reply

* Re: [Bugme-new] [Bug 39372] New: Problems with HFSC Scheduler
From: Eric Dumazet @ 2011-07-29 13:27 UTC (permalink / raw)
  To: Michal Soltys
  Cc: Andrew Morton, netdev, bugme-daemon, Jamal Hadi Salim,
	lucas.bocchi, Patrick McHardy, 631945, 00bormoj, fdelawarde
In-Reply-To: <4E32A796.8060104@ziu.info>

On Friday, 29 July 2011 at 14:29 +0200, Michal Soltys wrote:
> On 11-07-15 00:14, Andrew Morton wrote:
> > 
> > (switched to email.  Please respond via emailed reply-to-all, not via
> > the bugzilla web interface).
> > 
> > 
> > Here: WARN_ON(next_time == 0);
> > 
> 
> From the other thread on netfilter-devel:
> 
> > On 11-07-22 11:58, Michal Pokrywka wrote: After bisecting 2.6.39.1 it
> > turned out that the bug is caused independently by two patches:
> > 
> > commit b262a5da755cc6ed0cb4fba230cd9bf4037e1096 sch_sfq: fix peek()
> > implementation
> > 
> > and
> > 
> > commit 9df49f2bfe862573911a080c75a6d81113c5c81d sch_sfq: avoid giving
> > spurious NET_XMIT_CN signals
> > 
> > Reverting these patches makes HFSC work again.
> > 
> 
> This one (upstream 8efa885406359af300d46910642b50ca82c0fe47) seems to be
> the culprit (does reverting only that one cure the problem?)
> 
> It allows SFQ to return success on enqueuing when the packet really
> replaced some other packet in some other flow. This confuses the outer
> qdisc (in this particular case HFSC), which thinks a new packet was
> actually added each time such a situation happens.
> 

Technically speaking, _this_ packet was successfully enqueued.

Returning NET_XMIT_CN or NET_XMIT_SUCCESS should not trigger a bug in
caller.

> This in turn causes additional dequeues and ends with attempt
> to schedule non-existent packets, and triggers the warning.
> 

Then it's probably a bug in HFSC: it doesn't understand that SFQ lost a
packet.

I'll take a look, thanks for the report.




^ permalink raw reply

* PROBLEM: BUG (NULL ptr dereference in ipv4_dst_check)
From: synapse @ 2011-07-29 13:18 UTC (permalink / raw)
  To: netdev

Hello guys,

I have a problem that I hope you can help me resolve. This is my first 
real bug report, so please be
patient :)

### Description:
3.0.0-rc4 routinely locks up with BUG: unable to handle kernel NULL 
pointer dereference at 000000000000002c
I have an intel sr2600 machine with a 10Gbit interface, it periodically 
locks up after a few days.
It serves a lot of traffic. The trace is at the end of the mail.
###

### My efforts:
I've traced the error back from atomic_dec_and_test() to:

ipv4_dst_check()
check_peer_redir()
neigh_release()
atomic_dec_and_test()

The parameter to atomic_dec_and_test() is NULL (&neigh->refcnt in 
neigh_release), so atomic_dec_and_test()
at /arch/x86/include/asm/atomic.h dies at offset 0xffffffff8140f56f.

ffffffff8140f560:       48 8b 15 19 47 2f 00    mov    
0x2f4719(%rip),%rdx        # 0xffffffff81703c80
ffffffff8140f567:       48 89 50 18             mov    %rdx,0x18(%rax)
ffffffff8140f56b:       48 8b 7b 40             mov    0x40(%rbx),%rdi
ffffffff8140f56f:       f0 ff 4f 2c             lock decl 0x2c(%rdi)
ffffffff8140f573:       0f 94 c0                sete   %al
ffffffff8140f576:       84 c0                   test   %al,%al
ffffffff8140f578:       0f 85 ab 00 00 00       jne    0xffffffff8140f629

From what I've seen, this code is responsible for PMTU-related 
things. The refcount member of struct neighbour
is NULL and the neigh pointer (struct neighbour *) in neigh_release() is 
not. I have no clue how this might happen,
though I suspect somebody releases the data structure somehow. Note that 
this code is invoked when redirect_learned.a4
is set and is different from rt_gateway in ipv4_dst_check().

Is it possible that two packets go to two different cores for processing 
and one core invalidates the rt entry
the other is currently working on (meaning the second will try to 
dereference a NULL ptr)?
###


This is just my clumsy attempt at tracking this down, I'm not a kernel 
expert unfortunately. I'm happy to provide
further info on the matter. If I'm completely on the wrong track please 
let me know.

Thank you for any help,
Gergely Kalman


TRACE:
===============================================================
BUG: unable to handle kernel NULL pointer dereference at 000000000000002c
IP: [<ffffffff8140f56f>] ipv4_dst_check+0xaf/0x190
PGD 0
Oops: 0002 [#1] SMP
CPU 8
Modules linked in: 8021q garp bridge stp llc iptable_filter ip_tables 
ixgbe ioatdma mdio dca hed

Pid: 0, comm: kworker/0:1 Not tainted 3.0.0-rc4-10g-lvs-pktgen #1 Intel 
Corporation S5520UR/S5520UR
RIP: 0010:[<ffffffff8140f56f>]  [<ffffffff8140f56f>] 
ipv4_dst_check+0xaf/0x190
RSP: 0018:ffff8801efc83a40  EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff88014d428900 RCX: ffff8801a44fa000
RDX: 0000000000000000 RSI: ffff8801a4335bc0 RDI: 0000000000000000
RBP: 00000000fea2476d R08: 000000000000fa4b R09: 0000000000007d25
R10: 00000000000000c0 R11: 0000000000000003 R12: ffff8801a4335bc0
R13: 0000000000006bc1 R14: 0000000000000000 R15: ffff88016291da20
FS:  0000000000000000(0000) GS:ffff8801efc80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 000000000000002c CR3: 0000000001697000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process kworker/0:1 (pid: 0, threadinfo ffff8801e90ee000, task 
ffff8801e90d9680)
Stack:
  ffff88014d428900 ffff88016291d780 0000000000000000 ffffffff813dccfa
  ffff88036fff9000 ffff8801b77bfc58 ffff88016291d780 ffffffff81417a82
  ffff8801a44fb0a0 ffff88016291d780 ffff8801b77bfc58 ffff8801b77bfc80
Call Trace:
<IRQ>
  [<ffffffff813dccfa>] ? __sk_dst_check+0x4a/0x70
  [<ffffffff81417a82>] ? ip_queue_xmit+0x2b2/0x3c0
  [<ffffffff8142c23b>] ? tcp_transmit_skb+0x3bb/0x850
  [<ffffffff8142e8cc>] ? tcp_write_xmit+0x1ec/0xa10
  [<ffffffff8142f239>] ? __tcp_push_pending_frames+0x19/0x80
  [<ffffffff81426076>] ? tcp_data_snd_check+0x36/0x120
  [<ffffffff8142a5d9>] ? tcp_rcv_established+0x349/0x7c0
  [<ffffffff8143204f>] ? tcp_v4_do_rcv+0x10f/0x2e0
  [<ffffffff81412300>] ? ip_rcv_finish+0x350/0x350
  [<ffffffff81433102>] ? tcp_v4_rcv+0x4e2/0x7a0
  [<ffffffff8141237d>] ? ip_local_deliver_finish+0x7d/0x130
  [<ffffffff813e802e>] ? __netif_receive_skb+0x1ae/0x350
  [<ffffffff813edc78>] ? netif_receive_skb+0x78/0x80
  [<ffffffff813ee21b>] ? napi_gro_receive+0xbb/0xd0
  [<ffffffff813edda8>] ? napi_skb_finish+0x38/0x50
  [<ffffffffa004c372>] ? ixgbe_clean_rx_irq+0x4f2/0x780 [ixgbe]
  [<ffffffffa004eddd>] ? ixgbe_clean_rxtx_many+0xed/0x1f0 [ixgbe]
  [<ffffffff8120b890>] ? timerqueue_add+0x60/0xb0
  [<ffffffff813ee366>] ? net_rx_action+0x86/0x170
  [<ffffffff8104aab1>] ? __do_softirq+0x91/0x140
  [<ffffffff8107ccfa>] ? handle_irq_event_percpu+0x7a/0x140
  [<ffffffff81474e4c>] ? call_softirq+0x1c/0x30
  [<ffffffff8100428d>] ? do_softirq+0x4d/0x80
  [<ffffffff8104a975>] ? irq_exit+0xb5/0xc0
  [<ffffffff81003aac>] ? do_IRQ+0x5c/0xd0
  [<ffffffff814737d3>] ? common_interrupt+0x13/0x13
<EOI>
  [<ffffffff81251c8c>] ? acpi_hw_read_multiple+0x28/0x60
  [<ffffffff81261afd>] ? acpi_idle_enter_bm+0x22c/0x260
  [<ffffffff81261af8>] ? acpi_idle_enter_bm+0x227/0x260
  [<ffffffff813b7281>] ? cpuidle_idle_call+0x81/0xf0
  [<ffffffff810017d8>] ? cpu_idle+0x58/0xb0
Code: 00 89 83 d4 00 00 00 eb 98 0f 1f 00 48 85 db 74 16 48 8b 43 40 31 
ff 48 85 c0 74 0f 48 8b 15 19 47 2f 00 48 89 50 18 48 8b 7b 40 <f0> ff 
4f 2c 0f 94 c0 84 c0 0f 85 ab 00 00 00 48 c7 43 40 00 00
RIP  [<ffffffff8140f56f>] ipv4_dst_check+0xaf/0x190
  RSP <ffff8801efc83a40>
CR2: 000000000000002c
---[ end trace 8a3fd44eb302579f ]---
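
One possible reading of the register dump above: RDI is 0 and CR2 is
0x2c, which is the address a NULL struct pointer plus a member offset of
0x2c would produce for "lock decl 0x2c(%rdi)". A minimal sketch of that
arithmetic follows. The layout is an assumption made up for illustration
and is not the real struct neighbour; the padding is chosen only so that
refcnt lands at offset 0x2c.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical layout, NOT the real struct neighbour: the padding exists
 * only to place refcnt at offset 0x2c, matching the faulting instruction.
 */
struct fake_neighbour {
	unsigned char other_fields[0x2c];
	int refcnt;
};

/* Address the CPU touches for &n->refcnt when n is located at base. */
static uintptr_t refcnt_address(uintptr_t base)
{
	return base + offsetof(struct fake_neighbour, refcnt);
}
```

Under that assumption, refcnt_address(0) equals 0x2c, the same value as
CR2, which would point at the neighbour pointer itself being NULL rather
than the counter inside a valid neighbour.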

^ permalink raw reply

* Re: vpnc-script fix for changed iproute output with newer kernels
From: David Woodhouse @ 2011-07-29 12:57 UTC (permalink / raw)
  To: David Miller; +Cc: jsbronder, netdev, shemminger
In-Reply-To: <20110729.054649.1274733167127164255.davem@davemloft.net>

On Fri, 2011-07-29 at 05:46 -0700, David Miller wrote:
> From: David Woodhouse <dwmw2@infradead.org>
> Date: Fri, 29 Jul 2011 13:33:11 +0100
> 
> > Any suggestions that *aren't* going to be constantly broken?
> 
> You're going to have to be knowledgable about which attributes are
> part of the route, whether you want to do this with iproute2 as a tool
> or whether you do this directly with C code using netlink.

I don't think I really want to try shipping vpnc-script with C code.

The 'opt-in' approach seems like the best one for now, then. I suppose
we want just the 'via' and 'dev' and 'src' attributes... anything else?

I'll see if I can come up with a regex that can parse that, in the
knowledge that the interface itself might actually be called "src" or
"dev" or "via".

This may make my brain hurt.

> iproute2 is never going to allow you to mirror "route get" outputs
> into a "route add" call.  Because 'get' is going to always emit
> metrics and other transient state, upon which we will always
> potentially be building new items over time.

An option to make 'ip route get' do exactly that would be massively
appreciated :)

Or an option to make 'ip route set' ignore the ones it doesn't like,
perhaps.

-- 
dwmw2


^ permalink raw reply

* Re: vpnc-script fix for changed iproute output with newer kernels
From: David Miller @ 2011-07-29 12:46 UTC (permalink / raw)
  To: dwmw2; +Cc: jsbronder, netdev, shemminger
In-Reply-To: <1311942793.17528.57.camel@i7.infradead.org>

From: David Woodhouse <dwmw2@infradead.org>
Date: Fri, 29 Jul 2011 13:33:11 +0100

> Any suggestions that *aren't* going to be constantly broken?

You're going to have to be knowledgable about which attributes are
part of the route, whether you want to do this with iproute2 as a tool
or whether you do this directly with C code using netlink.

If you want to script this using iproute2, you should be grepping for
the attributes you want to keep rather than grepping for the
attributes you end up dropping.
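
As a rough illustration of the keep-list idea, here is a small
self-contained C sketch. It assumes the text to filter looks like
"<destination> key value key value ..." and keeps only the destination
plus the "via", "dev" and "src" pairs. The function names are made up for
this example, and the sample lines used with it are illustrative rather
than captured `ip route get' output.

```c
#include <stdio.h>
#include <string.h>

/* Return the next whitespace-delimited token and its length, or NULL at end. */
static const char *next_token(const char **cursor, size_t *len)
{
	const char *p = *cursor + strspn(*cursor, " \t\n");

	if (*p == '\0')
		return NULL;
	*len = strcspn(p, " \t\n");
	*cursor = p + *len;
	return p;
}

/* Non-zero if the token is one of the attributes we opted into. */
static int is_kept(const char *tok, size_t len)
{
	static const char *const kept[] = { "via", "dev", "src" };
	size_t i;

	for (i = 0; i < sizeof(kept) / sizeof(kept[0]); i++)
		if (strlen(kept[i]) == len && memcmp(tok, kept[i], len) == 0)
			return 1;
	return 0;
}

/*
 * Copy the destination and any via/dev/src pairs from line into out.
 * Returns 0 on success, -1 on an empty line or if out is too small.
 */
static int keep_route_attrs(const char *line, char *out, size_t outlen)
{
	const char *tok, *val;
	size_t tlen, vlen, used;
	int n;

	tok = next_token(&line, &tlen);          /* first token: destination */
	if (!tok)
		return -1;
	n = snprintf(out, outlen, "%.*s", (int)tlen, tok);
	if (n < 0 || (size_t)n >= outlen)
		return -1;
	used = (size_t)n;

	while ((tok = next_token(&line, &tlen)) != NULL) {
		if (!is_kept(tok, tlen))
			continue;                /* everything else is dropped */
		/* The token after a kept keyword is its value, even if it spells "dev". */
		val = next_token(&line, &vlen);
		if (!val)
			break;
		n = snprintf(out + used, outlen - used, " %.*s %.*s",
			     (int)tlen, tok, (int)vlen, val);
		if (n < 0 || (size_t)n >= outlen - used)
			return -1;
		used += (size_t)n;
	}
	return 0;
}
```

Because a kept keyword always consumes the following token as its value,
an interface that happens to be named "src" or "dev" is handled. The
sketch would still be fooled by a dropped attribute whose value is
literally "via", "dev" or "src".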

iproute2 is never going to allow you to mirror "route get" outputs
into a "route add" call.  Because 'get' is going to always emit
metrics and other transient state, upon which we will always
potentially be building new items over time.

^ permalink raw reply

* Re: vpnc-script fix for changed iproute output with newer kernels
From: David Woodhouse @ 2011-07-29 12:33 UTC (permalink / raw)
  To: Justin Bronder; +Cc: netdev, shemminger
In-Reply-To: <20110728021853.GB3620@gmail.com>

On Thu, 2011-07-28 at 03:18 +0100, Justin Bronder wrote:
> From 0a1c10c83f2043f00793c166ad351dc643bcefe3 Mon Sep 17 00:00:00 2001
> From: Justin Bronder <jsbronder@gmail.com>
> Date: Wed, 27 Jul 2011 22:10:06 -0400
> Subject: [PATCH] fix for newer kernels
> 
> newer kernels have added expires and mtu to the ip route output
> ---
>  vpnc-script |    8 +++++++-
>  1 files changed, 7 insertions(+), 1 deletions(-)
> 
> diff --git a/vpnc-script b/vpnc-script
> index e0140c5..b071e0b 100755
> --- a/vpnc-script
> +++ b/vpnc-script
> @@ -139,7 +139,13 @@ destroy_tun_device() {
>  
>  if [ -n "$IPROUTE" ]; then
>  	fix_ip_get_output () {
> -		sed 's/cache//;s/metric \?[0-9]\+ [0-9]\+//g;s/hoplimit [0-9]\+//g;s/ipid 0x....//g'
> +		sed \
> +            -e 's/cache//' \
> +            -e ';s/metric \?[0-9]\+ [0-9]\+//g' \
> +            -e 's/hoplimit [0-9]\+//g' \
> +            -e 's/ipid 0x....//g' \
> +            -e 's/expires [0-9]\+sec//g' \
> +            -e 's/mtu [0-9]\+//g'
>  	}
>  
>  	set_vpngateway_route() {

Thanks for this, Justin. But I'd really prefer not to do it this way.

This is the second time in as many kernel releases that this has broken;
we only added 'ipid' to that regex in May. If we have to keep doing this
dance, we are doing it *wrong*.

Stephen, what is the *right* way to do this?

This is for vpnc-script, as you ought to be able to tell from the patch
header. If we're adding routes to the newly-created VPN device, we first
have to ensure that the route to the VPN server *itself* doesn't change.

So effectively we want to do: 
 ip route add $(ip route get $VPNSERVER)

... except then we have to have that awful bunch of sed crap to make it
work right. I suppose we could at least make it opt-in, and include the
'via' and 'dev' and 'src' options and remove *everything* else? But that
doesn't really fill me with joy *either*.

Any suggestions that *aren't* going to be constantly broken?

-- 
dwmw2


^ permalink raw reply

* Bug#631945: [Bugme-new] [Bug 39372] New: Problems with HFSC Scheduler
From: Michal Soltys @ 2011-07-29 12:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: netdev, bugme-daemon, Jamal Hadi Salim, lucas.bocchi,
	Patrick McHardy, 631945, 00bormoj, fdelawarde
In-Reply-To: <20110714151425.844b7738.akpm@linux-foundation.org>

On 11-07-15 00:14, Andrew Morton wrote:
> 
> (switched to email.  Please respond via emailed reply-to-all, not via
> the bugzilla web interface).
> 
> 
> Here: WARN_ON(next_time == 0);
> 

From the other thread on netfilter-devel:

> On 11-07-22 11:58, Michal Pokrywka wrote:
> After bisecting 2.6.39.1 it turned out that the bug is caused
> independently by two patches:
> 
> commit b262a5da755cc6ed0cb4fba230cd9bf4037e1096 sch_sfq: fix peek()
> implementation
> 
> and
> 
> commit 9df49f2bfe862573911a080c75a6d81113c5c81d sch_sfq: avoid giving
> spurious NET_XMIT_CN signals
> 
> Reverting these patches makes HFSC work again.
> 

This one (upstream 8efa885406359af300d46910642b50ca82c0fe47) seems to be
the culprit (does reverting only that one cure the problem?)

It allows SFQ to return success on enqueue when the packet really
replaced some other packet in some other flow. This confuses the outer
qdisc (in this particular case HFSC), which thinks a new packet was
actually added each time such a situation happens.

This in turn causes additional dequeues, ends with an attempt to
schedule non-existent packets, and triggers the warning.
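
A toy userspace model of that accounting mismatch, for illustration only.
None of this is kernel code: the struct and function names and the
XMIT_* constants are hypothetical stand-ins, and the parent here simply
counts a packet for every enqueue the child reports as successful.

```c
#include <assert.h>
#include <stdbool.h>

enum { XMIT_SUCCESS = 0, XMIT_CN = 1 };

struct child  { int qlen; int limit; };
struct parent { int qlen; struct child c; };

/*
 * Child enqueue.  When the queue is full, an old packet from another
 * flow is dropped to make room, so the child's qlen does not grow.
 * hide_replacement selects whether that case is still reported as a
 * plain success.
 */
static int child_enqueue(struct child *c, bool hide_replacement)
{
	if (c->qlen < c->limit) {
		c->qlen++;
		return XMIT_SUCCESS;
	}
	return hide_replacement ? XMIT_SUCCESS : XMIT_CN;
}

/* Parent accounts one more packet for every reported success. */
static void parent_enqueue(struct parent *p, bool hide_replacement)
{
	if (child_enqueue(&p->c, hide_replacement) == XMIT_SUCCESS)
		p->qlen++;
}

/* Packets the parent believes in that the child does not hold. */
static int phantom_packets(int limit, int arrivals, bool hide_replacement)
{
	struct parent p = { .qlen = 0, .c = { .qlen = 0, .limit = limit } };
	int i;

	for (i = 0; i < arrivals; i++)
		parent_enqueue(&p, hide_replacement);
	return p.qlen - p.c.qlen;
}

static void self_check(void)
{
	/* 10 arrivals into a 4-slot child: 6 replacements. */
	assert(phantom_packets(4, 10, true) == 6);
	assert(phantom_packets(4, 10, false) == 0);
}
```

With the replacement hidden, the parent ends up expecting packets that
were never added, which is the situation described above.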


PS: removed netfilter from cc, as it's not really a netfilter issue.

^ permalink raw reply

* Re: [PATCH] sunrpc: use better NUMA affinities
From: Eric Dumazet @ 2011-07-29 12:11 UTC (permalink / raw)
  To: Greg Banks
  Cc: Christoph Hellwig, NeilBrown, linux-nfs@vger.kernel.org,
	David Miller, linux-kernel, netdev
In-Reply-To: <5933F48C-49D6-492D-AB7B-B76A3ADDB6C6@fastmail.fm>

Le vendredi 29 juillet 2011 à 21:58 +1000, Greg Banks a écrit :

> 
> Sure, and a whole lot of the callsites are ("..._%d", cpu), hence the  
> unfortune :(

BTW, we could name nfsd threads differently:

Currently, they are all named "nfsd".

If SVC_POOL_PERCPU is selected, we could name them:
nfsd_c0 -> nfsd_cN

If SVC_POOL_PERNODE is selected, we could name them:
nfsd_n0 -> nfsd_nN

That would help to check with "ps aux" which cpus/nodes are under stress.
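
A small userspace sketch of that naming scheme, just to show the intended
result. The enum and nfsd_thread_name() are hypothetical names for
illustration; the real change would pass a format string to the kthread
creation call instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins for the pool modes discussed above. */
enum pool_mode { POOL_GLOBAL, POOL_PERCPU, POOL_PERNODE };

/*
 * Format the thread name for pool number id.  Returns what snprintf
 * returns: the length the full name needs, excluding the NUL.
 */
static int nfsd_thread_name(char *buf, size_t len, enum pool_mode mode,
			    unsigned int id)
{
	switch (mode) {
	case POOL_PERCPU:
		return snprintf(buf, len, "nfsd_c%u", id);
	case POOL_PERNODE:
		return snprintf(buf, len, "nfsd_n%u", id);
	default:
		return snprintf(buf, len, "nfsd");
	}
}

static void self_check(void)
{
	char buf[16];

	/* "nfsd_c3" is 7 characters long. */
	assert(nfsd_thread_name(buf, sizeof(buf), POOL_PERCPU, 3) == 7);
	assert(buf[5] == 'c' && buf[6] == '3' && buf[7] == '\0');
}
```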

^ permalink raw reply

* Re: [PATCH] sunrpc: use better NUMA affinities
From: Greg Banks @ 2011-07-29 11:58 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Eric Dumazet, NeilBrown, linux-nfs@vger.kernel.org, David Miller,
	linux-kernel, netdev
In-Reply-To: <20110729103634.GA12050@infradead.org>



On 29/07/2011, at 20:36, Christoph Hellwig <hch@infradead.org> wrote:

> On Fri, Jul 29, 2011 at 04:53:21PM +1000, Greg Banks wrote:
>>> Check commit 94dcf29a11b3d20a (kthread: use kthread_create_on_node())
>>> to see how this strategy already was adopted for ksoftirqd, kworker,
>>> migration, and pktgend kthreads.
>>
>> Ah, I see.  It's unfortunate that the kthread_create() API ends up
>> being passed a CPU number but that's only used to format the name
>> and not for sensible things :(
>
> kthread_create doesn't have a cpu argument - it has a printf-like
> format string.
>

Sure, and a whole lot of the callsites are ("..._%d", cpu), hence the  
unfortune :(

Greg.

^ permalink raw reply

* Re: [PATCH 0/2] pktgen: Clone skb to avoid corruption of skbs in ndo_start_xmit methods (v3)
From: Neil Horman @ 2011-07-29 11:19 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: mashirle, netdev
In-Reply-To: <20110729061642.GA9884@redhat.com>

On Fri, Jul 29, 2011 at 09:16:42AM +0300, Michael S. Tsirkin wrote:
> On Tue, Jul 26, 2011 at 12:05:36PM -0400, Neil Horman wrote:
> > Ok, after considering all your comments, Dave suggested this as an alternate
> > approach:
> > 
> > 1) We create a new priv_flag, IFF_SKB_TX_SHARED, to identify drivers capable of
> > handling shared skbs.  Default is to not set this flag
> > 
> > 2) Modify ether_setup to enable this flag, under the assumption that any driver
> > calling this function is initializing a real ethernet device and as such can
> > handle shared skbs since they don't tend to store state in the skb struct.
> > Pktgen can then query this flag when a user script attempts to issue the
> > clone_skb command and decide if it is to be allowed or not.
> > 
> > 3) Audit the network drivers calling ether_setup to identify any code doing so
> > that can't actually handle shared skbs and manually disable the new flag.  There
> > are about 10 drivers in this category.
> > 
> > Change notes:
> > v3) Fixed Eric's note in which I tested the length of the passed in string rather
> > than its converted value for being > 0
> > 
> > Thoughts/reviews appreciated.
> > Neil
> 
> It might be a good idea to disable vhost-net zerocopy for
> such devices as well: these skbs are shared with userspace.
> Shirley, what do you think?
> 
I don't think that's a problem, since (IIUC) only skbs with (tx_flags &
SKBTX_DEV_ZEROCOPY) set can do that.  The pktgen skbs originate in the kernel
and never have that flag set.
Neil

P.S. Don't call me Shirley :).  Sorry, it's not every day you get to use that
line.

> -- 
> MST
> 

^ permalink raw reply

* Re: Fw: [PATCH] sunrpc: use better NUMA affinities
From: Christoph Hellwig @ 2011-07-29 10:36 UTC (permalink / raw)
  To: Greg Banks
  Cc: Eric Dumazet, NeilBrown, linux-nfs-u79uwXL29TY76Z2rM5mHXA,
	David Miller, linux-kernel, netdev
In-Reply-To: <4E3258E1.6020000-97jfqw80gc6171pxa8y+qA@public.gmane.org>

On Fri, Jul 29, 2011 at 04:53:21PM +1000, Greg Banks wrote:
> >Check commit 94dcf29a11b3d20a (kthread: use kthread_create_on_node()) to
> >see how this strategy already was adopted for ksoftirqd, kworker,
> >migration, and pktgend kthreads.
> 
> Ah, I see.  It's unfortunate that the kthread_create() API ends up
> being passed a CPU number but that's only used to format the name
> and not for sensible things :(

kthread_create doesn't have a cpu argument - it has a printf-like format
string.

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH] net: filter: Convert the BPF VM to threaded code
From: Eric Dumazet @ 2011-07-29  9:30 UTC (permalink / raw)
  To: Rui Ueyama; +Cc: netdev
In-Reply-To: <CACKH++ZfNTB7Y8YhvQnZPEXpwmpWXzxQgnWniamDrjRWUwxaNw@mail.gmail.com>

Le vendredi 29 juillet 2011 à 01:10 -0700, Rui Ueyama a écrit :
> Convert the BPF VM to threaded code to improve performance.
> 
> The BPF VM is basically a big for loop containing a switch statement.  That is
> slow because for each instruction it checks the for loop condition and does the
> conditional branch of the switch statement.
> 
> This patch eliminates the conditional branch, by replacing it with a jump table
> using GCC's labels-as-values feature. The for loop condition check can also be
> removed, because the filter code always ends with a RET instruction.
> 

Well...


> +#define NEXT goto *jump_table[(++fentry)->code]
> +
> +	/* Dispatch the first instruction */
> +	goto *jump_table[fentry->code];

This is the killer, as this cannot be predicted by the cpu.

Do you have benchmark results to provide ?

We now have BPF JIT on x86_64 and powerpc, and possibly on MIPS and ARM
in the near future.
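
For anyone who wants to measure this in userspace first, here is a minimal
sketch of the two dispatch styles side by side. It is not sk_run_filter():
the three toy opcodes, struct insn, and both function names are
hypothetical, and the threaded variant assumes every opcode has already
been validated to be below OP_COUNT. It relies on the GCC labels-as-values
extension.

```c
#include <assert.h>
#include <stdint.h>

/* Toy instruction set, for illustration only. */
enum { OP_RET_A = 0, OP_ADD_K, OP_MUL_K, OP_COUNT };

struct insn {
	uint16_t code;
	uint32_t k;
};

/* Classic dispatch: a for loop around a switch statement. */
static uint32_t run_switch(const struct insn *pc)
{
	uint32_t A = 0;

	for (;; pc++) {
		switch (pc->code) {
		case OP_ADD_K:
			A += pc->k;
			continue;
		case OP_MUL_K:
			A *= pc->k;
			continue;
		case OP_RET_A:
			return A;
		default:
			return 0;
		}
	}
}

/* Threaded dispatch: each handler jumps straight to the next one. */
static uint32_t run_threaded(const struct insn *pc)
{
	static void *const jump_table[OP_COUNT] = {
		[OP_RET_A] = &&op_ret_a,
		[OP_ADD_K] = &&op_add_k,
		[OP_MUL_K] = &&op_mul_k,
	};
	uint32_t A = 0;

#define NEXT goto *jump_table[(++pc)->code]

	/* Dispatch the first instruction. */
	goto *jump_table[pc->code];

op_add_k:
	A += pc->k;
	NEXT;
op_mul_k:
	A *= pc->k;
	NEXT;
op_ret_a:
	return A;
#undef NEXT
}

static void self_check(void)
{
	/* (0 + 5) * 3 = 15 */
	static const struct insn prog[] = {
		{ OP_ADD_K, 5 }, { OP_MUL_K, 3 }, { OP_RET_A, 0 },
	};

	assert(run_switch(prog) == 15);
	assert(run_threaded(prog) == 15);
}
```

Both functions compute the same result; timing them over a long program
on the target cpu would show whether the indirect jumps help or hurt.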




^ permalink raw reply

* [patch] cfg80211: off by one in nl80211_trigger_scan()
From: Dan Carpenter @ 2011-07-29  8:52 UTC (permalink / raw)
  To: Johannes Berg
  Cc: John W. Linville, David S. Miller, open list:CFG80211 and NL80211,
	open list:NETWORKING [GENERAL],
	kernel-janitors-u79uwXL29TY76Z2rM5mHXA

The test is off by one so we'd read past the end of the
wiphy->bands[] array on the next line.
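
The check can be illustrated outside the kernel with a small sketch. The
enum, the array and the helper names below are hypothetical stand-ins,
not the nl80211 code: an array with NUM_BANDS entries has valid indexes
0 .. NUM_BANDS - 1, so the index equal to NUM_BANDS must be rejected too.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the band enumeration. */
enum band { BAND_2GHZ, BAND_5GHZ, NUM_BANDS };

static const char *const band_names[NUM_BANDS] = { "2.4 GHz", "5 GHz" };

/* Valid indexes are 0 .. NUM_BANDS - 1, hence "<" and not "<=". */
static bool band_index_ok(int band)
{
	return band >= 0 && band < NUM_BANDS;
}

/* Look up a band name, refusing any index outside the array. */
static const char *band_name(int band)
{
	return band_index_ok(band) ? band_names[band] : NULL;
}

static void self_check(void)
{
	assert(band_index_ok(NUM_BANDS - 1));
	assert(!band_index_ok(NUM_BANDS));	/* one past the end */
	assert(band_name(NUM_BANDS) == NULL);
}
```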

Signed-off-by: Dan Carpenter <error27-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>

diff --git a/net/wireless/nl80211.c b/net/wireless/nl80211.c
index 28d2aa1..e83e7fe 100644
--- a/net/wireless/nl80211.c
+++ b/net/wireless/nl80211.c
@@ -3464,7 +3464,7 @@ static int nl80211_trigger_scan(struct sk_buff *skb, struct genl_info *info)
 				    tmp) {
 			enum ieee80211_band band = nla_type(attr);
 
-			if (band < 0 || band > IEEE80211_NUM_BANDS) {
+			if (band < 0 || band >= IEEE80211_NUM_BANDS) {
 				err = -EINVAL;
 				goto out_free;
 			}
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* [PATCH v2] net/smsc911x: add device tree probe support
From: Shawn Guo @ 2011-07-29  8:43 UTC (permalink / raw)
  To: netdev-u79uwXL29TY76Z2rM5mHXA
  Cc: patches-QSEj5FYQhm4dnm+yROfE0A,
	devicetree-discuss-uLR06cmDAlY/bJ5BZ2RsiQ, Steve Glendinning,
	David S. Miller,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r
In-Reply-To: <1311587040-8988-1-git-send-email-shawn.guo-QSEj5FYQhm4dnm+yROfE0A@public.gmane.org>

Add device tree probe support for the smsc911x driver.

Signed-off-by: Shawn Guo <shawn.guo-QSEj5FYQhm4dnm+yROfE0A@public.gmane.org>
Cc: Grant Likely <grant.likely-s3s/WqlpOiPyB63q8FvJNQ@public.gmane.org>
Cc: Steve Glendinning <steve.glendinning-sdUf+H5yV5I@public.gmane.org>
Cc: David S. Miller <davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>
---
Changes since v1:
 * Instead of getting the irq line from the gpio number, use the irq
   domain so that platform_get_resource(IORESOURCE_IRQ) keeps working
   for dt too.
 * Use 'lan9115', the first model that smsc911x supports, in the match
   table
 * Use reg-shift and reg-io-width, which are already used in of_serial,
   for the shift and access size binding

 Documentation/devicetree/bindings/net/smsc911x.txt |   38 +++++++++
 drivers/net/smsc911x.c                             |   85 +++++++++++++++++---
 2 files changed, 112 insertions(+), 11 deletions(-)
 create mode 100644 Documentation/devicetree/bindings/net/smsc911x.txt

diff --git a/Documentation/devicetree/bindings/net/smsc911x.txt b/Documentation/devicetree/bindings/net/smsc911x.txt
new file mode 100644
index 0000000..271c454
--- /dev/null
+++ b/Documentation/devicetree/bindings/net/smsc911x.txt
@@ -0,0 +1,38 @@
+* Smart Mixed-Signal Connectivity (SMSC) LAN911x/912x Controller
+
+Required properties:
+- compatible : Should be "smsc,lan<model>", "smsc,lan9115"
+- reg : Address and length of the io space for SMSC LAN
+- interrupts : Should contain SMSC LAN interrupt line
+- interrupt-parent : Should be the phandle for the interrupt controller
+  that services interrupts for this device
+- phy-mode : String, operation mode of the PHY interface.
+  Supported values are: "mii", "gmii", "sgmii", "tbi", "rmii",
+  "rgmii", "rgmii-id", "rgmii-rxid", "rgmii-txid", "rtbi", "smii".
+
+Optional properties:
+- reg-shift : Specify the quantity to shift the register offsets by
+- reg-io-width : Specify the size (in bytes) of the IO accesses that
+  should be performed on the device.  Valid value for SMSC LAN is
+  2 or 4.  If it's omitted or invalid, the size would be 2.
+- smsc,irq-active-high : Indicates the IRQ polarity is active-high
+- smsc,irq-push-pull : Indicates the IRQ type is push-pull
+- smsc,force-internal-phy : Forces SMSC LAN controller to use
+  internal PHY
+- smsc,force-external-phy : Forces SMSC LAN controller to use
+  external PHY
+- smsc,save-mac-address : Indicates that mac address needs to be saved
+  before resetting the controller
+- local-mac-address : 6 bytes, mac address
+
+Examples:
+
+lan9220@f4000000 {
+	compatible = "smsc,lan9220", "smsc,lan9115";
+	reg = <0xf4000000 0x2000000>;
+	phy-mode = "mii";
+	interrupt-parent = <&gpio1>;
+	interrupts = <31>;
+	reg-io-width = <4>;
+	smsc,irq-push-pull;
+};
diff --git a/drivers/net/smsc911x.c b/drivers/net/smsc911x.c
index b9016a3..75c08a5 100644
--- a/drivers/net/smsc911x.c
+++ b/drivers/net/smsc911x.c
@@ -53,6 +53,10 @@
 #include <linux/phy.h>
 #include <linux/smsc911x.h>
 #include <linux/device.h>
+#include <linux/of.h>
+#include <linux/of_device.h>
+#include <linux/of_gpio.h>
+#include <linux/of_net.h>
 #include "smsc911x.h"
 
 #define SMSC_CHIPNAME		"smsc911x"
@@ -2095,8 +2099,58 @@ static const struct smsc911x_ops shifted_smsc911x_ops = {
 	.tx_writefifo = smsc911x_tx_writefifo_shift,
 };
 
+#ifdef CONFIG_OF
+static int __devinit smsc911x_probe_config_dt(
+				struct smsc911x_platform_config *config,
+				struct device_node *np)
+{
+	const char *mac;
+	u32 width = 0;
+
+	if (!np)
+		return -ENODEV;
+
+	config->phy_interface = of_get_phy_mode(np);
+
+	mac = of_get_mac_address(np);
+	if (mac)
+		memcpy(config->mac, mac, ETH_ALEN);
+
+	of_property_read_u32(np, "reg-shift", &config->shift);
+
+	of_property_read_u32(np, "reg-io-width", &width);
+	if (width == 4)
+		config->flags |= SMSC911X_USE_32BIT;
+
+	if (of_get_property(np, "smsc,irq-active-high", NULL))
+		config->irq_polarity = SMSC911X_IRQ_POLARITY_ACTIVE_HIGH;
+
+	if (of_get_property(np, "smsc,irq-push-pull", NULL))
+		config->irq_type = SMSC911X_IRQ_TYPE_PUSH_PULL;
+
+	if (of_get_property(np, "smsc,force-internal-phy", NULL))
+		config->flags |= SMSC911X_FORCE_INTERNAL_PHY;
+
+	if (of_get_property(np, "smsc,force-external-phy", NULL))
+		config->flags |= SMSC911X_FORCE_EXTERNAL_PHY;
+
+	if (of_get_property(np, "smsc,save-mac-address", NULL))
+		config->flags |= SMSC911X_SAVE_MAC_ADDRESS;
+
+	return 0;
+}
+#else
+static inline int smsc911x_probe_config_dt(
+				struct smsc911x_platform_config *config,
+				struct device_node *np)
+{
+	return -ENODEV;
+}
+#endif /* CONFIG_OF */
+
 static int __devinit smsc911x_drv_probe(struct platform_device *pdev)
 {
+	struct device_node *np = pdev->dev.of_node;
 	struct net_device *dev;
 	struct smsc911x_data *pdata;
 	struct smsc911x_platform_config *config = pdev->dev.platform_data;
@@ -2107,13 +2161,6 @@ static int __devinit smsc911x_drv_probe(struct platform_device *pdev)
 
 	pr_info("Driver version %s\n", SMSC_DRV_VERSION);
 
-	/* platform data specifies irq & dynamic bus configuration */
-	if (!pdev->dev.platform_data) {
-		pr_warn("platform_data not provided\n");
-		retval = -ENODEV;
-		goto out_0;
-	}
-
 	res = platform_get_resource_byname(pdev, IORESOURCE_MEM,
 					   "smsc911x-memory");
 	if (!res)
@@ -2152,9 +2199,6 @@ static int __devinit smsc911x_drv_probe(struct platform_device *pdev)
 	irq_flags = irq_res->flags & IRQF_TRIGGER_MASK;
 	pdata->ioaddr = ioremap_nocache(res->start, res_size);
 
-	/* copy config parameters across to pdata */
-	memcpy(&pdata->config, config, sizeof(pdata->config));
-
 	pdata->dev = dev;
 	pdata->msg_enable = ((1 << debug) - 1);
 
@@ -2164,10 +2208,22 @@ static int __devinit smsc911x_drv_probe(struct platform_device *pdev)
 		goto out_free_netdev_2;
 	}
 
+	retval = smsc911x_probe_config_dt(&pdata->config, np);
+	if (retval && config) {
+		/* copy config parameters across to pdata */
+		memcpy(&pdata->config, config, sizeof(pdata->config));
+		retval = 0;
+	}
+
+	if (retval) {
+		SMSC_WARN(pdata, probe, "Error smsc911x config not found");
+		goto out_unmap_io_3;
+	}
+
 	/* assume standard, non-shifted, access to HW registers */
 	pdata->ops = &standard_smsc911x_ops;
 	/* apply the right access if shifting is needed */
-	if (config->shift)
+	if (pdata->config.shift)
 		pdata->ops = &shifted_smsc911x_ops;
 
 	retval = smsc911x_init(dev);
@@ -2314,6 +2370,12 @@ static const struct dev_pm_ops smsc911x_pm_ops = {
 #define SMSC911X_PM_OPS NULL
 #endif
 
+static const struct of_device_id smsc911x_dt_ids[] = {
+	{ .compatible = "smsc,lan9115", },
+	{ /* sentinel */ }
+};
+MODULE_DEVICE_TABLE(of, smsc911x_dt_ids);
+
 static struct platform_driver smsc911x_driver = {
 	.probe = smsc911x_drv_probe,
 	.remove = __devexit_p(smsc911x_drv_remove),
@@ -2321,6 +2383,7 @@ static struct platform_driver smsc911x_driver = {
 		.name	= SMSC_CHIPNAME,
 		.owner	= THIS_MODULE,
 		.pm	= SMSC911X_PM_OPS,
+		.of_match_table = smsc911x_dt_ids,
 	},
 };
 
-- 
1.7.4.1

^ permalink raw reply related

* [PATCH] net: filter: Convert the BPF VM to threaded code
From: Rui Ueyama @ 2011-07-29  8:10 UTC (permalink / raw)
  To: netdev

Convert the BPF VM to threaded code to improve performance.

The BPF VM is basically a big for loop containing a switch statement.  That is
slow because for each instruction it checks the for loop condition and does the
conditional branch of the switch statement.

This patch eliminates the conditional branch, by replacing it with a jump table
using GCC's labels-as-values feature. The for loop condition check can also be
removed, because the filter code always ends with a RET instruction.

Signed-off-by: Rui Ueyama <rui314@gmail.com>
---
 include/linux/filter.h      |   60 +------
 include/linux/filter_inst.h |   57 ++++++
 net/core/filter.c           |  457 +++++++++++++++++++++----------------------
 3 files changed, 289 insertions(+), 285 deletions(-)
 create mode 100644 include/linux/filter_inst.h

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 741956f..2f72166 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -172,62 +172,10 @@ static inline void bpf_jit_free(struct sk_filter *fp)
 #endif

 enum {
-	BPF_S_RET_K = 1,
-	BPF_S_RET_A,
-	BPF_S_ALU_ADD_K,
-	BPF_S_ALU_ADD_X,
-	BPF_S_ALU_SUB_K,
-	BPF_S_ALU_SUB_X,
-	BPF_S_ALU_MUL_K,
-	BPF_S_ALU_MUL_X,
-	BPF_S_ALU_DIV_X,
-	BPF_S_ALU_AND_K,
-	BPF_S_ALU_AND_X,
-	BPF_S_ALU_OR_K,
-	BPF_S_ALU_OR_X,
-	BPF_S_ALU_LSH_K,
-	BPF_S_ALU_LSH_X,
-	BPF_S_ALU_RSH_K,
-	BPF_S_ALU_RSH_X,
-	BPF_S_ALU_NEG,
-	BPF_S_LD_W_ABS,
-	BPF_S_LD_H_ABS,
-	BPF_S_LD_B_ABS,
-	BPF_S_LD_W_LEN,
-	BPF_S_LD_W_IND,
-	BPF_S_LD_H_IND,
-	BPF_S_LD_B_IND,
-	BPF_S_LD_IMM,
-	BPF_S_LDX_W_LEN,
-	BPF_S_LDX_B_MSH,
-	BPF_S_LDX_IMM,
-	BPF_S_MISC_TAX,
-	BPF_S_MISC_TXA,
-	BPF_S_ALU_DIV_K,
-	BPF_S_LD_MEM,
-	BPF_S_LDX_MEM,
-	BPF_S_ST,
-	BPF_S_STX,
-	BPF_S_JMP_JA,
-	BPF_S_JMP_JEQ_K,
-	BPF_S_JMP_JEQ_X,
-	BPF_S_JMP_JGE_K,
-	BPF_S_JMP_JGE_X,
-	BPF_S_JMP_JGT_K,
-	BPF_S_JMP_JGT_X,
-	BPF_S_JMP_JSET_K,
-	BPF_S_JMP_JSET_X,
-	/* Ancillary data */
-	BPF_S_ANC_PROTOCOL,
-	BPF_S_ANC_PKTTYPE,
-	BPF_S_ANC_IFINDEX,
-	BPF_S_ANC_NLATTR,
-	BPF_S_ANC_NLATTR_NEST,
-	BPF_S_ANC_MARK,
-	BPF_S_ANC_QUEUE,
-	BPF_S_ANC_HATYPE,
-	BPF_S_ANC_RXHASH,
-	BPF_S_ANC_CPU,
+	BPF_S_DUMMY = 0,
+#define BPF_INST(inst) inst,
+#include "filter_inst.h"
+#undef BPF_INST
 };

 #endif /* __KERNEL__ */
diff --git a/include/linux/filter_inst.h b/include/linux/filter_inst.h
new file mode 100644
index 0000000..235a797
--- /dev/null
+++ b/include/linux/filter_inst.h
@@ -0,0 +1,57 @@
+BPF_INST(BPF_S_RET_K)
+BPF_INST(BPF_S_RET_A)
+BPF_INST(BPF_S_ALU_ADD_K)
+BPF_INST(BPF_S_ALU_ADD_X)
+BPF_INST(BPF_S_ALU_SUB_K)
+BPF_INST(BPF_S_ALU_SUB_X)
+BPF_INST(BPF_S_ALU_MUL_K)
+BPF_INST(BPF_S_ALU_MUL_X)
+BPF_INST(BPF_S_ALU_DIV_X)
+BPF_INST(BPF_S_ALU_AND_K)
+BPF_INST(BPF_S_ALU_AND_X)
+BPF_INST(BPF_S_ALU_OR_K)
+BPF_INST(BPF_S_ALU_OR_X)
+BPF_INST(BPF_S_ALU_LSH_K)
+BPF_INST(BPF_S_ALU_LSH_X)
+BPF_INST(BPF_S_ALU_RSH_K)
+BPF_INST(BPF_S_ALU_RSH_X)
+BPF_INST(BPF_S_ALU_NEG)
+BPF_INST(BPF_S_LD_W_ABS)
+BPF_INST(BPF_S_LD_H_ABS)
+BPF_INST(BPF_S_LD_B_ABS)
+BPF_INST(BPF_S_LD_W_LEN)
+BPF_INST(BPF_S_LD_W_IND)
+BPF_INST(BPF_S_LD_H_IND)
+BPF_INST(BPF_S_LD_B_IND)
+BPF_INST(BPF_S_LD_IMM)
+BPF_INST(BPF_S_LDX_W_LEN)
+BPF_INST(BPF_S_LDX_B_MSH)
+BPF_INST(BPF_S_LDX_IMM)
+BPF_INST(BPF_S_MISC_TAX)
+BPF_INST(BPF_S_MISC_TXA)
+BPF_INST(BPF_S_ALU_DIV_K)
+BPF_INST(BPF_S_LD_MEM)
+BPF_INST(BPF_S_LDX_MEM)
+BPF_INST(BPF_S_ST)
+BPF_INST(BPF_S_STX)
+BPF_INST(BPF_S_JMP_JA)
+BPF_INST(BPF_S_JMP_JEQ_K)
+BPF_INST(BPF_S_JMP_JEQ_X)
+BPF_INST(BPF_S_JMP_JGE_K)
+BPF_INST(BPF_S_JMP_JGE_X)
+BPF_INST(BPF_S_JMP_JGT_K)
+BPF_INST(BPF_S_JMP_JGT_X)
+BPF_INST(BPF_S_JMP_JSET_K)
+BPF_INST(BPF_S_JMP_JSET_X)
+
+/* Ancillary data */
+BPF_INST(BPF_S_ANC_PROTOCOL)
+BPF_INST(BPF_S_ANC_PKTTYPE)
+BPF_INST(BPF_S_ANC_IFINDEX)
+BPF_INST(BPF_S_ANC_NLATTR)
+BPF_INST(BPF_S_ANC_NLATTR_NEST)
+BPF_INST(BPF_S_ANC_MARK)
+BPF_INST(BPF_S_ANC_QUEUE)
+BPF_INST(BPF_S_ANC_HATYPE)
+BPF_INST(BPF_S_ANC_RXHASH)
+BPF_INST(BPF_S_ANC_CPU)
diff --git a/net/core/filter.c b/net/core/filter.c
index 36f975f..e0c9d2c 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -119,245 +119,244 @@ unsigned int sk_run_filter(const struct sk_buff *skb,
 	u32 tmp;
 	int k;

-	/*
-	 * Process array of filter instructions.
-	 */
-	for (;; fentry++) {
-#if defined(CONFIG_X86_32)
+	static void *jump_table[] = {
+		NULL,
+#define BPF_INST(inst) &&inst,
+#include <linux/filter_inst.h>
+#undef BPF_INST
+	};
+
 #define	K (fentry->k)
-#else
-		const u32 K = fentry->k;
-#endif
-
-		switch (fentry->code) {
-		case BPF_S_ALU_ADD_X:
-			A += X;
-			continue;
-		case BPF_S_ALU_ADD_K:
-			A += K;
-			continue;
-		case BPF_S_ALU_SUB_X:
-			A -= X;
-			continue;
-		case BPF_S_ALU_SUB_K:
-			A -= K;
-			continue;
-		case BPF_S_ALU_MUL_X:
-			A *= X;
-			continue;
-		case BPF_S_ALU_MUL_K:
-			A *= K;
-			continue;
-		case BPF_S_ALU_DIV_X:
-			if (X == 0)
-				return 0;
-			A /= X;
-			continue;
-		case BPF_S_ALU_DIV_K:
-			A = reciprocal_divide(A, K);
-			continue;
-		case BPF_S_ALU_AND_X:
-			A &= X;
-			continue;
-		case BPF_S_ALU_AND_K:
-			A &= K;
-			continue;
-		case BPF_S_ALU_OR_X:
-			A |= X;
-			continue;
-		case BPF_S_ALU_OR_K:
-			A |= K;
-			continue;
-		case BPF_S_ALU_LSH_X:
-			A <<= X;
-			continue;
-		case BPF_S_ALU_LSH_K:
-			A <<= K;
-			continue;
-		case BPF_S_ALU_RSH_X:
-			A >>= X;
-			continue;
-		case BPF_S_ALU_RSH_K:
-			A >>= K;
-			continue;
-		case BPF_S_ALU_NEG:
-			A = -A;
-			continue;
-		case BPF_S_JMP_JA:
-			fentry += K;
-			continue;
-		case BPF_S_JMP_JGT_K:
-			fentry += (A > K) ? fentry->jt : fentry->jf;
-			continue;
-		case BPF_S_JMP_JGE_K:
-			fentry += (A >= K) ? fentry->jt : fentry->jf;
-			continue;
-		case BPF_S_JMP_JEQ_K:
-			fentry += (A == K) ? fentry->jt : fentry->jf;
-			continue;
-		case BPF_S_JMP_JSET_K:
-			fentry += (A & K) ? fentry->jt : fentry->jf;
-			continue;
-		case BPF_S_JMP_JGT_X:
-			fentry += (A > X) ? fentry->jt : fentry->jf;
-			continue;
-		case BPF_S_JMP_JGE_X:
-			fentry += (A >= X) ? fentry->jt : fentry->jf;
-			continue;
-		case BPF_S_JMP_JEQ_X:
-			fentry += (A == X) ? fentry->jt : fentry->jf;
-			continue;
-		case BPF_S_JMP_JSET_X:
-			fentry += (A & X) ? fentry->jt : fentry->jf;
-			continue;
-		case BPF_S_LD_W_ABS:
-			k = K;
-load_w:
-			ptr = load_pointer(skb, k, 4, &tmp);
-			if (ptr != NULL) {
-				A = get_unaligned_be32(ptr);
-				continue;
-			}
+#define NEXT goto *jump_table[(++fentry)->code]
+
+	/* Dispatch the first instruction */
+	goto *jump_table[fentry->code];
+
+ BPF_S_ALU_ADD_X:
+	A += X;
+	NEXT;
+ BPF_S_ALU_ADD_K:
+	A += K;
+	NEXT;
+ BPF_S_ALU_SUB_X:
+	A -= X;
+	NEXT;
+ BPF_S_ALU_SUB_K:
+	A -= K;
+	NEXT;
+ BPF_S_ALU_MUL_X:
+	A *= X;
+	NEXT;
+ BPF_S_ALU_MUL_K:
+	A *= K;
+	NEXT;
+ BPF_S_ALU_DIV_X:
+	if (X == 0)
+		return 0;
+	A /= X;
+	NEXT;
+ BPF_S_ALU_DIV_K:
+	A = reciprocal_divide(A, K);
+	NEXT;
+ BPF_S_ALU_AND_X:
+	A &= X;
+	NEXT;
+ BPF_S_ALU_AND_K:
+	A &= K;
+	NEXT;
+ BPF_S_ALU_OR_X:
+	A |= X;
+	NEXT;
+ BPF_S_ALU_OR_K:
+	A |= K;
+	NEXT;
+ BPF_S_ALU_LSH_X:
+	A <<= X;
+	NEXT;
+ BPF_S_ALU_LSH_K:
+	A <<= K;
+	NEXT;
+ BPF_S_ALU_RSH_X:
+	A >>= X;
+	NEXT;
+ BPF_S_ALU_RSH_K:
+	A >>= K;
+	NEXT;
+ BPF_S_ALU_NEG:
+	A = -A;
+	NEXT;
+ BPF_S_JMP_JA:
+	fentry += K;
+	NEXT;
+ BPF_S_JMP_JGT_K:
+	fentry += (A > K) ? fentry->jt : fentry->jf;
+	NEXT;
+ BPF_S_JMP_JGE_K:
+	fentry += (A >= K) ? fentry->jt : fentry->jf;
+	NEXT;
+ BPF_S_JMP_JEQ_K:
+	fentry += (A == K) ? fentry->jt : fentry->jf;
+	NEXT;
+ BPF_S_JMP_JSET_K:
+	fentry += (A & K) ? fentry->jt : fentry->jf;
+	NEXT;
+ BPF_S_JMP_JGT_X:
+	fentry += (A > X) ? fentry->jt : fentry->jf;
+	NEXT;
+ BPF_S_JMP_JGE_X:
+	fentry += (A >= X) ? fentry->jt : fentry->jf;
+	NEXT;
+ BPF_S_JMP_JEQ_X:
+	fentry += (A == X) ? fentry->jt : fentry->jf;
+	NEXT;
+ BPF_S_JMP_JSET_X:
+	fentry += (A & X) ? fentry->jt : fentry->jf;
+	NEXT;
+ BPF_S_LD_W_ABS:
+	k = K;
+ load_w:
+	ptr = load_pointer(skb, k, 4, &tmp);
+	if (ptr != NULL) {
+		A = get_unaligned_be32(ptr);
+		NEXT;
+	}
+	return 0;
+ BPF_S_LD_H_ABS:
+	k = K;
+ load_h:
+	ptr = load_pointer(skb, k, 2, &tmp);
+	if (ptr != NULL) {
+		A = get_unaligned_be16(ptr);
+		NEXT;
+	}
+	return 0;
+ BPF_S_LD_B_ABS:
+	k = K;
+ load_b:
+	ptr = load_pointer(skb, k, 1, &tmp);
+	if (ptr != NULL) {
+		A = *(u8 *)ptr;
+		NEXT;
+	}
+	return 0;
+ BPF_S_LD_W_LEN:
+	A = skb->len;
+	NEXT;
+ BPF_S_LDX_W_LEN:
+	X = skb->len;
+	NEXT;
+ BPF_S_LD_W_IND:
+	k = X + K;
+	goto load_w;
+ BPF_S_LD_H_IND:
+	k = X + K;
+	goto load_h;
+ BPF_S_LD_B_IND:
+	k = X + K;
+	goto load_b;
+ BPF_S_LDX_B_MSH:
+	ptr = load_pointer(skb, K, 1, &tmp);
+	if (ptr != NULL) {
+		X = (*(u8 *)ptr & 0xf) << 2;
+		NEXT;
+	}
+	return 0;
+ BPF_S_LD_IMM:
+	A = K;
+	NEXT;
+ BPF_S_LDX_IMM:
+	X = K;
+	NEXT;
+ BPF_S_LD_MEM:
+	A = mem[K];
+	NEXT;
+ BPF_S_LDX_MEM:
+	X = mem[K];
+	NEXT;
+ BPF_S_MISC_TAX:
+	X = A;
+	NEXT;
+ BPF_S_MISC_TXA:
+	A = X;
+	NEXT;
+ BPF_S_RET_K:
+	return K;
+ BPF_S_RET_A:
+	return A;
+ BPF_S_ST:
+	mem[K] = A;
+	NEXT;
+ BPF_S_STX:
+	mem[K] = X;
+	NEXT;
+ BPF_S_ANC_PROTOCOL:
+	A = ntohs(skb->protocol);
+	NEXT;
+ BPF_S_ANC_PKTTYPE:
+	A = skb->pkt_type;
+	NEXT;
+ BPF_S_ANC_IFINDEX:
+	if (!skb->dev)
+		return 0;
+	A = skb->dev->ifindex;
+	NEXT;
+ BPF_S_ANC_MARK:
+	A = skb->mark;
+	NEXT;
+ BPF_S_ANC_QUEUE:
+	A = skb->queue_mapping;
+	NEXT;
+ BPF_S_ANC_HATYPE:
+	if (!skb->dev)
+		return 0;
+	A = skb->dev->type;
+	NEXT;
+ BPF_S_ANC_RXHASH:
+	A = skb->rxhash;
+	NEXT;
+ BPF_S_ANC_CPU:
+	A = raw_smp_processor_id();
+	NEXT;
+ BPF_S_ANC_NLATTR:
+	{
+		struct nlattr *nla;
+
+		if (skb_is_nonlinear(skb))
 			return 0;
-		case BPF_S_LD_H_ABS:
-			k = K;
-load_h:
-			ptr = load_pointer(skb, k, 2, &tmp);
-			if (ptr != NULL) {
-				A = get_unaligned_be16(ptr);
-				continue;
-			}
+		if (A > skb->len - sizeof(struct nlattr))
 			return 0;
-		case BPF_S_LD_B_ABS:
-			k = K;
-load_b:
-			ptr = load_pointer(skb, k, 1, &tmp);
-			if (ptr != NULL) {
-				A = *(u8 *)ptr;
-				continue;
-			}
+
+		nla = nla_find((struct nlattr *)&skb->data[A],
+			       skb->len - A, X);
+		if (nla)
+			A = (void *)nla - (void *)skb->data;
+		else
+			A = 0;
+	}
+	NEXT;
+ BPF_S_ANC_NLATTR_NEST:
+	{
+		struct nlattr *nla;
+
+		if (skb_is_nonlinear(skb))
 			return 0;
-		case BPF_S_LD_W_LEN:
-			A = skb->len;
-			continue;
-		case BPF_S_LDX_W_LEN:
-			X = skb->len;
-			continue;
-		case BPF_S_LD_W_IND:
-			k = X + K;
-			goto load_w;
-		case BPF_S_LD_H_IND:
-			k = X + K;
-			goto load_h;
-		case BPF_S_LD_B_IND:
-			k = X + K;
-			goto load_b;
-		case BPF_S_LDX_B_MSH:
-			ptr = load_pointer(skb, K, 1, &tmp);
-			if (ptr != NULL) {
-				X = (*(u8 *)ptr & 0xf) << 2;
-				continue;
-			}
+		if (A > skb->len - sizeof(struct nlattr))
 			return 0;
-		case BPF_S_LD_IMM:
-			A = K;
-			continue;
-		case BPF_S_LDX_IMM:
-			X = K;
-			continue;
-		case BPF_S_LD_MEM:
-			A = mem[K];
-			continue;
-		case BPF_S_LDX_MEM:
-			X = mem[K];
-			continue;
-		case BPF_S_MISC_TAX:
-			X = A;
-			continue;
-		case BPF_S_MISC_TXA:
-			A = X;
-			continue;
-		case BPF_S_RET_K:
-			return K;
-		case BPF_S_RET_A:
-			return A;
-		case BPF_S_ST:
-			mem[K] = A;
-			continue;
-		case BPF_S_STX:
-			mem[K] = X;
-			continue;
-		case BPF_S_ANC_PROTOCOL:
-			A = ntohs(skb->protocol);
-			continue;
-		case BPF_S_ANC_PKTTYPE:
-			A = skb->pkt_type;
-			continue;
-		case BPF_S_ANC_IFINDEX:
-			if (!skb->dev)
-				return 0;
-			A = skb->dev->ifindex;
-			continue;
-		case BPF_S_ANC_MARK:
-			A = skb->mark;
-			continue;
-		case BPF_S_ANC_QUEUE:
-			A = skb->queue_mapping;
-			continue;
-		case BPF_S_ANC_HATYPE:
-			if (!skb->dev)
-				return 0;
-			A = skb->dev->type;
-			continue;
-		case BPF_S_ANC_RXHASH:
-			A = skb->rxhash;
-			continue;
-		case BPF_S_ANC_CPU:
-			A = raw_smp_processor_id();
-			continue;
-		case BPF_S_ANC_NLATTR: {
-			struct nlattr *nla;
-
-			if (skb_is_nonlinear(skb))
-				return 0;
-			if (A > skb->len - sizeof(struct nlattr))
-				return 0;
-
-			nla = nla_find((struct nlattr *)&skb->data[A],
-				       skb->len - A, X);
-			if (nla)
-				A = (void *)nla - (void *)skb->data;
-			else
-				A = 0;
-			continue;
-		}
-		case BPF_S_ANC_NLATTR_NEST: {
-			struct nlattr *nla;
-
-			if (skb_is_nonlinear(skb))
-				return 0;
-			if (A > skb->len - sizeof(struct nlattr))
-				return 0;
-
-			nla = (struct nlattr *)&skb->data[A];
-			if (nla->nla_len > A - skb->len)
-				return 0;
-
-			nla = nla_find_nested(nla, X);
-			if (nla)
-				A = (void *)nla - (void *)skb->data;
-			else
-				A = 0;
-			continue;
-		}
-		default:
-			WARN_RATELIMIT(1, "Unknown code:%u jt:%u tf:%u k:%u\n",
-				       fentry->code, fentry->jt,
-				       fentry->jf, fentry->k);
+
+		nla = (struct nlattr *)&skb->data[A];
+		if (nla->nla_len > A - skb->len)
 			return 0;
-		}
+
+		nla = nla_find_nested(nla, X);
+		if (nla)
+			A = (void *)nla - (void *)skb->data;
+		else
+			A = 0;
 	}
+	NEXT;

+#undef K
+#undef NEXT
 	return 0;
 }
 EXPORT_SYMBOL(sk_run_filter);
-- 
1.7.3.1

^ permalink raw reply related

* Re: Kernel IPSec Questions
From: Andreas Steffen @ 2011-07-29  7:03 UTC (permalink / raw)
  To: T C; +Cc: netdev
In-Reply-To: <CAL0-=Wwb+T_VYgMnOW9UiqQ2gVe8FaJCZgcFmHMLX1Yv3tVAdQ@mail.gmail.com>

Hello Terry,

here is a repost of my email, now including the netdev list and fixing
the last URL, which was wrong.

Here is the definition of strongSwan's IPsec high level kernel interface

http://git.strongswan.org/?p=strongswan.git;a=blob;f=src/libhydra/kernel/kernel_ipsec.h;h=986e21fca1bbd109445e95d86dbf458095299573;hb=HEAD

and here is the link to the kernel-netlink plugin which implements
configuration and management of IPsec Policies and SAs via XFRM

http://git.strongswan.org/?p=strongswan.git;a=blob;f=src/libhydra/plugins/kernel_netlink/kernel_netlink_ipsec.c;h=06720a0f4bddf9fde60288f796df0eca647ae995;hb=HEAD

Our plugin of course relies on the ipsec.h, netlink.h, rtnetlink.h,
and xfrm.h Linux header files, which define the API of the XFRM Netlink
kernel interface

http://git.strongswan.org/?p=strongswan.git;a=tree;f=src/include/linux;h=a41d3e9a10954c47aff2efeb06576f323c039483;hb=HEAD

Not much documentation exists beyond the Linux header files and the
XFRM kernel source code itself.

Finally, here is a link which shows how strongSwan installs, updates,
queries and deletes IPsec Policies and SAs

http://git.strongswan.org/?p=strongswan.git;a=blob;f=src/libcharon/sa/child_sa.c;h=cda150f8736d010cf8d897071427daf8a02a337a;hb=HEAD

Just look for all "hydra->kernel_interface" function calls.

Best regards

Andreas

On 07/29/2011 07:40 AM, T C wrote:
> Hi all,
> 
> I have some questions on how IPSec logic works in the kernel.  There might be
> a difference between when XFRM was introduced and prior.  If possible,
> I'd like to know both scenarios.  If not, at least the XFRM perspective would
> be very helpful.
> 
> Specifically, I am interested in knowing how IPSec obtains the initial keys
> from the IKE exchange (and likely from XFRM) to set up the SA.   Also what happens
> during rekeying?  Does the SA have to be terminated first, or can it somehow be
> rekeyed and continue as the same SA?  I'll be using strongswan for IKE.
> 
> Function names and if possible some flow graphs would be greatly appreciated.
> 
> Thanks,
> Terry
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


-- 
======================================================================
Andreas Steffen                         andreas.steffen@strongswan.org
strongSwan - the Linux VPN Solution!                www.strongswan.org
Institute for Internet Technologies and Applications
University of Applied Sciences Rapperswil
CH-8640 Rapperswil (Switzerland)
===========================================================[ITA-HSR]==

^ permalink raw reply

