Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: TCP sequence number inference attack on Linux
From: Zhiyun Qian @ 2012-12-21 23:52 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev
In-Reply-To: <1356129948.21834.8002.camel@edumazet-glaptop>

On Fri, Dec 21, 2012 at 5:45 PM, Eric Dumazet <erdnetdev@gmail.com> wrote:
> On Fri, 2012-12-21 at 14:49 -0500, Zhiyun Qian wrote:
>
>> If I am not mistaken, line 6142 in kernel v3.7.1 corresponds to
>> tcp_rcv_state_process(). According to the comments, "This function
>> implements the receiving procedure of RFC 793 for all states except
>> ESTABLISHED and TIME_WAIT." Are you referring to a different kernel
>> version?
>
> You are not mistaken, it seems code is too permissive.
>
> We should reject a frame without ACK flag while in ESTABLISHED state.
>
> Thats explicitly stated in RFC 973.
>
> Then we should make all possible safety checks before even sending a
> frame or changing socket variables.

I completely agree!

>
> (For instance the tests done in tcp_ack() should be done before calling
> tcp_validate_incoming())
>

It seems that it is not straightforward to simply move tcp_ack()
before tcp_validate_incoming() as tcp_ack() currently assumes the tcp
sequence number is already validated and it may adjust certain states
purely depending on the ACK number. I guess the solution is to extract
all safety checks out and put at the very beginning. The rest of the
code in tcp_validate_incoming() and tcp_ack() may still need to
perform the redundant checks since if some state changes are dependent
on the sequence number or ACK number (e.g., window update).

I'm willing to help on this. Perhaps I can draft an initial patch and
you can help take a look before I submit it?

> John Dykstra in commit 96e0bf4b5193d0 (tcp: Discard segments that ack
> data not yet sent) did a step into right direction, but missed this.
>
> Current code assumes the incoming frame is mostly legitimate.
>
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index a136925..2ea4937 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -5551,7 +5551,7 @@ slow_path:
>                 return 0;
>
>  step5:
> -       if (th->ack && tcp_ack(sk, skb, FLAG_SLOWPATH) < 0)
> +       if (!th->ack || tcp_ack(sk, skb, FLAG_SLOWPATH) < 0)
>                 goto discard;
>
>         /* ts_recent update must be made after we are sure that the packet
>
>

Neat change. This should enforce the ACK flag and ACK number check for
every packet received in established state.

^ permalink raw reply

* Re: TCP sequence number inference attack on Linux
From: Eric Dumazet @ 2012-12-22  0:01 UTC (permalink / raw)
  To: Zhiyun Qian; +Cc: netdev
In-Reply-To: <CALvgte-tpvyBF608MahqLc_JFZNzg2q1FpsUY72fDDicbdsBWA@mail.gmail.com>

On Fri, 2012-12-21 at 18:52 -0500, Zhiyun Qian wrote:

> It seems that it is not straightforward to simply move tcp_ack()
> before tcp_validate_incoming() as tcp_ack() currently assumes the tcp
> sequence number is already validated and it may adjust certain states
> purely depending on the ACK number. I guess the solution is to extract
> all safety checks out and put at the very beginning. The rest of the
> code in tcp_validate_incoming() and tcp_ack() may still need to
> perform the redundant checks since if some state changes are dependent
> on the sequence number or ACK number (e.g., window update).
> 
> I'm willing to help on this. Perhaps I can draft an initial patch and
> you can help take a look before I submit it?

I began coding a serie of patches.

Part of the problem comes from a 'fast path' idea that was nice 20 years
ago but no longer a real killer these days.

So some tests are duplicated, others are not correctly done.

For example, the fast path misses the more complete tests done in
tcp_ack()

Its a big mess.

^ permalink raw reply

* Re: TCP sequence number inference attack on Linux
From: Eric Dumazet @ 2012-12-22  0:04 UTC (permalink / raw)
  To: Zhiyun Qian; +Cc: netdev
In-Reply-To: <1356134481.21834.8090.camel@edumazet-glaptop>

On Fri, 2012-12-21 at 16:01 -0800, Eric Dumazet wrote:

> I began coding a serie of patches.

My goal is to make very small patches, to ease code review, and
eventually come to a state where we have more factorized code.

^ permalink raw reply

* Re: TCP sequence number inference attack on Linux
From: Zhiyun Qian @ 2012-12-22  0:08 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev
In-Reply-To: <1356134676.21834.8096.camel@edumazet-glaptop>

I see. If you need any help, just let me know! E.g., I can definitely
help review the patches.

-Zhiyun

On Fri, Dec 21, 2012 at 7:04 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Fri, 2012-12-21 at 16:01 -0800, Eric Dumazet wrote:
>
>> I began coding a serie of patches.
>
> My goal is to make very small patches, to ease code review, and
> eventually come to a state where we have more factorized code.
>
>
>

^ permalink raw reply

* Re: [PATCH] net: sched: integer overflow fix
From: Eric Dumazet @ 2012-12-22  0:11 UTC (permalink / raw)
  To: Stefan Hasko
  Cc: Jamal Hadi Salim, David S. Miller, netdev, linux-kernel,
	Vimalkumar
In-Reply-To: <1356130308.21834.8014.camel@edumazet-glaptop>

On Fri, 2012-12-21 at 14:51 -0800, Eric Dumazet wrote:
> On Fri, 2012-12-21 at 21:39 +0100, Stefan Hasko wrote:
> > Fixed integer overflow in function htb_dequeue
> > 
> > Signed-off-by: Stefan Hasko <hasko.stevo@gmail.com>
> > ---
> >  net/sched/sch_htb.c |    2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/net/sched/sch_htb.c b/net/sched/sch_htb.c
> > index d2922c0..1bd3faa 100644
> > --- a/net/sched/sch_htb.c
> > +++ b/net/sched/sch_htb.c
> > @@ -919,7 +919,7 @@ ok:
> >  	q->now = ktime_to_ns(ktime_get());
> >  	start_at = jiffies;
> >  
> > -	next_event = q->now + 5 * NSEC_PER_SEC;
> > +	next_event = q->now + (u32)5 * NSEC_PER_SEC;
> >  
> >  	for (level = 0; level < TC_HTB_MAXDEPTH; level++) {
> >  		/* common case optimization - skip event handler quickly */
> 
> But this patch is wrong !
> 

Please resend your patch using something like

It will work much better.

diff --git a/net/sched/sch_htb.c b/net/sched/sch_htb.c
index d2922c0..51561ea 100644
--- a/net/sched/sch_htb.c
+++ b/net/sched/sch_htb.c
@@ -919,7 +919,7 @@ ok:
 	q->now = ktime_to_ns(ktime_get());
 	start_at = jiffies;
 
-	next_event = q->now + 5 * NSEC_PER_SEC;
+	next_event = q->now + 5LLU * NSEC_PER_SEC;
 
 	for (level = 0; level < TC_HTB_MAXDEPTH; level++) {
 		/* common case optimization - skip event handler quickly */

^ permalink raw reply related

* Re: [RFC] IP_MAX_MTU value
From: Alexey Kuznetsov @ 2012-12-21 23:45 UTC (permalink / raw)
  To: David Miller; +Cc: erdnetdev, netdev, rick.jones2
In-Reply-To: <20121221.115910.1536088448298730991.davem@davemloft.net>

On Fri, Dec 21, 2012 at 11:59:10AM -0800, David Miller wrote:
> From: Eric Dumazet <erdnetdev@gmail.com>
> > Changing IP_MAX_MTU from 0xFFF0 to 0x10000 seems safe [1], but I might
> > miss something really obvious ?

Maximal IP MTU is 65535 (due to iph->len).

> > It might be because in old days we reserved 16 bytes for the ethernet
> > header, and we wanted to avoid kmalloc() round-up to kmalloc-131072
> > slab ?
> > 
> > If so, we certainly can limit skb->head to 32 or 64 KB and complete with
> > page fragments the remaining space.
> 
> I don't think it has to do with kmalloc() at all.
> 
> Maybe something strange to do with the fact that each frag has to
> be an 8-byte multiple, or something like that?
> 
> Alexey choose this value back in 1998, maybe he remembers the
> reason for the strange value.

Vaguely. I do not think it was something clever.

Apparently, this was done while implementing IPv6 jumbo frames.
dev->mtu was allowed to be arbitrarily high and for use with IPv4 it had
to be clamped at some reasonable value. This value should be 65535, of course.

Here was a problem: apparently, when experimenting with jumbo frames I used
virtual device (loopback?) setting some crazy values for mtu there. And it looked
dubious that nicely aligned device mtu sort of 256K converted to perverted odd
value 65535 for IP. This value would be translated to odd value for mss and so on.
Nothing fatal, but forcing tcp segmentation to an odd value looked too weird.

So, I just decided to apply some bound which looked sane, safe and aligned. :-)
Note that values of mtu > 65280 (or something like that, which was used by some funny hippi device)
were purely of theoritical value, they were used only to experiment with jumbo frames or for stress
testing, so actual value did not have any meaning.

^ permalink raw reply

* [PATCH] net: sched: integer overflow fix
From: Stefan Hasko @ 2012-12-22  1:04 UTC (permalink / raw)
  To: Jamal Hadi Salim, David S. Miller, netdev; +Cc: linux-kernel, Stefan Hasko

Sorry, I did not realize different sizes casting problem, now it's clear to me. Thanks for help.

Fixed integer overflow in function htb_dequeue

Signed-off-by: Stefan Hasko <hasko.stevo@gmail.com>
---
 net/sched/sch_htb.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/sched/sch_htb.c b/net/sched/sch_htb.c
index d2922c0..51561ea 100644
--- a/net/sched/sch_htb.c
+++ b/net/sched/sch_htb.c
@@ -919,7 +919,7 @@ ok:
 	q->now = ktime_to_ns(ktime_get());
 	start_at = jiffies;
 
-	next_event = q->now + 5 * NSEC_PER_SEC;
+	next_event = q->now + 5LLU * NSEC_PER_SEC;
 
 	for (level = 0; level < TC_HTB_MAXDEPTH; level++) {
 		/* common case optimization - skip event handler quickly */
-- 
1.7.10.4

^ permalink raw reply related

* Re: [PATCH] net: sched: integer overflow fix
From: Eric Dumazet @ 2012-12-22  1:16 UTC (permalink / raw)
  To: Stefan Hasko; +Cc: Jamal Hadi Salim, David S. Miller, netdev, linux-kernel
In-Reply-To: <1356138300-16076-1-git-send-email-hasko.stevo@gmail.com>

On Sat, 2012-12-22 at 02:04 +0100, Stefan Hasko wrote:
> Sorry, I did not realize different sizes casting problem, now it's clear to me. Thanks for help.
> 
> Fixed integer overflow in function htb_dequeue
> 
> Signed-off-by: Stefan Hasko <hasko.stevo@gmail.com>
> ---
>  net/sched/sch_htb.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/sched/sch_htb.c b/net/sched/sch_htb.c
> index d2922c0..51561ea 100644
> --- a/net/sched/sch_htb.c
> +++ b/net/sched/sch_htb.c
> @@ -919,7 +919,7 @@ ok:
>  	q->now = ktime_to_ns(ktime_get());
>  	start_at = jiffies;
>  
> -	next_event = q->now + 5 * NSEC_PER_SEC;
> +	next_event = q->now + 5LLU * NSEC_PER_SEC;
>  
>  	for (level = 0; level < TC_HTB_MAXDEPTH; level++) {
>  		/* common case optimization - skip event handler quickly */

I guess David will remove the first line of your changelog

Acked-by: Eric Dumazet <edumazet@google.com>

Thanks

^ permalink raw reply

* [PATCH 03/25] sja1000: don't use [delayed_]work_pending()
From: Tejun Heo @ 2012-12-22  1:56 UTC (permalink / raw)
  To: linux-kernel; +Cc: Tejun Heo, Wolfgang Grandegger, David S. Miller, netdev
In-Reply-To: <1356141435-17340-1-git-send-email-tj@kernel.org>

There's no need to test whether a (delayed) work item in pending
before queueing, flushing or cancelling it.  Most uses are unnecessary
and quite a few of them are buggy.

Remove unnecessary pending tests from sja1000.  Only compile tested.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Wolfgang Grandegger <wg@grandegger.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: netdev@vger.kernel.org
---
Please let me know how this patch should be routed.  I can take it
through the workqueue tree if necessary.

Thanks.

 drivers/net/can/sja1000/peak_pci.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/net/can/sja1000/peak_pci.c b/drivers/net/can/sja1000/peak_pci.c
index d84888f..600ac72 100644
--- a/drivers/net/can/sja1000/peak_pci.c
+++ b/drivers/net/can/sja1000/peak_pci.c
@@ -339,8 +339,7 @@ static void peak_pciec_set_leds(struct peak_pciec_card *card, u8 led_mask, u8 s)
  */
 static void peak_pciec_start_led_work(struct peak_pciec_card *card)
 {
-	if (!delayed_work_pending(&card->led_work))
-		schedule_delayed_work(&card->led_work, HZ);
+	schedule_delayed_work(&card->led_work, HZ);
 }
 
 /*
-- 
1.8.0.2

^ permalink raw reply related

* Re: TCP sequence number inference attack on Linux
From: Hannes Frederic Sowa @ 2012-12-22  2:13 UTC (permalink / raw)
  To: Zhiyun Qian; +Cc: Eric Dumazet, netdev
In-Reply-To: <CALvgte-=v=bRnQ2HPDxyqxBOjD6DXp_BUySW6Hm7h+81M+RYMA@mail.gmail.com>

On Fri, Dec 21, 2012 at 02:13:12PM -0500, Zhiyun Qian wrote:
> That seems like a good idea. I am not sure how it is implemented
> though. Is it a new feature of Linux? Would you mind sending some
> pointers for this?

You can check if network namespaces are in use by looking at about:sandbox in
chrome. It should be enabled by default (don't know about chromium, though).

^ permalink raw reply

* Re: IPv6 over Firewire
From: Stephan Gatzka @ 2012-12-22  6:03 UTC (permalink / raw)
  To: Stefan Richter; +Cc: YOSHIFUJI Hideaki, netdev, linux1394-devel
In-Reply-To: <20121222001230.3fe92bef@stein>


>
> Can't we increase dev->addr_len for RFC 3146 interfaces?
> Can't we add another dev->type besides ARPHRD_IEEE1394 (RFC 2734)?
>
> Is a single dev instance transporting both IPv4 and IPv6 or will there be
> separate instances for those?
>

Right now there is a single dev instance for both IPv4 and IPv6. Two 
instances means there will be two separate device, doesn't it?

Stephan

------------------------------------------------------------------------------
LogMeIn Rescue: Anywhere, Anytime Remote support for IT. Free Trial
Remotely access PCs and mobile devices and provide instant support
Improve your efficiency, and focus on delivering more value-add services
Discover what IT Professionals Know. Rescue delivers
http://p.sf.net/sfu/logmein_12329d2d

^ permalink raw reply

* Re: IPv6 over Firewire
From: Stephan Gatzka @ 2012-12-22  6:10 UTC (permalink / raw)
  To: YOSHIFUJI Hideaki; +Cc: netdev, linux1394-devel
In-Reply-To: <50D4BD2F.7060006@linux-ipv6.org>

> Something like this:
>
>   static inline int ndisc_opt_addr_space(struct net_device *dev)
>   {
> -       return NDISC_OPT_SPACE(dev->addr_len + ndisc_addr_option_pad(dev->type));
> +       switch (dev->type) {
> +       case ARPHRD_IEEE1394:
> +               return sizeof(struct ndisc_opt_ieee1394_llinfo);
> +       default:
> +               return NDISC_OPT_SPACE(dev->addr_len + ndisc_addr_option_pad(dev->type));
> +       }
>   }
>
> --yoshfuji
>

O.k., this has the advantage that only ndisc packets get some more 
memory, but the question is if we are under such a hard memory pressure 
that we don't allow that.

Your solution has the disadvantage that now I have to publish struct 
ndisc_opt_ieee1394_llinfo to the ndisc stuff. Nobody in ndisc.c really 
wants to deal with that structure, only the size is of interest. So 
keeping this struct private is less invasive to the rest of linux. Just 
my two cents.

Regards,

Stephan

^ permalink raw reply

* Re: [PATCH 03/25] sja1000: don't use [delayed_]work_pending()
From: David Miller @ 2012-12-22  8:01 UTC (permalink / raw)
  To: tj; +Cc: linux-kernel, wg, netdev
In-Reply-To: <1356141435-17340-4-git-send-email-tj@kernel.org>

From: Tejun Heo <tj@kernel.org>
Date: Fri, 21 Dec 2012 17:56:53 -0800

> There's no need to test whether a (delayed) work item in pending
> before queueing, flushing or cancelling it.  Most uses are unnecessary
> and quite a few of them are buggy.
> 
> Remove unnecessary pending tests from sja1000.  Only compile tested.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: Wolfgang Grandegger <wg@grandegger.com>
> Cc: "David S. Miller" <davem@davemloft.net>
> Cc: netdev@vger.kernel.org
> ---
> Please let me know how this patch should be routed.  I can take it
> through the workqueue tree if necessary.

I would suggest just taking it via the workqueue tree, thanks Tejun.

^ permalink raw reply

* Re: [PATCH] net: sched: integer overflow fix
From: David Miller @ 2012-12-22  8:03 UTC (permalink / raw)
  To: eric.dumazet; +Cc: hasko.stevo, jhs, netdev, linux-kernel
In-Reply-To: <1356138997.21834.8185.camel@edumazet-glaptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 21 Dec 2012 17:16:37 -0800

> On Sat, 2012-12-22 at 02:04 +0100, Stefan Hasko wrote:
>> Sorry, I did not realize different sizes casting problem, now it's clear to me. Thanks for help.
>> 
>> Fixed integer overflow in function htb_dequeue
>> 
>> Signed-off-by: Stefan Hasko <hasko.stevo@gmail.com>
>> ---
>>  net/sched/sch_htb.c |    2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>> 
>> diff --git a/net/sched/sch_htb.c b/net/sched/sch_htb.c
>> index d2922c0..51561ea 100644
>> --- a/net/sched/sch_htb.c
>> +++ b/net/sched/sch_htb.c
>> @@ -919,7 +919,7 @@ ok:
>>  	q->now = ktime_to_ns(ktime_get());
>>  	start_at = jiffies;
>>  
>> -	next_event = q->now + 5 * NSEC_PER_SEC;
>> +	next_event = q->now + 5LLU * NSEC_PER_SEC;
>>  
>>  	for (level = 0; level < TC_HTB_MAXDEPTH; level++) {
>>  		/* common case optimization - skip event handler quickly */
> 
> I guess David will remove the first line of your changelog
> 
> Acked-by: Eric Dumazet <edumazet@google.com>

Applied.

^ permalink raw reply

* Re: [PATCH] CONFIG_HOTPLUG removal from networking core
From: David Miller @ 2012-12-22  8:03 UTC (permalink / raw)
  To: gregkh; +Cc: netdev, wfp5p
In-Reply-To: <20121221234429.GB13447@kroah.com>

From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Date: Fri, 21 Dec 2012 15:44:29 -0800

> CONFIG_HOTPLUG is always enabled now, so remove the unused code that was
> trying to be compiled out when this option was disabled, in the
> networking core.
> 
> Cc: Bill Pemberton <wfp5p@virginia.edu>
> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

Applied.

^ permalink raw reply

* Re: [PATCH] Drivers: network: more __dev* removal
From: David Miller @ 2012-12-22  8:03 UTC (permalink / raw)
  To: gregkh; +Cc: netdev, wfp5p
In-Reply-To: <20121221234215.GA13447@kroah.com>

From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Date: Fri, 21 Dec 2012 15:42:15 -0800

> Remove some __dev* markings that snuck in the 3.8-rc1 merge window in
> the drivers/net/* directory.
> 
> Cc: Bill Pemberton <wfp5p@virginia.edu>
> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

Applied.

^ permalink raw reply

* Re: IPv6 over Firewire
From: Stefan Richter @ 2012-12-22  9:15 UTC (permalink / raw)
  To: stephan.gatzka; +Cc: YOSHIFUJI Hideaki, netdev, linux1394-devel
In-Reply-To: <50D54ED9.6090908@gmail.com>

On Dec 22 Stephan Gatzka wrote:
> 
> > Something like this:
> >
> >   static inline int ndisc_opt_addr_space(struct net_device *dev)
> >   {
> > -       return NDISC_OPT_SPACE(dev->addr_len + ndisc_addr_option_pad(dev->type));
> > +       switch (dev->type) {
> > +       case ARPHRD_IEEE1394:
> > +               return sizeof(struct ndisc_opt_ieee1394_llinfo);
> > +       default:
> > +               return NDISC_OPT_SPACE(dev->addr_len + ndisc_addr_option_pad(dev->type));
> > +       }
> >   }
> >
> > --yoshfuji
> >
> 
> O.k., this has the advantage that only ndisc packets get some more 
> memory, but the question is if we are under such a hard memory pressure 
> that we don't allow that.
> 
> Your solution has the disadvantage that now I have to publish struct 
> ndisc_opt_ieee1394_llinfo to the ndisc stuff. Nobody in ndisc.c really 
> wants to deal with that structure, only the size is of interest. So 
> keeping this struct private is less invasive to the rest of linux. Just 
> my two cents.

You could add another case to include/net/ndisc.h::ndisc_addr_option_pad()
with a hardcoded size, couldn't you?
-- 
Stefan Richter
-=====-===-- ==-- =-==-
http://arcgraph.de/sr/

^ permalink raw reply

* Re: [PATCH] pkt_sched: act_xt support new Xtables interface
From: Jamal Hadi Salim @ 2012-12-22 13:19 UTC (permalink / raw)
  To: Yury Stankevich
  Cc: Hasan Chowdhury, Stephen Hemminger, Jan Engelhardt,
	netdev@vger.kernel.org, pablo, netfilter-devel
In-Reply-To: <50D46EC1.2040608@gmail.com>

On 12-12-21 09:14 AM, Yury Stankevich wrote:
>
> well.
> let me describe whole picture i want to achieve
>

I think i got what you are trying to do Yury. Clever.
 From the description Jan provided in his response, I dont
think this used to work at all. Are you saying it worked before?

Having said that, what you are doing sounds so useful
that we need to make it work ;-> But it appears like
we need a brand new action for it, something like
GetMarkFromConntrack. Jan, I am assuming (on ingress only)
we need to call "something" to give us the nfct then
grab the skb->mark from nfct. On egress,
I am assuming the skb->mark is already set if connmark
is to be used... Am i correct?
If yes, then this action will only be useful at ingress.

cheers,
jamal

^ permalink raw reply

* Re: [PATCH] pkt_sched: act_xt support new Xtables interface
From: Jamal Hadi Salim @ 2012-12-22 13:42 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jan Engelhardt, Yury Stankevich, Hasan Chowdhury,
	Stephen Hemminger, netdev@vger.kernel.org, pablo, netfilter-devel
In-Reply-To: <1356104720.21834.7497.camel@edumazet-glaptop>

On 12-12-21 10:45 AM, Eric Dumazet wrote:
> On Fri, 2012-12-21 at 15:35 +0100, Jan Engelhardt wrote:
>

> This reminds me this might be the reason we have
> skb_reset_transport_header(skb);
> in __netif_receive_skb(), while its not very logical.
>

You seem to have nailed the egress part finally. That has
been a constant battle. At one point the standard answer
was "turn off TSO" ;->

> (Yes, sorry for being off topic, but I am referring to
> http://www.spinics.net/lists/netdev/msg214662.html )


I think the skb_reset_transport_header() when Acme made
a major overhaul to replace direct pointer access.
For this reason i think your second option seems preferable.

cheers,
jamal


^ permalink raw reply

* Re: [PATCH] pkt_sched: act_xt support new Xtables interface
From: Jan Engelhardt @ 2012-12-22 13:43 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Yury Stankevich, Hasan Chowdhury, Stephen Hemminger,
	netdev@vger.kernel.org, pablo, netfilter-devel
In-Reply-To: <50D5B366.30005@mojatatu.com>

On Saturday 2012-12-22 14:19, Jamal Hadi Salim wrote:
>
> Having said that, what you are doing sounds so useful
> that we need to make it work ;-> But it appears like
> we need a brand new action for it, something like
> GetMarkFromConntrack. Jan, I am assuming (on ingress only)
> we need to call "something" to give us the nfct then
> grab the skb->mark from nfct.

Looking up CT before ingress would mean the entire "raw"
table needs to be moved before ingress. But with classic
ip_tables, calling a table requires a lot of setup
(basically ip_rcv).

> On egress,
> I am assuming the skb->mark is already set if connmark
> is to be used... Am i correct?

All new skbs (i.e. those that did not loop due to IPsec, for example)
received through __netif_receive_skb should start out with
skb->mark=0, which is why CONNMARK --restore-mark is needed
to copy skb->mark=ct->mark.

^ permalink raw reply

* Re: [PATCH] pkt_sched: act_xt support new Xtables interface
From: Jamal Hadi Salim @ 2012-12-22 13:56 UTC (permalink / raw)
  To: Jan Engelhardt
  Cc: Yury Stankevich, Hasan Chowdhury, Stephen Hemminger,
	netdev@vger.kernel.org, pablo, netfilter-devel
In-Reply-To: <alpine.LNX.2.01.1212221430280.9993@nerf07.vanv.qr>

On 12-12-22 08:43 AM, Jan Engelhardt wrote:

>
> Looking up CT before ingress would mean the entire "raw"
> table needs to be moved before ingress. But with classic
> ip_tables, calling a table requires a lot of setup
> (basically ip_rcv).

Scanning the code:
Would it not work if i only passed it IP packets (the tc
classifier can check) and then for v4 i do something like
ipv4_conntrack_in() with pre-routing as the hook to update
the skb?

> All new skbs (i.e. those that did not loop due to IPsec, for example)
> received through __netif_receive_skb should start out with
> skb->mark=0, which is why CONNMARK --restore-mark is needed
> to copy skb->mark=ct->mark.

I  may be overthinking this: are you saying connmark should do the
copying to skb->mark instead of some action? Earlier you said
conmark depends on presence of skb->nfct.

cheers,
jamal

^ permalink raw reply

* Re: [PATCH] pkt_sched: act_xt support new Xtables interface
From: Yury Stankevich @ 2012-12-22 13:58 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Hasan Chowdhury, Stephen Hemminger, Jan Engelhardt,
	netdev@vger.kernel.org, pablo, netfilter-devel
In-Reply-To: <50D5B366.30005@mojatatu.com>

22.12.2012 17:19, Jamal Hadi Salim пишет:
> From the description Jan provided in his response, I dont
> think this used to work at all. Are you saying it worked before?

no.
i'm trying if this can work, alas. it can't.

> Having said that, what you are doing sounds so useful
> that we need to make it work ;-> But it appears like
> we need a brand new action for it, something like
> GetMarkFromConntrack. 

maybe ifb device can be made more friendly to iptables ?
for a sample, run some (or all?) nefilter hooks before qdisc, like on a
normal interface ?


-- 
Linux registered user #402966 // pub 1024D/E99AF373 <pgp.mit.edu>
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH] pkt_sched: act_xt support new Xtables interface
From: Florian Westphal @ 2012-12-22 14:04 UTC (permalink / raw)
  To: Yury Stankevich
  Cc: Jamal Hadi Salim, Hasan Chowdhury, Stephen Hemminger,
	Jan Engelhardt, netdev@vger.kernel.org, pablo, netfilter-devel
In-Reply-To: <50D5BC96.9010602@gmail.com>

Yury Stankevich <urykhy@gmail.com> wrote:
> 22.12.2012 17:19, Jamal Hadi Salim пишет:
> > From the description Jan provided in his response, I dont
> > think this used to work at all. Are you saying it worked before?
> 
> no.
> i'm trying if this can work, alas. it can't.

Yury, what are you trying to accomplish?
Is there a particular reason why you want to use ingress shaping
instead of pure policing?

I ask, because you could try to use hashlimit match to do
rate policing via netfilter.
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH] pkt_sched: act_xt support new Xtables interface
From: Jamal Hadi Salim @ 2012-12-22 14:09 UTC (permalink / raw)
  To: Yury Stankevich
  Cc: Hasan Chowdhury, Stephen Hemminger, Jan Engelhardt,
	netdev@vger.kernel.org, pablo, netfilter-devel
In-Reply-To: <50D5BC96.9010602@gmail.com>

On 12-12-22 08:58 AM, Yury Stankevich wrote:

> i'm trying if this can work, alas. it can't.

Now i want it to work ;-> So dont give up yet.


cheers,
jamal


^ permalink raw reply

* Re: Regression in 3.8-rc1:  "BUG: sleeping function called from invalid context"
From: Borislav Petkov @ 2012-12-22 18:02 UTC (permalink / raw)
  To: Larry Finger; +Cc: LKML, Christoph Lameter, Pekka Enberg, netdev
In-Reply-To: <50D5E9D9.3070904@lwfinger.net>

Top-posting so that the rest can remain untouched.

Right, so AFAICT, something is holding rtnl_mutex (probably some
rtnetlink traffic) and device_rename() is doing kstrdup with
GFP_KERNEL which, among others, has __GFP_WAIT and *that* triggers the
might_sleep_if() check in slab_pre_alloc_hook():

static inline int slab_pre_alloc_hook(struct kmem_cache *s, gfp_t flags)
{
        flags &= gfp_allowed_mask;
        lockdep_trace_alloc(flags);
        might_sleep_if(flags & __GFP_WAIT);		<--- HERE

Adding Christoph and Pekka although the slub.c might_sleep stuff is from
2010. Still, they might have a better idea.

Oh well, let's add netdev while we're at it. :-)

HTH.

On Sat, Dec 22, 2012 at 11:11:53AM -0600, Larry Finger wrote:
> With kernel 3.8-rc1, I get 2 "BUG: sleeping function called from
> invalid context" reports. These have been present got some time in
> the 3.7-git versions and I have tried twice to bisect the problem.
> Both times, I ended up at a merge commit. The most recent found
> commit a11da7d as the bad one, and commit d7460f4 as the last good
> one. I have not had time to make a third try.
> 
> My system is x86_64 running on an HP laptop with a dual-core AMD
> CPU. My configuration file is attached.
> 
> The logged details are as follows:
> 
> [   31.715016] BUG: sleeping function called from invalid context at mm/slub.c:925
> [   31.715022] in_atomic(): 1, irqs_disabled(): 0, pid: 2129, name: udevd
> [   31.715025] 2 locks held by udevd/2129:
> [   31.715028]  #0:  (rtnl_mutex){+.+.+.}, at: [<ffffffff81386382>]
> rtnl_lock+0x12/0x20
> [   31.715041]  #1:  (devnet_rename_seq){+.+.+.}, at:
> [<ffffffff81376033>] dev_change_name+0x43/0x260
> [   31.715053] Pid: 2129, comm: udevd Not tainted 3.8.0-rc1 #56
> [   31.715056] Call Trace:
> [   31.715063]  [<ffffffff81076ca2>] __might_sleep+0x152/0x250
> [   31.715068]  [<ffffffff8114a353>] __kmalloc_track_caller+0x103/0x280
> [   31.715073]  [<ffffffff812d595d>] ? device_rename+0x4d/0xf0
> [   31.715078]  [<ffffffff8111d675>] kstrdup+0x35/0x70
> [   31.715082]  [<ffffffff812d595d>] device_rename+0x4d/0xf0
> [   31.715086]  [<ffffffff813760ca>] dev_change_name+0xda/0x260
> [   31.715091]  [<ffffffff81377f51>] dev_ifsioc+0x241/0x3a0
> [   31.715095]  [<ffffffff81378410>] dev_ioctl+0x360/0x830
> [   31.715101]  [<ffffffff810a54cd>] ? trace_hardirqs_on+0xd/0x10
> [   31.715106]  [<ffffffff8135b711>] sock_do_ioctl.constprop.41+0x41/0x50
> [   31.715109]  [<ffffffff8135b9c6>] sock_ioctl+0x66/0x2b0
> [   31.715115]  [<ffffffff811637b7>] do_vfs_ioctl+0x97/0x580
> [   31.715119]  [<ffffffff8116f27a>] ? fget_light+0x3da/0x4d0
> [   31.715124]  [<ffffffff81422a55>] ? sysret_check+0x22/0x5d
> [   31.715128]  [<ffffffff81163ceb>] sys_ioctl+0x4b/0x90
> [   31.715133]  [<ffffffff8121a69e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
> [   31.715136]  [<ffffffff81422a29>] system_call_fastpath+0x16/0x1b
> [   31.715764] BUG: scheduling while atomic: udevd/2129/0x00000002
> [   31.715768] 2 locks held by udevd/2129:
> [   31.715769]  #0:  (rtnl_mutex){+.+.+.}, at: [<ffffffff81386382>]
> rtnl_lock+0x12/0x20
> [   31.715779]  #1:  (devnet_rename_seq){+.+.+.}, at:
> [<ffffffff81376033>] dev_change_name+0x43/0x260
> [   31.715787] Modules linked in: b43 arc4 rtl8723ae rtlwifi
> mac80211 snd_hda_codec_conexant snd_hda_intel snd_hda_codec cfg80211
> powernow_k8 kvm_amd snd_pcm_oss kvm snd_pcm snd_seq bcma rng_core
> ssb snd_timer mmc_core snd_seq_device pcmcia rfkill snd sr_mod cdrom
> soundcore ehci_pci battery pcmcia_core sg k8temp ac i2c_nforce2
> hwmon forcedeth video snd_page_alloc serio_raw i2c_core joydev wmi
> button ipv6 autofs4 ext4 mbcache jbd2 crc16 ohci_hcd ehci_hcd
> usbcore usb_common thermal processor scsi_dh_rdac scsi_dh_alua
> scsi_dh_emc scsi_dh_hp_sw scsi_dh ata_generic pata_amd
> [   31.715867] Pid: 2129, comm: udevd Not tainted 3.8.0-rc1 #56
> [   31.715868] Call Trace:
> [   31.715874]  [<ffffffff8141987c>] __schedule_bug+0x62/0x70
> [   31.715878]  [<ffffffff81420160>] __schedule+0x730/0xa30
> [   31.715883]  [<ffffffff810a3982>] ? __lock_acquire+0x12d2/0x1c50
> [   31.715888]  [<ffffffff81420744>] schedule+0x24/0x70
> [   31.715893]  [<ffffffff8141dc5c>] schedule_timeout+0x18c/0x2f0
> [   31.715897]  [<ffffffff810a52ac>] ? mark_held_locks+0x8c/0x110
> [   31.715902]  [<ffffffff81421b9b>] ? _raw_spin_unlock_irq+0x2b/0x50
> [   31.715906]  [<ffffffff810a5435>] ? trace_hardirqs_on_caller+0x105/0x190
> [   31.715911]  [<ffffffff810a54cd>] ? trace_hardirqs_on+0xd/0x10
> [   31.715915]  [<ffffffff814205d5>] wait_for_common+0xe5/0x180
> [   31.715919]  [<ffffffff8107d010>] ? try_to_wake_up+0x2d0/0x2d0
> [   31.715924]  [<ffffffff81420718>] wait_for_completion+0x18/0x20
> [   31.715929]  [<ffffffff8105f48c>] call_usermodehelper_exec+0x19c/0x1d0
> [   31.715935]  [<ffffffff8105f592>] call_usermodehelper_fns+0xd2/0x100
> [   31.715941]  [<ffffffff8121211d>] kobject_uevent_env+0x47d/0x4b0
> [   31.715946]  [<ffffffff812110df>] kobject_rename+0x12f/0x140
> [   31.715951]  [<ffffffff812d59db>] device_rename+0xcb/0xf0
> [   31.715955]  [<ffffffff813760ca>] dev_change_name+0xda/0x260
> [   31.715960]  [<ffffffff81377f51>] dev_ifsioc+0x241/0x3a0
> [   31.715965]  [<ffffffff81378410>] dev_ioctl+0x360/0x830
> [   31.715969]  [<ffffffff810a54cd>] ? trace_hardirqs_on+0xd/0x10
> [   31.715974]  [<ffffffff8135b711>] sock_do_ioctl.constprop.41+0x41/0x50
> [   31.715978]  [<ffffffff8135b9c6>] sock_ioctl+0x66/0x2b0
> [   31.715982]  [<ffffffff811637b7>] do_vfs_ioctl+0x97/0x580
> [   31.715987]  [<ffffffff8116f27a>] ? fget_light+0x3da/0x4d0
> [   31.715991]  [<ffffffff81422a55>] ? sysret_check+0x22/0x5d
> [   31.716092]  [<ffffffff81163ceb>] sys_ioctl+0x4b/0x90
> [   31.716097]  [<ffffffff8121a69e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
> [   31.716102]  [<ffffffff81422a29>] system_call_fastpath+0x16/0x1b



-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox