Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [Bugme-new] [Bug 33842] New: NULL pointer dereference in ip_fragment
From: Eric Dumazet @ 2011-04-26 20:53 UTC (permalink / raw)
  To: David Miller; +Cc: bandan.das, netdev, akpm, tom
In-Reply-To: <20110426.134637.48491363.davem@davemloft.net>

Le mardi 26 avril 2011 à 13:46 -0700, David Miller a écrit :

> This patch is mangled by your email client, tab characters
> have been turned into spaces, so it won't be usable by anyone.

Thats strange, I thought it was already in linux-2.6 anyway ?




^ permalink raw reply

* Re: [Bugme-new] [Bug 33842] New: NULL pointer dereference in ip_fragment
From: Bandan Das @ 2011-04-26 20:59 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, bandan.das, netdev, akpm, tom
In-Reply-To: <1303851185.2699.7.camel@edumazet-laptop>

On  0, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Le mardi 26 avril 2011 à 13:46 -0700, David Miller a écrit :
> 
> > This patch is mangled by your email client, tab characters
> > have been turned into spaces, so it won't be usable by anyone.
> 
> Thats strange, I thought it was already in linux-2.6 anyway ?
> 
Umm.. I could be wrong! I just did a quick grep for your name in the 
2.6.39-rc4 changelog : 
http://www.kernel.org/pub/linux/kernel/v2.6/testing/ChangeLog-2.6.39-rc4

and didn't find it there.

--
Bandan

^ permalink raw reply

* Re: [Bugme-new] [Bug 33842] New: NULL pointer dereference in ip_fragment
From: Eric Dumazet @ 2011-04-26 21:01 UTC (permalink / raw)
  To: Bandan Das; +Cc: David Miller, netdev, akpm, tom
In-Reply-To: <20110426205901.GN15903@stratus.com>

Le mardi 26 avril 2011 à 16:59 -0400, Bandan Das a écrit :
> On  0, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > Le mardi 26 avril 2011 à 13:46 -0700, David Miller a écrit :
> > 
> > > This patch is mangled by your email client, tab characters
> > > have been turned into spaces, so it won't be usable by anyone.
> > 
> > Thats strange, I thought it was already in linux-2.6 anyway ?
> > 
> Umm.. I could be wrong! I just did a quick grep for your name in the 
> 2.6.39-rc4 changelog : 
> http://www.kernel.org/pub/linux/kernel/v2.6/testing/ChangeLog-2.6.39-rc4
> 
> and didn't find it there.

Then it will be in rc5, dont worry ;)




^ permalink raw reply

* Re: Linux TCP's Robustness to Multipath Packet Reordering
From: Dominik Kaspar @ 2011-04-26 21:04 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Carsten Wolff, netdev
In-Reply-To: <1303850622.2699.6.camel@edumazet-laptop>

On Tue, Apr 26, 2011 at 10:43 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Le lundi 25 avril 2011 à 16:35 +0200, Dominik Kaspar a écrit :
>
>> For the experiments, all default TCP options were used, meaning that
>> SACK, DSACK, Timestamps, were all enabled. Not sure how to turn on/off
>> TSO... so that is probably enabled, too. Path emulation is done with
>> tc/netem at the receiver interfaces (eth1, eth2) with this script:
>>
>> http://home.simula.no/~kaspar/static/netem.sh
>>
>
> What are the exact parameters ? (queue size for instance)
>
> It would be nice to give detailed stats after one run, on receiver
> (since you have netem on ingress side)
>
> tc -s -d qdisc

In these experiments, a queue size of 1000 packets was specified. I am
aware that this is typically referred to as "buffer bloat" and causes
the RTT and the cwnd to grow excessively. The smaller I configure the
queues, the more time it takes for TCP to "level up" to the aggregate
throughput. By keeping the queues so large, I hope to more quickly
identify the reason why TCP is actually able to adjust to the immense
multipath reordering. What parameters could be highly relevant, other
than the queue size?

Thanks for the tip about printing tc/netem statistics after each run,
I will use "tc -s -d qdisc" next time.

Greetings,
Dominik

^ permalink raw reply

* Re: Linux TCP's Robustness to Multipath Packet Reordering
From: Eric Dumazet @ 2011-04-26 21:08 UTC (permalink / raw)
  To: Dominik Kaspar; +Cc: Carsten Wolff, netdev
In-Reply-To: <BANLkTint2-Fg35T9SqWPm3nOaoc1d=ZEnQ@mail.gmail.com>

Le mardi 26 avril 2011 à 23:04 +0200, Dominik Kaspar a écrit :

> In these experiments, a queue size of 1000 packets was specified. I am
> aware that this is typically referred to as "buffer bloat" and causes
> the RTT and the cwnd to grow excessively. The smaller I configure the
> queues, the more time it takes for TCP to "level up" to the aggregate
> throughput. By keeping the queues so large, I hope to more quickly
> identify the reason why TCP is actually able to adjust to the immense
> multipath reordering. What parameters could be highly relevant, other
> than the queue size?
> 

losses of course ;)

Real internet is full of packet losses, and probability of these losses
depends on queue sizes (RED like AQM)


> Thanks for the tip about printing tc/netem statistics after each run,
> I will use "tc -s -d qdisc" next time.
> 



^ permalink raw reply

* Re: Linux TCP's Robustness to Multipath Packet Reordering
From: Dominik Kaspar @ 2011-04-26 21:16 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Carsten Wolff, netdev
In-Reply-To: <1303852092.2699.11.camel@edumazet-laptop>

On Tue, Apr 26, 2011 at 11:08 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Le mardi 26 avril 2011 à 23:04 +0200, Dominik Kaspar a écrit :
>
>> In these experiments, a queue size of 1000 packets was specified. I am
>> aware that this is typically referred to as "buffer bloat" and causes
>> the RTT and the cwnd to grow excessively. The smaller I configure the
>> queues, the more time it takes for TCP to "level up" to the aggregate
>> throughput. By keeping the queues so large, I hope to more quickly
>> identify the reason why TCP is actually able to adjust to the immense
>> multipath reordering. What parameters could be highly relevant, other
>> than the queue size?
>>
>
> losses of course ;)
>
> Real internet is full of packet losses, and probability of these losses
> depends on queue sizes (RED like AQM)
>

No additional random loss is introduced (yet), so packet loss happens
only when the queue size of 1000 packets is hit. Since the queues are
configured overly large, packet loss rarely happens at all... of
course at the cost of a large RTT.

I suspect that artificially bloating the RTT somehow allows TCP to
better adjust to multipath reordering... just haven't got a clue why.

Cheers,
Dominik

^ permalink raw reply

* Re: Linux TCP's Robustness to Multipath Packet Reordering
From: Eric Dumazet @ 2011-04-26 21:17 UTC (permalink / raw)
  To: Dominik Kaspar; +Cc: Carsten Wolff, netdev
In-Reply-To: <1303852092.2699.11.camel@edumazet-laptop>

Le mardi 26 avril 2011 à 23:08 +0200, Eric Dumazet a écrit :
> Le mardi 26 avril 2011 à 23:04 +0200, Dominik Kaspar a écrit :
> 
> > In these experiments, a queue size of 1000 packets was specified. I am
> > aware that this is typically referred to as "buffer bloat" and causes
> > the RTT and the cwnd to grow excessively. The smaller I configure the
> > queues, the more time it takes for TCP to "level up" to the aggregate
> > throughput. By keeping the queues so large, I hope to more quickly
> > identify the reason why TCP is actually able to adjust to the immense
> > multipath reordering. What parameters could be highly relevant, other
> > than the queue size?
> > 
> 
> losses of course ;)
> 
> Real internet is full of packet losses, and probability of these losses
> depends on queue sizes (RED like AQM)
> 
> 

BTW, netem in linux-2.6.39 contains lot of changes in netem module

commit 661b79725fea030803a89a16cda
(netem: revised correlated loss generator)

    This is a patch originated with Stefano Salsano and Fabio Ludovici.
    It provides several alternative loss models for use with netem.
    This patch adds two state machine based loss models.
    

http://netgroup.uniroma2.it/twiki/bin/view.cgi/Main/NetemCLG




^ permalink raw reply

* Re: [Bugme-new] [Bug 33842] New: NULL pointer dereference in ip_fragment
From: Bandan Das @ 2011-04-26 21:19 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Bandan Das, David Miller, netdev, akpm, tom
In-Reply-To: <1303851718.2699.8.camel@edumazet-laptop>

> > Umm.. I could be wrong! I just did a quick grep for your name in the 
> > 2.6.39-rc4 changelog : 
> > http://www.kernel.org/pub/linux/kernel/v2.6/testing/ChangeLog-2.6.39-rc4
> > 
> > and didn't find it there.
> 
> Then it will be in rc5, dont worry ;)
> 
> 
Yeah, I just rechecked and this is already in Linus' tree. So, Tomas you can
either try pulling in those changes or you can apply this patch and see
if it makes any difference. Thanks!


diff --git a/net/bridge/br_netfilter.c b/net/bridge/br_netfilter.c
index 008ff6c..f3bc322 100644
--- a/net/bridge/br_netfilter.c
+++ b/net/bridge/br_netfilter.c
@@ -249,11 +249,9 @@ static int br_parse_ip_options(struct sk_buff *skb)
 		goto drop;
 	}
 
-	/* Zero out the CB buffer if no options present */
-	if (iph->ihl == 5) {
-		memset(IPCB(skb), 0, sizeof(struct inet_skb_parm));
+	memset(IPCB(skb), 0, sizeof(struct inet_skb_parm));
+	if (iph->ihl == 5)
 		return 0;
-	}
 
 	opt->optlen = iph->ihl*4 - sizeof(struct iphdr);
 	if (ip_options_compile(dev_net(dev), opt, skb))

^ permalink raw reply related

* Re: [Bugme-new] [Bug 33842] New: NULL pointer dereference in ip_fragment
From: Eric Dumazet @ 2011-04-26 21:24 UTC (permalink / raw)
  To: Bandan Das; +Cc: David Miller, netdev, akpm, tom
In-Reply-To: <20110426211946.GO15903@stratus.com>

Le mardi 26 avril 2011 à 17:19 -0400, Bandan Das a écrit :
> > 
> Yeah, I just rechecked and this is already in Linus' tree. So, Tomas you can
> either try pulling in those changes or you can apply this patch and see
> if it makes any difference. Thanks!

Better pull Linus tree because there is another patch involved.

(commits c65353daf137dd41f3ede3baf62d561fca076228
ip: ip_options_compile() resilient to NULL skb route)




^ permalink raw reply

* Re: Linux TCP's Robustness to Multipath Packet Reordering
From: Dominik Kaspar @ 2011-04-26 21:27 UTC (permalink / raw)
  To: John Heffner; +Cc: Eric Dumazet, Carsten Wolff, netdev
In-Reply-To: <BANLkTikoQfbSxnJi_OR+N6sa5iVNcTO6Ug@mail.gmail.com>

Hi John,

Thanks for your advice. I am very well aware that TCP is not designed
to work under such conditions. I am still surprised how well Linux TCP
handles many situations of excessive, persistent packet reordering. In
scenarios of fairly heterogeneous path characteristics, Linux TCP
aggregates multiple paths close to ideally :-)

If I'm not mistaken, cwnd moderation is a measure to prevent TCP from
sending large bursts if a single ACK covers many segments. In what way
can cwnd moderation prevent TCP from increasing its estimate of packet
reordering?

Greetings,
Dominik

On Tue, Apr 26, 2011 at 10:16 PM, John Heffner <johnwheffner@gmail.com> wrote:
> First, TCP is definitely not designed to work under such conditions.
> For example, assumptions behind RTO calculation and fast retransmit
> heuristics are violated.  However, in this particular case my first
> guess is that you are being limited by "cwnd moderation," which was
> the topic of recent discussion here.  Under persistent reordering,
> cwnd moderation can inhibit the ability of cwnd to grow.
>
> Thanks,
>  -John
>
>
> On Tue, Apr 26, 2011 at 2:00 PM, Dominik Kaspar <dokaspar.ietf@gmail.com> wrote:
>> Hi Eric,
>>
>> Here are the tcpdump files for the first TSO-disabled experiment, in a
>> full version and a short version with only the first 10000 packets:
>>
>> http://home.simula.no/~kaspar/static/mptcp-emu-wlan-hspa-01-tos0-exp1-full.pcap
>> http://home.simula.no/~kaspar/static/mptcp-emu-wlan-hspa-01-tos0-exp1-short.pcap
>>
>> By the way, the packets are sent from the server (x.x.x.189) to the
>> client interfaces (x.x.x.74) and (x.x.x.216) with the following
>> pattern (which is a non-bursty 128-bit approximation of scheduling
>> with a 600:400 ratio over primary path 0 and secondary path 1):
>>
>> 0010010100101001010010100101001010010100101001010010100101001010
>> 0101001010010100101001010010100101001010010100101001010010100101
>>
>> Greetings,
>> Dominik
>>
>> On Tue, Apr 26, 2011 at 7:10 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>> Le mardi 26 avril 2011 à 18:58 +0200, Dominik Kaspar a écrit :
>>>> Hi Eric,
>>>>
>>>> On Mon, Apr 25, 2011 at 5:38 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>>> >
>>>> > Since you have at sender a rule to spoof destination address of packets,
>>>> > you should make sure you dont send "super packets (up to 64Kbytes)",
>>>> > because it would stress the multipath more than you wanted to. This way,
>>>> > you send only normal packets (1500 MTU).
>>>> >
>>>> > ethtool -K eth0 tso off
>>>> > ethtool -K eth0 gso off
>>>> >
>>>> > I am pretty sure it should help your (atypic) workload.
>>>>
>>>> I made new experiments with the exact same multipath setup as before,
>>>> but disabled TSO and GSO on all involved Ethernet interfaces. However,
>>>> this did not seem to change much about TCP's behavior when packets are
>>>> striped over heterogeneous paths. You can see the results of four
>>>> 20-minute experiments on this plot:
>>>>
>>>> http://home.simula.no/~kaspar/static/mptcp-emu-wlan-hspa-01-tos0.png
>>>>
>>>> Cheers,
>>>> Dominik
>>>
>>> Hi Dominik
>>>
>>> Any chance to have a pcap file from sender side, of say first 10.000
>>> packets ?
>>>
>>>
>>>
>>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe netdev" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>

^ permalink raw reply

* Re: [PATCH net-next-2.6 0/7] SCTP updates for net-next-2.6
From: David Miller @ 2011-04-26 21:51 UTC (permalink / raw)
  To: yjwei; +Cc: netdev, linux-sctp
In-Reply-To: <20110426.001238.183056292.davem@davemloft.net>

Wei, while you are re-spinning this patch set I want to bring up
something I just noticed in the SCTP code.

The ->dst_saddr() method is not used by anything, it appears.

The ipv4 variant, sctp_v4_dst_saddr() is called internally by the
ipv4 specific code, but that's it.

So I think the ->dst_saddr member of sctp_pf can be completely
removed, as can sctp_v6_dst_saddr().

The sctp_v4_dst_saddr() function, of course, will need to be retained.

^ permalink raw reply

* RE: [RFC PATCH] netlink: Increase netlink dump skb message size
From: Rose, Gregory V @ 2011-04-26 21:58 UTC (permalink / raw)
  To: Ben Hutchings, Eric Dumazet; +Cc: David Miller, netdev@vger.kernel.org
In-Reply-To: <1303844836.2850.6.camel@bwh-desktop>

> -----Original Message-----
> From: Ben Hutchings [mailto:bhutchings@solarflare.com]
> Sent: Tuesday, April 26, 2011 12:07 PM
> To: Rose, Gregory V; Eric Dumazet
> Cc: David Miller; netdev@vger.kernel.org
> Subject: RE: [RFC PATCH] netlink: Increase netlink dump skb message size
> 
> On Tue, 2011-04-26 at 09:24 -0700, Rose, Gregory V wrote:
> > > -----Original Message-----
> > > From: Eric Dumazet [mailto:eric.dumazet@gmail.com]
> > > Sent: Tuesday, April 26, 2011 9:21 AM
> > > To: Rose, Gregory V
> > > Cc: David Miller; netdev@vger.kernel.org; bhutchings@solarflare.com
> > > Subject: RE: [RFC PATCH] netlink: Increase netlink dump skb message
> size
> > >
> > > Le mardi 26 avril 2011 à 09:12 -0700, Rose, Gregory V a écrit :
> > >
> > > > I'm fine with however you folks want to approach this, just give me
> some
> > > direction.
> > >
> > > I would just try following patch :
> >
> > I'll test it out, it's certainly a lot simpler.  Perhaps I was getting a
> bit too fancy.
> >
> > Ben would want to make sure it works for 127 VFs, my device does 63.
> 
> I haven't been working on SR-IOV myself so I'll pass this on to a
> colleague.  Thanks, Eric.

Eric's patch works fine for the case of 63 VFs.  For most dumps it's allocating quite a bit more memory than necessary but that's not a big issue since it's transient and not held for long.

- Greg



^ permalink raw reply

* Re: [PATCH] iproute2: improve mqprio inputs for queue offsets and counts
From: Stephen Hemminger @ 2011-04-26 22:00 UTC (permalink / raw)
  To: John Fastabend; +Cc: bhutchings, netdev
In-Reply-To: <20110426194441.23726.23489.stgit@jf-dev1-dcblab>

On Tue, 26 Apr 2011 12:44:42 -0700
John Fastabend <john.r.fastabend@intel.com> wrote:

> This changes mqprio input format to be more user friendly.
> 
> Old usage,
> 
> # ./tc/tc qdisc add dev eth3 root mqprio help
> Usage: ... mqprio [num_tc NUMBER] [map P0 P1...]
>                   [offset txq0 txq1 ...] [count cnt0 cnt1 ...] [hw 1|0]
> 
> New uage,
> 
> # ./tc/tc qdisc add dev eth3 root mqprio help
> Usage: ... mqprio [num_tc NUMBER] [map P0 P1 ...]
>                   [queues count1@offset1 count2@offset2 ...] [hw 1|0]
> 
> Suggested-by: Ben Hutchings <bhutchings@solarflare.com>
> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>

Applied (fixed typo in commit message)

-- 

^ permalink raw reply

* rtl8169 related kernel splat in 2.6.38.4
From: Ben Greear @ 2011-04-26 22:03 UTC (permalink / raw)
  To: netdev

I see this on every (or at least most) boots:

------------[ cut here ]------------
WARNING: at /home/greearb/git/linux-2.6.dev.38.y/kernel/timer.c:983 del_timer_sync+0x25/0x37()
Hardware name: To Be Filled By O.E.M.
Modules linked in: veth 8021q garp stp llc fuse macvlan wanlink(P) pktgen coretemp hwmon nfs lockd fscache nfs_acl auth_rpcgss sunrpc ipv6 uinput arc4 ecb 
snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_seq ath9k snd_seq_device mac80211 snd_pcm ath9k_common ath9k_hw snd_timer ath cfg80211 snd r8169 
iTCO_wdt soundcore iTCO_vendor_support microcode i2c_i801 snd_page_alloc serio_raw mii pcspkr i915 drm_kms_helper drm i2c_algo_bit video [last unloaded: 
scsi_wait_scan]
Pid: 3224, comm: ifup Tainted: P            2.6.38.4+ #6
Call Trace:
  [<c043094b>] ? warn_slowpath_common+0x65/0x7a
  [<c04398a0>] ? del_timer_sync+0x25/0x37
  [<c043096f>] ? warn_slowpath_null+0xf/0x13
  [<c04398a0>] ? del_timer_sync+0x25/0x37
  [<c06e94f3>] ? linkwatch_schedule_work+0x68/0x83
  [<c06e95b6>] ? linkwatch_fire_event+0xa8/0xad
  [<c06ef193>] ? netif_carrier_on+0x23/0x34
  [<f8af1b32>] ? __rtl8169_check_link_status+0x4f/0xaf [r8169]
  [<f8af2376>] ? rtl8169_interrupt+0x1e3/0x287 [r8169]
  [<c046c87c>] ? handle_IRQ_event+0x1d/0x9c
  [<c046e130>] ? handle_edge_irq+0xba/0x100
  [<c046e076>] ? handle_edge_irq+0x0/0x100
  <IRQ>  [<c0404026>] ? do_IRQ+0x37/0x90
  [<c04035e9>] ? common_interrupt+0x29/0x30
  [<c0491283>] ? unmap_vmas+0x477/0x5b7
  [<c049298b>] ? __do_fault+0x2f3/0x323
  [<c0495676>] ? exit_mmap+0x7a/0xc3
  [<c042ecd1>] ? mmput+0x61/0xd5
  [<c0432430>] ? exit_mm+0x103/0x10b
  [<c0433cd9>] ? do_exit+0x1b4/0x5cc
  [<c0580fed>] ? copy_to_user+0x2f/0x106
  [<c043a88c>] ? sys_rt_sigaction+0x64/0x77
  [<c0434153>] ? do_group_exit+0x62/0x85
  [<c0772373>] ? do_device_not_available+0x0/0x1a
  [<c0434189>] ? sys_exit_group+0x13/0x17
  [<c040301c>] ? sysenter_do_call+0x12/0x28
---[ end trace da518b58347d6cca ]---

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


^ permalink raw reply

* [PATCH 0/4] Minor ipv4 routing cleanups.
From: David Miller @ 2011-04-26 22:11 UTC (permalink / raw)
  To: netdev

Just a few cleanups and simplifications I came across while
auditing some aspects of the ipv4 routing code.

I'm going to do some more testing and validation on these changes
before I actually push them out to net-next-2.6

Signed-off-by: David S. Miller <davem@davemloft.net>

^ permalink raw reply

* [PATCH 1/4] ipv4: Respect 'saddr' and 'daddr' args to ip_build_and_send_pkt().
From: David Miller @ 2011-04-26 22:12 UTC (permalink / raw)
  To: netdev

This function ignores the passed in addresses and forces their
settings using rt->rt_dst and rt->rt_src.

There is never a reason to do this, because the socket of the
callers of this function must know what addresses it is using.

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/ipv4/ip_output.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index bdad3d6..e0d0d5d 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -158,8 +158,8 @@ int ip_build_and_send_pkt(struct sk_buff *skb, struct sock *sk,
 	else
 		iph->frag_off = 0;
 	iph->ttl      = ip_select_ttl(inet, &rt->dst);
-	iph->daddr    = rt->rt_dst;
-	iph->saddr    = rt->rt_src;
+	iph->daddr    = daddr;
+	iph->saddr    = saddr;
 	iph->protocol = sk->sk_protocol;
 	ip_select_ident(iph, &rt->dst, sk);

-- 
1.7.4.5

^ permalink raw reply related

* [PATCH 2/4] ipv4: Sanitize and simplify ip_route_{connect,newports}()
From: David Miller @ 2011-04-26 22:12 UTC (permalink / raw)
  To: netdev


These functions are used together as a unit for route resolution
during connect().  They address the chicken-and-egg problem that
exists when ports need to be allocated during connect() processing,
yet such port allocations require addressing information from the
routing code.

It's currently more heavy handed than it needs to be, and in
particular we allocate and initialize a flow object twice.

Let the callers provide the on-stack flow object.  That way we only
need to initialize it once in the ip_route_connect() call.

Later, if ip_route_newports() needs to do anything, it re-uses that
flow object as-is except for the ports which it updates before the
route re-lookup.

Also, describe why this set of facilities are needed and how it works
in a big comment.

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 include/net/route.h |   88 ++++++++++++++++++++++++++++++++------------------
 net/dccp/ipv4.c     |   10 +++---
 net/ipv4/af_inet.c  |    3 +-
 net/ipv4/datagram.c |    3 +-
 net/ipv4/tcp_ipv4.c |   10 +++---
 net/l2tp/l2tp_ip.c  |    8 ++--
 6 files changed, 74 insertions(+), 48 deletions(-)

diff --git a/include/net/route.h b/include/net/route.h
index 3684c3e..79530da 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -217,17 +217,37 @@ static inline char rt_tos2priority(u8 tos)
 	return ip_tos2prio[IPTOS_TOS(tos)>>1];
 }
 
-static inline struct rtable *ip_route_connect(__be32 dst, __be32 src, u32 tos,
-					      int oif, u8 protocol,
-					      __be16 sport, __be16 dport,
-					      struct sock *sk, bool can_sleep)
+/* ip_route_connect() and ip_route_newports() work in tandem whilst
+ * binding a socket for a new outgoing connection.
+ *
+ * In order to use IPSEC properly, we must, in the end, have a
+ * route that was looked up using all available keys including source
+ * and destination ports.
+ *
+ * However, if a source port needs to be allocated (the user specified
+ * a wildcard source port) we need to obtain addressing information
+ * in order to perform that allocation.
+ *
+ * So ip_route_connect() looks up a route using wildcarded source and
+ * destination ports in the key, simply so that we can get a pair of
+ * addresses to use for port allocation.
+ *
+ * Later, once the ports are allocated, ip_route_newports() will make
+ * another route lookup if needed to make sure we catch any IPSEC
+ * rules keyed on the port information.
+ *
+ * The callers allocate the flow key on their stack, and must pass in
+ * the same flowi4 object to both the ip_route_connect() and the
+ * ip_route_newports() calls.
+ */
+
+static inline void ip_route_connect_init(struct flowi4 *fl4, __be32 dst, __be32 src,
+					 u32 tos, int oif, u8 protocol,
+					 __be16 sport, __be16 dport,
+					 struct sock *sk, bool can_sleep)
 {
-	struct net *net = sock_net(sk);
-	struct rtable *rt;
-	struct flowi4 fl4;
-	__u8 flow_flags;
+	__u8 flow_flags = 0;
 
-	flow_flags = 0;
 	if (inet_sk(sk)->transparent)
 		flow_flags |= FLOWI_FLAG_ANYSRC;
 	if (protocol == IPPROTO_TCP)
@@ -235,41 +255,45 @@ static inline struct rtable *ip_route_connect(__be32 dst, __be32 src, u32 tos,
 	if (can_sleep)
 		flow_flags |= FLOWI_FLAG_CAN_SLEEP;
 
-	flowi4_init_output(&fl4, oif, sk->sk_mark, tos, RT_SCOPE_UNIVERSE,
+	flowi4_init_output(fl4, oif, sk->sk_mark, tos, RT_SCOPE_UNIVERSE,
 			   protocol, flow_flags, dst, src, dport, sport);
+}
+
+static inline struct rtable *ip_route_connect(struct flowi4 *fl4,
+					      __be32 dst, __be32 src, u32 tos,
+					      int oif, u8 protocol,
+					      __be16 sport, __be16 dport,
+					      struct sock *sk, bool can_sleep)
+{
+	struct net *net = sock_net(sk);
+	struct rtable *rt;
+
+	ip_route_connect_init(fl4, dst, src, tos, oif, protocol,
+			      sport, dport, sk, can_sleep);
 
 	if (!dst || !src) {
-		rt = __ip_route_output_key(net, &fl4);
+		rt = __ip_route_output_key(net, fl4);
 		if (IS_ERR(rt))
 			return rt;
-		fl4.daddr = rt->rt_dst;
-		fl4.saddr = rt->rt_src;
+		fl4->daddr = rt->rt_dst;
+		fl4->saddr = rt->rt_src;
 		ip_rt_put(rt);
 	}
-	security_sk_classify_flow(sk, flowi4_to_flowi(&fl4));
-	return ip_route_output_flow(net, &fl4, sk);
+	security_sk_classify_flow(sk, flowi4_to_flowi(fl4));
+	return ip_route_output_flow(net, fl4, sk);
 }
 
-static inline struct rtable *ip_route_newports(struct rtable *rt,
-					       u8 protocol, __be16 orig_sport,
-					       __be16 orig_dport, __be16 sport,
-					       __be16 dport, struct sock *sk)
+static inline struct rtable *ip_route_newports(struct flowi4 *fl4, struct rtable *rt,
+					       __be16 orig_sport, __be16 orig_dport,
+					       __be16 sport, __be16 dport,
+					       struct sock *sk)
 {
 	if (sport != orig_sport || dport != orig_dport) {
-		struct flowi4 fl4;
-		__u8 flow_flags;
-
-		flow_flags = 0;
-		if (inet_sk(sk)->transparent)
-			flow_flags |= FLOWI_FLAG_ANYSRC;
-		if (protocol == IPPROTO_TCP)
-			flow_flags |= FLOWI_FLAG_PRECOW_METRICS;
-		flowi4_init_output(&fl4, rt->rt_oif, rt->rt_mark, rt->rt_tos,
-				   RT_SCOPE_UNIVERSE, protocol, flow_flags,
-				   rt->rt_dst, rt->rt_src, dport, sport);
+		fl4->fl4_dport = dport;
+		fl4->fl4_sport = sport;
 		ip_rt_put(rt);
-		security_sk_classify_flow(sk, flowi4_to_flowi(&fl4));
-		return ip_route_output_flow(sock_net(sk), &fl4, sk);
+		security_sk_classify_flow(sk, flowi4_to_flowi(fl4));
+		return ip_route_output_flow(sock_net(sk), fl4, sk);
 	}
 	return rt;
 }
diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c
index ae451c6..b92ab65 100644
--- a/net/dccp/ipv4.c
+++ b/net/dccp/ipv4.c
@@ -40,12 +40,13 @@
 
 int dccp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 {
+	const struct sockaddr_in *usin = (struct sockaddr_in *)uaddr;
 	struct inet_sock *inet = inet_sk(sk);
 	struct dccp_sock *dp = dccp_sk(sk);
-	const struct sockaddr_in *usin = (struct sockaddr_in *)uaddr;
 	__be16 orig_sport, orig_dport;
-	struct rtable *rt;
 	__be32 daddr, nexthop;
+	struct flowi4 fl4;
+	struct rtable *rt;
 	int err;
 
 	dp->dccps_role = DCCP_ROLE_CLIENT;
@@ -65,7 +66,7 @@ int dccp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 
 	orig_sport = inet->inet_sport;
 	orig_dport = usin->sin_port;
-	rt = ip_route_connect(nexthop, inet->inet_saddr,
+	rt = ip_route_connect(&fl4, nexthop, inet->inet_saddr,
 			      RT_CONN_FLAGS(sk), sk->sk_bound_dev_if,
 			      IPPROTO_DCCP,
 			      orig_sport, orig_dport, sk, true);
@@ -101,8 +102,7 @@ int dccp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 	if (err != 0)
 		goto failure;
 
-	rt = ip_route_newports(rt, IPPROTO_DCCP,
-			       orig_sport, orig_dport,
+	rt = ip_route_newports(&fl4, rt, orig_sport, orig_dport,
 			       inet->inet_sport, inet->inet_dport, sk);
 	if (IS_ERR(rt)) {
 		rt = NULL;
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index cae75ef..0413af3 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1103,6 +1103,7 @@ static int inet_sk_reselect_saddr(struct sock *sk)
 	struct inet_sock *inet = inet_sk(sk);
 	__be32 old_saddr = inet->inet_saddr;
 	__be32 daddr = inet->inet_daddr;
+	struct flowi4 fl4;
 	struct rtable *rt;
 	__be32 new_saddr;
 
@@ -1110,7 +1111,7 @@ static int inet_sk_reselect_saddr(struct sock *sk)
 		daddr = inet->opt->faddr;
 
 	/* Query new route. */
-	rt = ip_route_connect(daddr, 0, RT_CONN_FLAGS(sk),
+	rt = ip_route_connect(&fl4, daddr, 0, RT_CONN_FLAGS(sk),
 			      sk->sk_bound_dev_if, sk->sk_protocol,
 			      inet->inet_sport, inet->inet_dport, sk, false);
 	if (IS_ERR(rt))
diff --git a/net/ipv4/datagram.c b/net/ipv4/datagram.c
index 85bd24c..216ba23 100644
--- a/net/ipv4/datagram.c
+++ b/net/ipv4/datagram.c
@@ -24,6 +24,7 @@ int ip4_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 {
 	struct inet_sock *inet = inet_sk(sk);
 	struct sockaddr_in *usin = (struct sockaddr_in *) uaddr;
+	struct flowi4 fl4;
 	struct rtable *rt;
 	__be32 saddr;
 	int oif;
@@ -46,7 +47,7 @@ int ip4_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 		if (!saddr)
 			saddr = inet->mc_addr;
 	}
-	rt = ip_route_connect(usin->sin_addr.s_addr, saddr,
+	rt = ip_route_connect(&fl4, usin->sin_addr.s_addr, saddr,
 			      RT_CONN_FLAGS(sk), oif,
 			      sk->sk_protocol,
 			      inet->inet_sport, usin->sin_port, sk, true);
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index edf18bd..310454c 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -146,12 +146,13 @@ EXPORT_SYMBOL_GPL(tcp_twsk_unique);
 /* This will initiate an outgoing connection. */
 int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 {
+	struct sockaddr_in *usin = (struct sockaddr_in *)uaddr;
 	struct inet_sock *inet = inet_sk(sk);
 	struct tcp_sock *tp = tcp_sk(sk);
-	struct sockaddr_in *usin = (struct sockaddr_in *)uaddr;
 	__be16 orig_sport, orig_dport;
-	struct rtable *rt;
 	__be32 daddr, nexthop;
+	struct flowi4 fl4;
+	struct rtable *rt;
 	int err;
 
 	if (addr_len < sizeof(struct sockaddr_in))
@@ -169,7 +170,7 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 
 	orig_sport = inet->inet_sport;
 	orig_dport = usin->sin_port;
-	rt = ip_route_connect(nexthop, inet->inet_saddr,
+	rt = ip_route_connect(&fl4, nexthop, inet->inet_saddr,
 			      RT_CONN_FLAGS(sk), sk->sk_bound_dev_if,
 			      IPPROTO_TCP,
 			      orig_sport, orig_dport, sk, true);
@@ -236,8 +237,7 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 	if (err)
 		goto failure;
 
-	rt = ip_route_newports(rt, IPPROTO_TCP,
-			       orig_sport, orig_dport,
+	rt = ip_route_newports(&fl4, rt, orig_sport, orig_dport,
 			       inet->inet_sport, inet->inet_dport, sk);
 	if (IS_ERR(rt)) {
 		err = PTR_ERR(rt);
diff --git a/net/l2tp/l2tp_ip.c b/net/l2tp/l2tp_ip.c
index fce9bd3..cc67367 100644
--- a/net/l2tp/l2tp_ip.c
+++ b/net/l2tp/l2tp_ip.c
@@ -296,12 +296,12 @@ out_in_use:
 
 static int l2tp_ip_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 {
-	int rc;
-	struct inet_sock *inet = inet_sk(sk);
 	struct sockaddr_l2tpip *lsa = (struct sockaddr_l2tpip *) uaddr;
+	struct inet_sock *inet = inet_sk(sk);
+	struct flowi4 fl4;
 	struct rtable *rt;
 	__be32 saddr;
-	int oif;
+	int oif, rc;
 
 	rc = -EINVAL;
 	if (addr_len < sizeof(*lsa))
@@ -320,7 +320,7 @@ static int l2tp_ip_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len
 	if (ipv4_is_multicast(lsa->l2tp_addr.s_addr))
 		goto out;
 
-	rt = ip_route_connect(lsa->l2tp_addr.s_addr, saddr,
+	rt = ip_route_connect(&fl4, lsa->l2tp_addr.s_addr, saddr,
 			      RT_CONN_FLAGS(sk), oif,
 			      IPPROTO_L2TP,
 			      0, 0, sk, true);
-- 
1.7.4.5


^ permalink raw reply related

* [PATCH 3/4] ipv4: Remove erroneous check in igmpv3_newpack() and igmp_send_report().
From: David Miller @ 2011-04-26 22:12 UTC (permalink / raw)
  To: netdev

Output route resolution never returns a route with rt_src set to zero
(which is INADDR_ANY).

Even if the flow key for the output route lookup specifies INADDR_ANY
for the source address, the output route resolution chooses a real
source address to use in the final route.

This test has existed forever in igmp_send_report() and David Stevens
simply copied over the erroneous test when implementing support for
IGMPv3.

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/ipv4/igmp.c |   10 ----------
 1 files changed, 0 insertions(+), 10 deletions(-)

diff --git a/net/ipv4/igmp.c b/net/ipv4/igmp.c
index 1fd3d9c..8ae0a57 100644
--- a/net/ipv4/igmp.c
+++ b/net/ipv4/igmp.c
@@ -328,11 +328,6 @@ static struct sk_buff *igmpv3_newpack(struct net_device *dev, int size)
 		kfree_skb(skb);
 		return NULL;
 	}
-	if (rt->rt_src == 0) {
-		kfree_skb(skb);
-		ip_rt_put(rt);
-		return NULL;
-	}

 	skb_dst_set(skb, &rt->dst);
 	skb->dev = dev;
@@ -670,11 +665,6 @@ static int igmp_send_report(struct in_device *in_dev, struct ip_mc_list *pmc,
 	if (IS_ERR(rt))
 		return -1;

-	if (rt->rt_src == 0) {
-		ip_rt_put(rt);
-		return -1;
-	}
-
 	skb = alloc_skb(IGMP_SIZE+LL_ALLOCATED_SPACE(dev), GFP_ATOMIC);
 	if (skb == NULL) {
 		ip_rt_put(rt);
-- 
1.7.4.5

^ permalink raw reply related

* [PATCH 4/4] ipv4: Kill RTO_CONN.
From: David Miller @ 2011-04-26 22:12 UTC (permalink / raw)
  To: netdev


It's not used by anything in the kernel, and defined in net/route.h so
never exported to userspace.

Therefore we can safely remove it.

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 include/net/route.h |    4 ----
 1 files changed, 0 insertions(+), 4 deletions(-)

diff --git a/include/net/route.h b/include/net/route.h
index 79530da..fdbdb92 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -37,10 +37,6 @@
 
 #define RTO_ONLINK	0x01
 
-#define RTO_CONN	0
-/* RTO_CONN is not used (being alias for 0), but preserved not to break
- * some modules referring to it. */
-
 #define RT_CONN_FLAGS(sk)   (RT_TOS(inet_sk(sk)->tos) | sock_flag(sk, SOCK_LOCALROUTE))
 
 struct fib_nh;
-- 
1.7.4.5


^ permalink raw reply related

* Re: [PATCH 12/13] mm: Throttle direct reclaimers if PF_MEMALLOC reserves are low and swap is backed by network storage
From: NeilBrown @ 2011-04-26 23:18 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Linux-MM, Linux-Netdev, LKML, David Miller, Peter Zijlstra
In-Reply-To: <20110426142624.GH4658@suse.de>

On Tue, 26 Apr 2011 15:26:24 +0100 Mel Gorman <mgorman@suse.de> wrote:

> On Tue, Apr 26, 2011 at 10:30:59PM +1000, NeilBrown wrote:
> > On Tue, 26 Apr 2011 08:36:53 +0100 Mel Gorman <mgorman@suse.de> wrote:
> > 
> > 
> > > +/*
> > > + * Throttle direct reclaimers if backing storage is backed by the network
> > > + * and the PFMEMALLOC reserve for the preferred node is getting dangerously
> > > + * depleted. kswapd will continue to make progress and wake the processes
> > > + * when the low watermark is reached
> > > + */
> > > +static void throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
> > > +					nodemask_t *nodemask)
> > > +{
> > > +	struct zone *zone;
> > > +	int high_zoneidx = gfp_zone(gfp_mask);
> > > +	DEFINE_WAIT(wait);
> > > +
> > > +	/* Check if the pfmemalloc reserves are ok */
> > > +	first_zones_zonelist(zonelist, high_zoneidx, NULL, &zone);
> > > +	prepare_to_wait(&zone->zone_pgdat->pfmemalloc_wait, &wait,
> > > +							TASK_INTERRUPTIBLE);
> > > +	if (pfmemalloc_watermark_ok(zone->zone_pgdat, high_zoneidx))
> > > +		goto out;
> > > +
> > > +	/* Throttle */
> > > +	do {
> > > +		schedule();
> > > +		finish_wait(&zone->zone_pgdat->pfmemalloc_wait, &wait);
> > > +		prepare_to_wait(&zone->zone_pgdat->pfmemalloc_wait, &wait,
> > > +							TASK_INTERRUPTIBLE);
> > > +	} while (!pfmemalloc_watermark_ok(zone->zone_pgdat, high_zoneidx) &&
> > > +			!fatal_signal_pending(current));
> > > +
> > > +out:
> > > +	finish_wait(&zone->zone_pgdat->pfmemalloc_wait, &wait);
> > > +}
> > 
> > You are doing an interruptible wait, but only checking for fatal signals.
> > So if a non-fatal signal arrives, you will busy-wait.
> > 
> > So I suspect you want TASK_KILLABLE, so just use:
> > 
> >     wait_event_killable(zone->zone_pgdat->pfmemalloc_wait,
> >                         pgmemalloc_watermark_ok(zone->zone_pgdata,
> >                                                 high_zoneidx));
> > 
> 
> Well, if a normal signal arrives, we do not necessarily want the
> process to enter reclaim. For fatal signals, I allow it to continue
> because it's not likely to be putting the system under more pressure
> if it's exiting.

Yep, I understand that and it doesn't seem unreasonable.

However I don't think the code implements that correctly.

If you get a non-fatal signal, schedule will exit immediately (because of the
TASK_INTERRUPTIBLE setting) and the 'while' clause will succeed because the
signal is not fatal, so it will loop around and try to schedule again, which
will again exit immediately - busy loop.

> 
> > (You also have an extraneous call to finish_wait)
> > 
> 
> Which one? I'm not seeing a flow where finish_wait gets called twice
> without a prepare_to_wait in between. 
> 

You don't need to call finish_wait immediately before prepare_to_wait.

It really is best to just use the appropriate 'wait_event*' macro....

NeilBrown

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH 09/13] netvm: Set PF_MEMALLOC as appropriate during SKB processing
From: NeilBrown @ 2011-04-26 23:22 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Linux-MM, Linux-Netdev, LKML, David Miller, Peter Zijlstra
In-Reply-To: <20110426141048.GG4658@suse.de>

On Tue, 26 Apr 2011 15:10:48 +0100 Mel Gorman <mgorman@suse.de> wrote:

> On Tue, Apr 26, 2011 at 10:21:57PM +1000, NeilBrown wrote:
> > On Tue, 26 Apr 2011 08:36:50 +0100 Mel Gorman <mgorman@suse.de> wrote:
> > 
> > > diff --git a/net/core/dev.c b/net/core/dev.c
> > > index 3871bf6..2d79a20 100644
> > > --- a/net/core/dev.c
> > > +++ b/net/core/dev.c
> > > @@ -3095,6 +3095,27 @@ static void vlan_on_bond_hook(struct sk_buff *skb)
> > >  	}
> > >  }
> > >  
> > > +/*
> > > + * Limit which protocols can use the PFMEMALLOC reserves to those that are
> > > + * expected to be used for communication with swap.
> > > + */
> > > +static bool skb_pfmemalloc_protocol(struct sk_buff *skb)
> > > +{
> > > +	if (skb_pfmemalloc(skb))
> > > +		switch (skb->protocol) {
> > > +		case __constant_htons(ETH_P_ARP):
> > > +		case __constant_htons(ETH_P_IP):
> > > +		case __constant_htons(ETH_P_IPV6):
> > > +		case __constant_htons(ETH_P_8021Q):
> > > +			break;
> > > +
> > > +		default:
> > > +			return false;
> > > +		}
> > > +
> > > +	return true;
> > > +}
> > 
> > This sort of thing really bugs me :-)
> > Neither the comment nor the function name actually describe what the function
> > is doing.  The function is checking *2* things.
> >    is_pfmemalloc_skb_or_pfmemalloc_protocol()
> > might be a more correct name, but is too verbose.
> > 
> > I would prefer the skb_pfmemalloc test were removed from here and ....
> > 
> > > +	if (!skb_pfmemalloc_protocol(skb))
> > > +		goto drop;
> > > +
> > 
> > ...added here so this becomes:
> > 
> >       if (!skb_pfmemalloc(skb) && !skb_pfmemalloc_protocol(skb))
> >                 goto drop;
> > 
> > which actually makes sense.
> > 
> 
> Moving the check is neater but that check should be
> 
> 	if (skb_pfmemalloc(skb) && !skb_pfmemalloc_protocol(skb))
> 
> ? It's only if the skb was allocated from emergency reserves that we
> need to consider dropping it to make way for other packets to be
> received.
> 

Correct.  I got my Boolean algebra all confused. Sorry 'bout that.

NeilBrown

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* RE: Strange igb bug, out-of-tree driver seems to work fine.
From: Wyborny, Carolyn @ 2011-04-26 23:23 UTC (permalink / raw)
  To: Ben Greear; +Cc: netdev
In-Reply-To: <4DB0F965.4080605@candelatech.com>

>-----Original Message-----
>From: Ben Greear [mailto:greearb@candelatech.com]
>Sent: Thursday, April 21, 2011 8:44 PM
>To: Wyborny, Carolyn
>Cc: netdev
>Subject: Re: Strange igb bug, out-of-tree driver seems to work fine.
>
>On 04/21/2011 05:16 PM, Wyborny, Carolyn wrote:
>>
>>
>>> -----Original Message-----
>>> From: netdev-owner@vger.kernel.org [mailto:netdev-
>owner@vger.kernel.org]
>>> On Behalf Of Ben Greear
>>> Sent: Thursday, April 21, 2011 2:39 PM
>>> To: netdev
>>> Subject: Strange igb bug, out-of-tree driver seems to work fine.
>>>
>>> We have a 4-port NIC using the igb driver in a couple of systems,
>>> and saw some strange issues (mostly rx CRC errors).  We tried 2.6.34,
>>> 2.6.36, and 2.6.38 based kernels and all showed issues.  The NICs
>>> are from two different vendors, though both use the igb driver.
>>>
>>> We then tried 2.6.36.3 with the out-of-tree igb-2.4.13.tar.gz
>>> driver.  And everything seems to run clean.
>>>
>>> My kernels are somewhat hacked and tainted, and this testing
>>> is using my tainting module, but since changing only the driver
>>> seems to fix things, it's _probably_ not my fault this time!
>>>
>>> I am running a small hack to make igb work better with VLANs
>>> in my 2.6.36 and .38 kernel, though not in the .34.  Here is the .36
>>> diff:
>
>>
>> Hello,
>>
>> There have been recent patches for igb that I submitted that might
>contain the fix you found in the out-of-tree driver, although I'm not
>sure which one would affect rx CRC errors.  Were there other symptoms?
>Any messages in the log?  I'll look through the patches and see if there
>is anything obvious and get you some commit numbers to try.
>
>There was nothing else that I noticed.  If your patches are in 2.6.39-
>rc4+, I can try
>them there perhaps?
>
>Thanks,
>Ben
>
>
>--
>Ben Greear <greearb@candelatech.com>
>Candela Technologies Inc  http://www.candelatech.com

Hello,

I'm sorry for the delay in responding.  I'm really scratching my head on this one as we don't do much in the driver that affects what we get on receive.  I've seen situations where some switches end up transmitting more of these and then we record more of them, but I'm guessing you're testing with the same equipment, just a different driver version.  Let me know if I'm mistaken there.  

So, to answer your question, I believe my patches are there, but I did review them again and I'm not sure they will make any difference.  My latest batch of patches was to add features to the i350 device specifically.

Give it try though and let me know if you see any difference with 2.6.39-rc4+.

Thanks,

Carolyn

Carolyn Wyborny
Linux Development
LAN Access Division
Intel Corporation

^ permalink raw reply

* RE: Kernel crash after using new Intel NIC (igb)
From: Wyborny, Carolyn @ 2011-04-26 23:34 UTC (permalink / raw)
  To: Maximilian Engelhardt, linux-kernel@vger.kernel.org
  Cc: netdev@vger.kernel.org, StuStaNet Vorstand,
	e1000-devel@lists.sourceforge.net
In-Reply-To: <201104250033.03401.maxi@daemonizer.de>



>-----Original Message-----
>From: netdev-owner@vger.kernel.org [mailto:netdev-owner@vger.kernel.org]
>On Behalf Of Maximilian Engelhardt
>Sent: Sunday, April 24, 2011 3:33 PM
>To: linux-kernel@vger.kernel.org
>Cc: netdev@vger.kernel.org; StuStaNet Vorstand
>Subject: Kernel crash after using new Intel NIC (igb)
>
>Hello,
>
>some time ago we switched some of our servers to a new networking card
>that
>uses the Intel igb driver. Since that time we see regular kernel
>crashes.
>The crashes happen at very irregular intervals, sometimes after a week
>uptime,
>sometimes after a month or even more. They seem to be independent of the
>server load as they also happen in the night when there is low traffic.
>
>The affected server is used as a NAT device with some iptables rules and
>serves
>about 2000 people.
>
>Attached are two logs of the crashes as well as the output of dmesg,
>lspci,
>and /proc/interrupts as well as the used kernel config.
>
>I have no idea what might be wrong but I think it is a kernel bug.
>Perhaps
>someone with more knowledge has a clue.
>
>If needed I can provide additional information or build different
>kernels.
>
>Greetings,
>Maxi

Hello,

I'm sorry you're having crashes since installing our NIC.  Thank you for the data.  I haven't had a chance to review it carefully yet, but it looks to me like the crashes have us in the stack sometimes and sometimes not.  I need to do a bit more research and will need some more information.  Can I get an ethtool -i eth# for the device and also lspci -vvv for the platform its installed on.

If you open an issue at SourceForge we will have a place to keep the logs.

I will research this a bit more and get back to you tomorrow my time.

Thanks,

Carolyn
Carolyn Wyborny
Linux Development
LAN Access Division
Intel Corporation





^ permalink raw reply

* Re: [PATCH] net: fix netdev_increment_features()
From: Ben Hutchings @ 2011-04-27  0:26 UTC (permalink / raw)
  To: Michał Mirosław
  Cc: netdev, Jay Vosburgh, Andy Gospodarek, Stephen Hemminger, bridge
In-Reply-To: <20110422163116.3C5C3138DD@rere.qmqm.pl>

On Fri, 2011-04-22 at 18:31 +0200, Michał Mirosław wrote:
> Simplify and fix netdev_increment_features() to conform to what is
> stated in netdevice.h comments about NETIF_F_ONE_FOR_ALL.
> Include FCoE segmentation and VLAN-challedged flags in computation.
> 
> Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
> ---
> 
> I noticed this while converting bridge code to hw_features. After
> adding devices to the bridge all features were always cleared regardless
> of features active in those devices.
> 
>  include/linux/netdevice.h |    7 ++++++-
>  net/core/dev.c            |   37 ++++++++++++-------------------------
>  2 files changed, 18 insertions(+), 26 deletions(-)
> 
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 405ce21..86140e4 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -1107,7 +1107,12 @@ struct net_device {
>  	 */
>  #define NETIF_F_ONE_FOR_ALL	(NETIF_F_GSO_SOFTWARE | NETIF_F_GSO_ROBUST | \
>  				 NETIF_F_SG | NETIF_F_HIGHDMA |		\
> -				 NETIF_F_FRAGLIST)
> +				 NETIF_F_FRAGLIST | NETIF_F_VLAN_CHALLENGED)

I think we can also add NETIF_F_HW_VLAN_TX, now that
dev_hard_start_xmit() handles VLAN tag insertion.

> +	/*
> +	 * If one device doesn't support one of these features, then disable it
> +	 * for all in netdev_increment_features.
> +	 */
> +#define NETIF_F_ALL_FOR_ALL	(NETIF_F_NOCACHE_COPY | NETIF_F_FSO)

Shouldn't most RX offloads be included in this?  Though it doesn't
really matter that much.

>  	/* changeable features with no special hardware requirements */
>  #define NETIF_F_SOFT_FEATURES	(NETIF_F_GSO | NETIF_F_GRO)
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 3871bf6..1a2c91c 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -6184,33 +6184,20 @@ static int dev_cpu_callback(struct notifier_block *nfb,
>   */
>  u32 netdev_increment_features(u32 all, u32 one, u32 mask)
>  {
> +	if (mask & NETIF_F_GEN_CSUM)
> +		mask |= NETIF_F_ALL_CSUM;
> +	mask |= NETIF_F_VLAN_CHALLENGED;
> +
> +	all |= one & (NETIF_F_ONE_FOR_ALL|NETIF_F_ALL_CSUM) & mask;
> +	all &= one | ~NETIF_F_ALL_FOR_ALL;
> +
>  	/* If device needs checksumming, downgrade to it. */
> -	if (all & NETIF_F_NO_CSUM && !(one & NETIF_F_NO_CSUM))
> -		all ^= NETIF_F_NO_CSUM | (one & NETIF_F_ALL_CSUM);
> -	else if (mask & NETIF_F_ALL_CSUM) {
> -		/* If one device supports v4/v6 checksumming, set for all. */
> -		if (one & (NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM) &&
> -		    !(all & NETIF_F_GEN_CSUM)) {
> -			all &= ~NETIF_F_ALL_CSUM;
> -			all |= one & (NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM);
> -		}
> +	if (all & (NETIF_F_ALL_CSUM & ~NETIF_F_NO_CSUM))
> +		all &= ~NETIF_F_NO_CSUM;
>  
> -		/* If one device supports hw checksumming, set for all. */
> -		if (one & NETIF_F_GEN_CSUM && !(all & NETIF_F_GEN_CSUM)) {
> -			all &= ~NETIF_F_ALL_CSUM;
> -			all |= NETIF_F_HW_CSUM;
> -		}
> -	}
> -
> -	/* If device can't no cache copy, don't do for all */
> -	if (!(one & NETIF_F_NOCACHE_COPY))
> -		all &= ~NETIF_F_NOCACHE_COPY;
> -
> -	one |= NETIF_F_ALL_CSUM;
> -
> -	one |= all & NETIF_F_ONE_FOR_ALL;
> -	all &= one | NETIF_F_LLTX | NETIF_F_GSO | NETIF_F_UFO;
> -	all |= one & mask & NETIF_F_ONE_FOR_ALL;
> +	/* If one device supports hw checksumming, set for all. */
> +	if (all & NETIF_F_GEN_CSUM)
> +		all &= ~(NETIF_F_ALL_CSUM & ~NETIF_F_GEN_CSUM);
>  
>  	return all;
>  }

Well it might be correct, but it's hard to tell.  Can we see your unit
tests, please?

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply

* Re: [PATCH net-next-2.6 0/7] SCTP updates for net-next-2.6
From: Wei Yongjun @ 2011-04-27  0:59 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-sctp
In-Reply-To: <20110426.145120.28826019.davem@davemloft.net>

> Wei, while you are re-spinning this patch set I want to bring up
> something I just noticed in the SCTP code.
>
> The ->dst_saddr() method is not used by anything, it appears.
>
> The ipv4 variant, sctp_v4_dst_saddr() is called internally by the
> ipv4 specific code, but that's it.
>
> So I think the ->dst_saddr member of sctp_pf can be completely
> removed, as can sctp_v6_dst_saddr().
>
> The sctp_v4_dst_saddr() function, of course, will need to be retained.

David, thanks to noticed this. I will cleanup it.

And I have a stupid question about the rule of backport.  Since those
patchs have existed so long time, when I backport those patchs, I'd better
fix the bug in the original patch, or create new patch to fix it? Also how
about some thing need to improvement like the ->dst_saddr() method?

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox