Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH v2] netdev:bfin_mac: reclaim and free tx skb as soon as possible after transfer
From: Sonic Zhang @ 2010-06-04  4:44 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev, uclinux-dist-devel
In-Reply-To: <1275624354.2533.121.camel@edumazet-laptop>

On Fri, Jun 4, 2010 at 12:05 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Le vendredi 04 juin 2010 à 11:29 +0800, Sonic Zhang a écrit :
>> David,
>>
>> Any comments?
>>
>> Thanks
>>
>> Sonic
>>
>> On Thu, Jun 3, 2010 at 11:48 AM, sonic zhang <sonic.adi@gmail.com> wrote:
>> > >From 40560ae9e8db42e2d2259b791ace160534c9a0f2 Mon Sep 17 00:00:00 2001
>> > From: Sonic Zhang <sonic.zhang@analog.com>
>> > Date: Thu, 3 Jun 2010 11:44:33 +0800
>> > Subject: [PATCH v2] netdev:bfin_mac: reclaim and free tx skb as soon as possible after transfer
>> >
>> > SKBs hold onto resources that can't be held indefinitely, such as TCP
>> > socket references and netfilter conntrack state.  So if a packet is left
>> > in TX ring for a long time, there might be a TCP socket that cannot be
>> > closed and freed up.
>> >
>> > Current blackfin EMAC driver always reclaim and free used tx skbs in future
>> > transfers. The problem is that future transfer may not come as soon as
>> > possible. This patch start a timer after transfer to reclaim and free skb.
>> > There is nearly no performance drop with this patch.
>> >
>> > TX interrupt is not enabled for 2 reasons:
>> >
>> > 1) If Blackfin EMAC TX transfer control is turned on, endless TX
>> > interrupts are triggered no matter if TX DMA is enabled. Since DMA walks
>> > down the ring automatically, TX transfer control can't be turned off in the
>> > middle. The only way is to disable TX interrupt completely.
>> >
>> > 2) skb can not be freed from interrupt context. A work queue or tasklet
>> > has to be created, which introduce more overhead than timer only solution.
>> >
>
> Could you elaborate on this second point ?
>
> skb can be freed from interrupt context using appropriate API :
>
> 1) If from NAPI context, no special care is needed and use
> dev_kfree_skb().
>
> 2) If from hard irq context, use dev_kfree_skb_irq() :
>

Yes, you are right. dev_kfree_skb_irq() queues used skb to the
complete queue. But, it is actually freed in the other soft irq
NET_TX_SOFTIRQ.


> With recent changes, skb is probably already orphaned and can be freed
> immediately.
>
> In the unlikely case it is not yet orphaned, skb is queued in
> softnet_data.completion_queue.
>
>
>
>

^ permalink raw reply

* Re: linux-next: manual merge of the wireless tree with the net tree
From: Luciano Coelho @ 2010-06-04  6:14 UTC (permalink / raw)
  To: ext Stephen Rothwell
  Cc: John W. Linville, linux-next@vger.kernel.org,
	linux-kernel@vger.kernel.org, Oikarinen Juuso (Nokia-D/Tampere),
	Eric Dumazet, David Miller, netdev@vger.kernel.org
In-Reply-To: <20100604114801.df31fb50.sfr@canb.auug.org.au>

On Fri, 2010-06-04 at 03:48 +0200, ext Stephen Rothwell wrote:
> Hi John,
> 
> Today's linux-next merge of the wireless tree got a conflict in
> drivers/net/wireless/wl12xx/wl1271.h
> drivers/net/wireless/wl12xx/wl1271_cmd.h between commit
> ba2d3587912f82d1ab4367975b1df460db60fb1e ("drivers/net: use __packed
> annotation") from the net tree and commit
> eb70eb723b489dd4e233e22e47d993f59858cdd8 ("wl1271: Update handling of the
> NVS file / INI parameters") from the wireless tree.
> 
> I fixed up some if it (see below) and can carry the fixes as necessary.
> The latter patch also moved some of the structures to another file
> (drivers/net/wireless/wl12xx/wl1271_ini.h), so they will need fixing up
> there.

Acked-by: Luciano Coelho <luciano.coelho@nokia.com>

Thanks for the merge.  I'll submit a new patch fixing the annotation
from the structs that were moved.

-- 
Cheers,
Luca.

^ permalink raw reply

* [PATCH] greth: Remove unnecessary memset of napi member in netdev private data
From: Tobias Klauser @ 2010-06-04  6:24 UTC (permalink / raw)
  To: kristoffer, davem, netdev; +Cc: Tobias Klauser

The memory for the private data is allocated using kzalloc in
alloc_etherdev (or alloc_netdev_mq respectively) so there is no need to
set the napi member to 0 explicitely.

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
---
 drivers/net/greth.c |    1 -
 1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/drivers/net/greth.c b/drivers/net/greth.c
index 3a029d0..4d09eab 100644
--- a/drivers/net/greth.c
+++ b/drivers/net/greth.c
@@ -1555,7 +1555,6 @@ static int __devinit greth_of_probe(struct of_device *ofdev, const struct of_dev
 	}
 
 	/* setup NAPI */
-	memset(&greth->napi, 0, sizeof(greth->napi));
 	netif_napi_add(dev, &greth->napi, greth_poll, 64);
 
 	return 0;
-- 
1.6.3.3


^ permalink raw reply related

* Re: [RFC][PATCH] Fix another namespace issue with devices assigned to classes
From: Johannes Berg @ 2010-06-04  6:54 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Greg KH, netdev, Kay Sievers
In-Reply-To: <m17hmhrl6v.fsf_-_@fess.ebiederm.org>

On Wed, 2010-06-02 at 17:53 -0700, Eric W. Biederman wrote:

> Johannes this should fix your issue with mac80211_hwsim, where
> the device symlink were not destroyed when the driver was removed.

It does, thank you.

FWIW, I'm happy changing hwsim too, but I don't think I quite understand
what you're proposing in your other email so I'll leave it up to you
since you now know what is causing the problem :)

johannes

^ permalink raw reply

* Re: 200 millisecond timeouts in TCP
From: Ryousei Takano @ 2010-06-04  6:58 UTC (permalink / raw)
  To: Ivan Novick; +Cc: netdev
In-Reply-To: <AANLkTils4KHXwX79cLG6UvVOaDdddMqx-1tyUL3VtLQC@mail.gmail.com>

On Fri, Jun 4, 2010 at 7:37 AM, Ivan Novick <novickivan@gmail.com> wrote:
> Hello,
>
> Using tcpdump and systemtap I am seeing that sometimes retransmission
> of data is sent after waiting 200 milliseconds.  However sometimes
> retransmissions happen quicker.
>
> Is there a specifc event that causes these 200 milisec delays to kick
> in?  Are those events identifiable in netstat -s output?
>
> Also do you know if the timeout numbers for TCP are configurable parameters?
>
The minimum RTO value is fixed to 200 ms.  It is useful to make the min/max
RTO values tunable.  For example, reducing the minimum RTO value is effective
for TCP incast problem [1].  Of course, it may occur spurious retransmissions.

[1] Vijay Vasudevan, et al, Safe and Effective Fine-grained TCP Retransmissions
for Datacenter Communication, SIGCOMM2009

In Solaris, there are two tunable parameters: tcp_rexmit_interval_min/max.

Do you have plan to introduce sysctl parameters like these to the Linux.

Thanks,
Ryousei

> I am testing on RHEL5 with this kernel: 2.6.18-164.15.1.el5.
>
> Cheers,
> Ivan Novick
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply

* Re: [RFC][PATCH] Fix another namespace issue with devices assigned to classes
From: Kay Sievers @ 2010-06-04  8:15 UTC (permalink / raw)
  To: Johannes Berg; +Cc: Eric W. Biederman, Greg KH, netdev
In-Reply-To: <1275634452.5189.1.camel@jlt3.sipsolutions.net>

On Fri, Jun 4, 2010 at 08:54, Johannes Berg <johannes@sipsolutions.net> wrote:
> On Wed, 2010-06-02 at 17:53 -0700, Eric W. Biederman wrote:
>
>> Johannes this should fix your issue with mac80211_hwsim, where
>> the device symlink were not destroyed when the driver was removed.
>
> It does, thank you.
>
> FWIW, I'm happy changing hwsim too, but I don't think I quite understand
> what you're proposing in your other email so I'll leave it up to you
> since you now know what is causing the problem :)

Assuming that hwsim is th parent of the network interface, it should
us a "struct bus_type" not a "struct class" for the subsystem it
assigns the devices to.

Classes should not be used for anything completely simple, at best not
be used at all, they are just too simple. We never know about future
requirements, which usually all go wrong with the non-extendable class
logic.

The difference in the code to switch from class to bus should be minimal.

Cheers,
Kay

^ permalink raw reply

* [PATCH net-next-2.6] raw: avoid two atomics in xmit
From: Eric Dumazet @ 2010-06-04  8:23 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

Avoid two atomic ops per raw_send_hdrinc() call

Avoid two atomic ops per raw6_send_hdrinc() call

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 net/ipv4/raw.c |    8 +++++---
 net/ipv6/raw.c |   12 +++++++-----
 2 files changed, 12 insertions(+), 8 deletions(-)

diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 2c7a163..66cc3be 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -314,7 +314,7 @@ int raw_rcv(struct sock *sk, struct sk_buff *skb)
 }
 
 static int raw_send_hdrinc(struct sock *sk, void *from, size_t length,
-			struct rtable *rt,
+			struct rtable **rtp,
 			unsigned int flags)
 {
 	struct inet_sock *inet = inet_sk(sk);
@@ -323,6 +323,7 @@ static int raw_send_hdrinc(struct sock *sk, void *from, size_t length,
 	struct sk_buff *skb;
 	unsigned int iphlen;
 	int err;
+	struct rtable *rt = *rtp;
 
 	if (length > rt->u.dst.dev->mtu) {
 		ip_local_error(sk, EMSGSIZE, rt->rt_dst, inet->inet_dport,
@@ -341,7 +342,8 @@ static int raw_send_hdrinc(struct sock *sk, void *from, size_t length,
 
 	skb->priority = sk->sk_priority;
 	skb->mark = sk->sk_mark;
-	skb_dst_set(skb, dst_clone(&rt->u.dst));
+	skb_dst_set(skb, &rt->u.dst);
+	*rtp = NULL;
 
 	skb_reset_network_header(skb);
 	iph = ip_hdr(skb);
@@ -576,7 +578,7 @@ back_from_confirm:
 
 	if (inet->hdrincl)
 		err = raw_send_hdrinc(sk, msg->msg_iov, len,
-					rt, msg->msg_flags);
+					&rt, msg->msg_flags);
 
 	 else {
 		if (!ipc.addr)
diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
index 864eb8e..968b964 100644
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -602,13 +602,14 @@ out:
 }
 
 static int rawv6_send_hdrinc(struct sock *sk, void *from, int length,
-			struct flowi *fl, struct rt6_info *rt,
+			struct flowi *fl, struct dst_entry **dstp,
 			unsigned int flags)
 {
 	struct ipv6_pinfo *np = inet6_sk(sk);
 	struct ipv6hdr *iph;
 	struct sk_buff *skb;
 	int err;
+	struct rt6_info *rt = (struct rt6_info *)*dstp;
 
 	if (length > rt->u.dst.dev->mtu) {
 		ipv6_local_error(sk, EMSGSIZE, fl, rt->u.dst.dev->mtu);
@@ -626,7 +627,8 @@ static int rawv6_send_hdrinc(struct sock *sk, void *from, int length,
 
 	skb->priority = sk->sk_priority;
 	skb->mark = sk->sk_mark;
-	skb_dst_set(skb, dst_clone(&rt->u.dst));
+	skb_dst_set(skb, &rt->u.dst);
+	*dstp = NULL;
 
 	skb_put(skb, length);
 	skb_reset_network_header(skb);
@@ -886,9 +888,9 @@ static int rawv6_sendmsg(struct kiocb *iocb, struct sock *sk,
 		goto do_confirm;
 
 back_from_confirm:
-	if (inet->hdrincl) {
-		err = rawv6_send_hdrinc(sk, msg->msg_iov, len, &fl, (struct rt6_info*)dst, msg->msg_flags);
-	} else {
+	if (inet->hdrincl)
+		err = rawv6_send_hdrinc(sk, msg->msg_iov, len, &fl, &dst, msg->msg_flags);
+	else {
 		lock_sock(sk);
 		err = ip6_append_data(sk, ip_generic_getfrag, msg->msg_iov,
 			len, 0, hlimit, tclass, opt, &fl, (struct rt6_info*)dst,



^ permalink raw reply related

* Re: [RFC][PATCH] Fix another namespace issue with devices assigned to  classes
From: Johannes Berg @ 2010-06-04  8:28 UTC (permalink / raw)
  To: Kay Sievers; +Cc: Eric W. Biederman, Greg KH, netdev
In-Reply-To: <AANLkTingSwM7e8ldyT1FplF5WCHBsCBCwNkb5ZPHe6hB@mail.gmail.com>

On Fri, 2010-06-04 at 10:15 +0200, Kay Sievers wrote:
> On Fri, Jun 4, 2010 at 08:54, Johannes Berg <johannes@sipsolutions.net> wrote:
> > On Wed, 2010-06-02 at 17:53 -0700, Eric W. Biederman wrote:
> >
> >> Johannes this should fix your issue with mac80211_hwsim, where
> >> the device symlink were not destroyed when the driver was removed.
> >
> > It does, thank you.
> >
> > FWIW, I'm happy changing hwsim too, but I don't think I quite understand
> > what you're proposing in your other email so I'll leave it up to you
> > since you now know what is causing the problem :)
> 
> Assuming that hwsim is th parent of the network interface, it should
> us a "struct bus_type" not a "struct class" for the subsystem it
> assigns the devices to.

It's all virtual, so yeah, I guess it is the parent? It currently
creates a virtual struct device in the hwsim class and assigns that to
the netdev parent indirectly via the wiphy or something like that.

> Classes should not be used for anything completely simple, at best not
> be used at all, they are just too simple. We never know about future
> requirements, which usually all go wrong with the non-extendable class
> logic.
> 
> The difference in the code to switch from class to bus should be minimal.

Does that mean cfg80211 (net/wireless/) should also not use a struct
class? I'm not familiar with any of these details, mind helping me out?

johannes

^ permalink raw reply

* Re: [RFC][PATCH] Fix another namespace issue with devices assigned to classes
From: Kay Sievers @ 2010-06-04  8:34 UTC (permalink / raw)
  To: Johannes Berg; +Cc: Eric W. Biederman, Greg KH, netdev
In-Reply-To: <1275640113.9953.8.camel@jlt3.sipsolutions.net>

On Fri, Jun 4, 2010 at 10:28, Johannes Berg <johannes@sipsolutions.net> wrote:
> On Fri, 2010-06-04 at 10:15 +0200, Kay Sievers wrote:
>> On Fri, Jun 4, 2010 at 08:54, Johannes Berg <johannes@sipsolutions.net> wrote:
>> > On Wed, 2010-06-02 at 17:53 -0700, Eric W. Biederman wrote:
>> >
>> >> Johannes this should fix your issue with mac80211_hwsim, where
>> >> the device symlink were not destroyed when the driver was removed.
>> >
>> > It does, thank you.
>> >
>> > FWIW, I'm happy changing hwsim too, but I don't think I quite understand
>> > what you're proposing in your other email so I'll leave it up to you
>> > since you now know what is causing the problem :)
>>
>> Assuming that hwsim is th parent of the network interface, it should
>> us a "struct bus_type" not a "struct class" for the subsystem it
>> assigns the devices to.
>
> It's all virtual, so yeah, I guess it is the parent? It currently
> creates a virtual struct device in the hwsim class and assigns that to
> the netdev parent indirectly via the wiphy or something like that.
>
>> Classes should not be used for anything completely simple, at best not
>> be used at all, they are just too simple. We never know about future
>> requirements, which usually all go wrong with the non-extendable class
>> logic.
>>
>> The difference in the code to switch from class to bus should be minimal.
>
> Does that mean cfg80211 (net/wireless/) should also not use a struct
> class? I'm not familiar with any of these details, mind helping me out?

Everything that might ever have a child device, or might ever need a
sysfs attribute gobal to the subsystem must convert to a bus.

Actually there is not much reason to ever use "struct class" today.
The layout for classes in /sys is not extendable like it is for buses
which put all devices in a devices/ subdir and have the main subsystem
directory to add custom things.

Kay

^ permalink raw reply

* xfrm: warning on negative dst refcount
From: Steffen Klassert @ 2010-06-04 10:40 UTC (permalink / raw)
  To: Eric Dumazet, David Miller; +Cc: netdev


I see the warning below frequently when I'm running iperf on a IPsec
connection. It seems that dst_pop() drops a refcount on a noref
dst_entry. I was able to fix this by changing dst_pop() to a new
function skb_dst_pop() which uses skb_dst_drop(skb) instead of
dst_release(dst) to drop the reference if necessary. I'll send the
patch that fixed the issue for me in repy to this mail.
I don't know that much about the noref work, so I'm not sure whether
this is the right fix, but I got rid of the warning at least.

Steffen

Jun  4 10:21:24 mainline kernel: [ 1334.203913] WARNING: at /home/secunet/git/linux-sinafe-2.6/net/core/dst.c:276 xfrm_output_resume+0x2d3/0x35e()
Jun  4 10:21:24 mainline kernel: [ 1334.203915] Hardware name:         
Jun  4 10:21:24 mainline kernel: [ 1334.203916] Modules linked in: authenc esp4 xfrm4_mode_tunnel aes_x86_64 aes_generic cbc sha1_generic xfrm_user ipv6 acpi_cpufreq mperf cpufreq_userspace cpufreq_stats cpufreq_ondemand freq_table cpufreq_conservative cpufreq_powersave container fan video output sbs sbshc battery af_packet ac fuse loop option usb_wwan usbserial sr_mod cdrom iTCO_wdt ehci_hcd thermal uhci_hcd serio_raw psmouse tpm_tis tpm tpm_bios iTCO_vendor_support pcspkr processor thermal_sys evdev usbcore button ata_generic
Jun  4 10:21:24 mainline kernel: [ 1334.203944] Pid: 3337, comm: dd Tainted: G        W   2.6.35-rc1+ #276
Jun  4 10:21:24 mainline kernel: [ 1334.203946] Call Trace:
Jun  4 10:21:24 mainline kernel: [ 1334.203947]  <IRQ>  [<ffffffff81261fbb>] ? xfrm_output_resume+0x2d3/0x35e
Jun  4 10:21:24 mainline kernel: [ 1334.203952]  [<ffffffff81261fbb>] ? xfrm_output_resume+0x2d3/0x35e
Jun  4 10:21:24 mainline kernel: [ 1334.203955]  [<ffffffff8102700d>] ? warn_slowpath_common+0x78/0x8d
Jun  4 10:21:24 mainline kernel: [ 1334.203958]  [<ffffffff81261fbb>] ? xfrm_output_resume+0x2d3/0x35e
Jun  4 10:21:24 mainline kernel: [ 1334.203961]  [<ffffffff8122bbe0>] ? ip_queue_xmit+0x2bc/0x304
Jun  4 10:21:24 mainline kernel: [ 1334.203964]  [<ffffffff8123b2db>] ? tcp_transmit_skb+0x6d2/0x701
Jun  4 10:21:24 mainline kernel: [ 1334.203967]  [<ffffffff81233df9>] ? tcp_cong_avoid+0xe/0x1d
Jun  4 10:21:24 mainline kernel: [ 1334.203970]  [<ffffffff81235bdf>] ? tcp_fin+0x74/0x17c
Jun  4 10:21:24 mainline kernel: [ 1334.203972]  [<ffffffff81236796>] ? tcp_data_queue+0x2fc/0xb64
Jun  4 10:21:24 mainline kernel: [ 1334.203975]  [<ffffffff81239d61>] ? tcp_rcv_state_process+0x927/0x99b
Jun  4 10:21:24 mainline kernel: [ 1334.203978]  [<ffffffff8123f563>] ? tcp_v4_do_rcv+0x18b/0x1d0
Jun  4 10:21:24 mainline kernel: [ 1334.203981]  [<ffffffff81241160>] ? tcp_v4_rcv+0x485/0x721
Jun  4 10:21:24 mainline kernel: [ 1334.203985]  [<ffffffff81227154>] ? ip_local_deliver+0xda/0x165
Jun  4 10:21:24 mainline kernel: [ 1334.203988]  [<ffffffff812276f4>] ? ip_rcv+0x515/0x53d
Jun  4 10:21:24 mainline kernel: [ 1334.203991]  [<ffffffff81209192>] ? process_backlog+0x99/0x174
Jun  4 10:21:24 mainline kernel: [ 1334.203994]  [<ffffffff81208e41>] ? net_rx_action+0xab/0x153
Jun  4 10:21:24 mainline kernel: [ 1334.203997]  [<ffffffff8102b8f5>] ? __do_softirq+0x8d/0x107
Jun  4 10:21:24 mainline kernel: [ 1334.204000]  [<ffffffff81002aec>] ? call_softirq+0x1c/0x28
Jun  4 10:21:24 mainline kernel: [ 1334.204003]  [<ffffffff810043a5>] ? do_softirq+0x31/0x64
Jun  4 10:21:24 mainline kernel: [ 1334.204005]  [<ffffffff8102b826>] ? irq_exit+0x36/0x78
Jun  4 10:21:24 mainline kernel: [ 1334.204008]  [<ffffffff81003c4f>] ? do_IRQ+0xa7/0xbd
Jun  4 10:21:24 mainline kernel: [ 1334.204011]  [<ffffffff8127dad3>] ? ret_from_intr+0x0/0xa
Jun  4 10:21:24 mainline kernel: [ 1334.204012]  <EOI>  [<ffffffff8127d810>] ? _raw_spin_unlock_irqrestore+0x4/0x5
Jun  4 10:21:24 mainline kernel: [ 1334.204018]  [<ffffffff8108e618>] ? pipe_write+0x413/0x447
Jun  4 10:21:24 mainline kernel: [ 1334.204021]  [<ffffffff81087aab>] ? do_sync_write+0xb3/0xf7
Jun  4 10:21:24 mainline kernel: [ 1334.204024]  [<ffffffff810c7007>] ? kmsg_read+0x44/0x4e
Jun  4 10:21:24 mainline kernel: [ 1334.204027]  [<ffffffff8127c26c>] ? schedule+0x525/0x5db
Jun  4 10:21:24 mainline kernel: [ 1334.204030]  [<ffffffff81088120>] ? vfs_write+0xad/0x149
Jun  4 10:21:24 mainline kernel: [ 1334.204033]  [<ffffffff81088669>] ? sys_write+0x45/0x6e
Jun  4 10:21:24 mainline kernel: [ 1334.204035]  [<ffffffff81001ceb>] ? system_call_fastpath+0x16/0x1b
Jun  4 10:21:24 mainline kernel: [ 1334.204037] ---[ end trace ab41f96596a62e77 ]---


^ permalink raw reply

* [PATCH] net: check for refcount if pop a stacked dst_entry
From: Steffen Klassert @ 2010-06-04 10:41 UTC (permalink / raw)
  To: Eric Dumazet, David Miller; +Cc: netdev
In-Reply-To: <20100604104012.GA13408@secunet.com>

xfrm triggers a warning if dst_pop() drops a refcount
on a noref dst. This patch changes dst_pop() to
skb_dst_pop(). skb_dst_pop() drops the refcnt only
on a refcounted dst.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
---
 include/net/dst.h      |    6 +++---
 net/xfrm/xfrm_output.c |    2 +-
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/net/dst.h b/include/net/dst.h
index 612069b..acd1538 100644
--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -250,11 +250,11 @@ static inline void skb_tunnel_rx(struct sk_buff *skb, struct net_device *dev)
  * Linux networking.  Thus, destinations are stackable.
  */
 
-static inline struct dst_entry *dst_pop(struct dst_entry *dst)
+static inline struct dst_entry *skb_dst_pop(struct sk_buff *skb)
 {
-	struct dst_entry *child = dst_clone(dst->child);
+	struct dst_entry *child = dst_clone(skb_dst(skb)->child);
 
-	dst_release(dst);
+	skb_dst_drop(skb);
 	return child;
 }
 
diff --git a/net/xfrm/xfrm_output.c b/net/xfrm/xfrm_output.c
index 6a32915..db62a06 100644
--- a/net/xfrm/xfrm_output.c
+++ b/net/xfrm/xfrm_output.c
@@ -95,7 +95,7 @@ resume:
 			goto error_nolock;
 		}
 
-		dst = dst_pop(dst);
+		dst = skb_dst_pop(skb);
 		if (!dst) {
 			XFRM_INC_STATS(net, LINUX_MIB_XFRMOUTERROR);
 			err = -EHOSTUNREACH;
-- 
1.5.6.5


^ permalink raw reply related

* Re: [PATCH] net: check for refcount if pop a stacked dst_entry
From: Eric Dumazet @ 2010-06-04 10:51 UTC (permalink / raw)
  To: Steffen Klassert; +Cc: David Miller, netdev
In-Reply-To: <20100604104115.GB13408@secunet.com>

Le vendredi 04 juin 2010 à 12:41 +0200, Steffen Klassert a écrit :
> xfrm triggers a warning if dst_pop() drops a refcount
> on a noref dst. This patch changes dst_pop() to
> skb_dst_pop(). skb_dst_pop() drops the refcnt only
> on a refcounted dst.
> 
> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
> ---
>  include/net/dst.h      |    6 +++---
>  net/xfrm/xfrm_output.c |    2 +-
>  2 files changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/include/net/dst.h b/include/net/dst.h
> index 612069b..acd1538 100644
> --- a/include/net/dst.h
> +++ b/include/net/dst.h
> @@ -250,11 +250,11 @@ static inline void skb_tunnel_rx(struct sk_buff *skb, struct net_device *dev)
>   * Linux networking.  Thus, destinations are stackable.
>   */
>  
> -static inline struct dst_entry *dst_pop(struct dst_entry *dst)
> +static inline struct dst_entry *skb_dst_pop(struct sk_buff *skb)
>  {
> -	struct dst_entry *child = dst_clone(dst->child);
> +	struct dst_entry *child = dst_clone(skb_dst(skb)->child);
>  



> -	dst_release(dst);
> +	skb_dst_drop(skb);
>  	return child;
>  }

Hmm, this might fix the thing, but we probably can do it without the
dst_clone(), if you replace the 

skb_dst_set(skb, dst);

by 

skb_dst_set_noref(skb, dst);

in xfrm_output_one() ?

>  
> diff --git a/net/xfrm/xfrm_output.c b/net/xfrm/xfrm_output.c
> index 6a32915..db62a06 100644
> --- a/net/xfrm/xfrm_output.c
> +++ b/net/xfrm/xfrm_output.c
> @@ -95,7 +95,7 @@ resume:
>  			goto error_nolock;
>  		}
>  
> -		dst = dst_pop(dst);
> +		dst = skb_dst_pop(skb);
>  		if (!dst) {
>  			XFRM_INC_STATS(net, LINUX_MIB_XFRMOUTERROR);
>  			err = -EHOSTUNREACH;



^ permalink raw reply

* Re: [PATCH] net: check for refcount if pop a stacked dst_entry
From: Steffen Klassert @ 2010-06-04 11:23 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev
In-Reply-To: <1275648679.2482.43.camel@edumazet-laptop>

On Fri, Jun 04, 2010 at 12:51:19PM +0200, Eric Dumazet wrote:
> 
> Hmm, this might fix the thing, but we probably can do it without the
> dst_clone(), if you replace the 
> 
> skb_dst_set(skb, dst);
> 
> by 
> 
> skb_dst_set_noref(skb, dst);
> 
> in xfrm_output_one() ?
> 

Yes, this should work too. I'll update the patch to use
skb_dst_set_noref() in xfrm_output_one() and remove the 
dst_clone() in skb_dst_pop().

^ permalink raw reply

* no reassembly for outgoing packets on RAW socket
From: Jiri Olsa @ 2010-06-04 11:27 UTC (permalink / raw)
  To: netdev

hi,

I'd like to be able to sendout a single IP packet with MF flag set.

When using RAW sockets the packet will get stuck in the
netfilter (NF_INET_LOCAL_OUT nf_defrag_ipv4 reassembly unit)
and wont ever make it out..

I made a change which bypass the outgoing reassembly for
RAW sockets, but I'm not sure wether it's too invasive..

Is there any standard for RAW sockets behaviour?
Or another way around? :)

thanks,
jirka


---
diff --git a/net/ipv4/netfilter/nf_defrag_ipv4.c b/net/ipv4/netfilter/nf_defrag_ipv4.c
index cb763ae..5ef8ab2 100644
--- a/net/ipv4/netfilter/nf_defrag_ipv4.c
+++ b/net/ipv4/netfilter/nf_defrag_ipv4.c
@@ -74,6 +74,10 @@ static unsigned int ipv4_conntrack_defrag(unsigned int hooknum,
 		return NF_ACCEPT;
 #endif
 #endif
+	/* Do not reassemble for raw sockets. */
+	if (skb->sk && skb->sk->sk_type == SOCK_RAW)
+		return NF_ACCEPT;
+
 	/* Gather fragments. */
 	if (ip_hdr(skb)->frag_off & htons(IP_MF | IP_OFFSET)) {
 		enum ip_defrag_users user = nf_ct_defrag_user(hooknum, skb);
diff --git a/net/ipv4/netfilter/nf_nat_standalone.c b/net/ipv4/netfilter/nf_nat_standalone.c
index beb2581..a9aa19c 100644
--- a/net/ipv4/netfilter/nf_nat_standalone.c
+++ b/net/ipv4/netfilter/nf_nat_standalone.c
@@ -86,8 +86,14 @@ nf_nat_fn(unsigned int hooknum,
 	enum nf_nat_manip_type maniptype = HOOK2MANIP(hooknum);
 
 	/* We never see fragments: conntrack defrags on pre-routing
-	   and local-out, and nf_nat_out protects post-routing. */
-	NF_CT_ASSERT(!(ip_hdr(skb)->frag_off & htons(IP_MF | IP_OFFSET)));
+	   and local-out, and nf_nat_out protects post-routing.
+	   With the exception of RAW sockets. */
+#ifdef CONFIG_NETFILTER_DEBUG
+	int raw  = (skb->sk && skb->sk->sk_type == SOCK_RAW);
+	int frag = (ip_hdr(skb)->frag_off & htons(IP_MF | IP_OFFSET));
+
+	NF_CT_ASSERT(!frag || (frag && raw));
+#endif
 
 	ct = nf_ct_get(skb, &ctinfo);
 	/* Can't track?  It's not due to stress, or conntrack would

^ permalink raw reply related

* Re: [PATCH nf-next-2.6] netfilter: vmalloc_node cleanup
From: Patrick McHardy @ 2010-06-04 11:32 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, Netfilter Development Mailinglist
In-Reply-To: <1275454950.2638.121.camel@edumazet-laptop>

Eric Dumazet wrote:
> Using vmalloc_node(size, numa_node_id()) for temporary storage is not
> needed. vmalloc(size) is more respectful of user NUMA policy.

Applied, thanks Eric.

^ permalink raw reply

* Re: [RFC nf-next-2.6] conntrack: per cpu nf_conntrack_untracked
From: Patrick McHardy @ 2010-06-04 11:40 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Netfilter Developers, netdev
In-Reply-To: <1275409203.2738.227.camel@edumazet-laptop>

Eric Dumazet wrote:
> Le mardi 01 juin 2010 à 12:41 +0200, Patrick McHardy a écrit :
> 
>>> BTW, I notice nf_conntrack_untracked is incorrectly annotated
>>> '__read_mostly'.
>>>
>>> It can be written very often :(
>>>
>>> Should'nt we special case it and let be really const ?
>> That would need quite a bit of special-casing to avoid touching
>> the reference counts. So far this is completely hidden, so I'd
>> say it just shouldn't be marked __read_mostly. Alternatively we
>> can make "untracked" a nfctinfo state.
> 
> I tried this suggestion, (a new IP_CT_UNTRACKED ctinfo), over a per_cpu
> untracked ct, but its a bit hard.
> 
> For example, I cannot find a way to change ctnetlink_conntrack_event() :
> 
> 	if (ct == &nf_conntrack_untracked)
> 		return 0;
> 
> Maybe this piece of code is not necessary, we should not come here
> anyway, or it means several packets could store events for this (shared)
> ct ?

We probably shouldn't be reaching that code since that would mean
that we previously did modifications to the untracked conntrack.
But a quick audit shows that f.i. xt_connmark will do just that.

> Obviously, an IPS_UNTRACKED bit would be much easier to implement.
> Would it be acceptable ?

That also would be fine. However the main idea behind using a nfctinfo
bit was that we wouldn't need the untracked conntrack anymore at all.
But I guess a per-cpu untrack conntrack would already be an improvement
over the current situation.

^ permalink raw reply

* [PATCH v2] net: check for refcount if pop a stacked dst_entry
From: Steffen Klassert @ 2010-06-04 11:57 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev
In-Reply-To: <1275648679.2482.43.camel@edumazet-laptop>

xfrm triggers a warning if dst_pop() drops a refcount
on a noref dst. This patch changes dst_pop() to
skb_dst_pop(). skb_dst_pop() drops the refcnt only
on a refcounted dst. Also we don't clone the child
dst_entry, so it is not refcounted and we can use
skb_dst_set_noref() in xfrm_output_one().

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
---
 include/net/dst.h      |    6 +++---
 net/xfrm/xfrm_output.c |    4 ++--
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/net/dst.h b/include/net/dst.h
index 612069b..81d1413 100644
--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -250,11 +250,11 @@ static inline void skb_tunnel_rx(struct sk_buff *skb, struct net_device *dev)
  * Linux networking.  Thus, destinations are stackable.
  */
 
-static inline struct dst_entry *dst_pop(struct dst_entry *dst)
+static inline struct dst_entry *skb_dst_pop(struct sk_buff *skb)
 {
-	struct dst_entry *child = dst_clone(dst->child);
+	struct dst_entry *child = skb_dst(skb)->child;
 
-	dst_release(dst);
+	skb_dst_drop(skb);
 	return child;
 }
 
diff --git a/net/xfrm/xfrm_output.c b/net/xfrm/xfrm_output.c
index 6a32915..a3cca0a 100644
--- a/net/xfrm/xfrm_output.c
+++ b/net/xfrm/xfrm_output.c
@@ -95,13 +95,13 @@ resume:
 			goto error_nolock;
 		}
 
-		dst = dst_pop(dst);
+		dst = skb_dst_pop(skb);
 		if (!dst) {
 			XFRM_INC_STATS(net, LINUX_MIB_XFRMOUTERROR);
 			err = -EHOSTUNREACH;
 			goto error_nolock;
 		}
-		skb_dst_set(skb, dst);
+		skb_dst_set_noref(skb, dst);
 		x = dst->xfrm;
 	} while (x && !(x->outer_mode->flags & XFRM_MODE_FLAG_TUNNEL));
 
-- 
1.5.6.5


^ permalink raw reply related

* Re: no reassembly for outgoing packets on RAW socket
From: Patrick McHardy @ 2010-06-04 12:03 UTC (permalink / raw)
  To: Jiri Olsa; +Cc: netdev
In-Reply-To: <20100604112708.GA1958@jolsa.lab.eng.brq.redhat.com>

Jiri Olsa wrote:
> hi,
> 
> I'd like to be able to sendout a single IP packet with MF flag set.
> 
> When using RAW sockets the packet will get stuck in the
> netfilter (NF_INET_LOCAL_OUT nf_defrag_ipv4 reassembly unit)
> and wont ever make it out..
> 
> I made a change which bypass the outgoing reassembly for
> RAW sockets, but I'm not sure wether it's too invasive..

That would break reassembly (and thus connection tracking) for cases
where its really intended.

> Is there any standard for RAW sockets behaviour?
> Or another way around? :)

You could use the NOTRACK target to bypass connection tracking.

^ permalink raw reply

* [PATCH net-next-2.6] net: Remove unnecessary net action assertion
From: jamal @ 2010-06-04 12:06 UTC (permalink / raw)
  To: davem; +Cc: Herbert Xu, Jiri Pirko, netdev

[-- Attachment #1: Type: text/plain, Size: 75 bytes --]

I will followup with another (independent) change on pedit.

cheers,
jamal

[-- Attachment #2: act-nomunge --]
[-- Type: text/plain, Size: 1141 bytes --]

commit 45c644796fe2aa834918b15d7b41e57ccf86c1b3
Author: Jamal Hadi Salim <hadi@cyberus.ca>
Date:   Fri Jun 4 06:16:30 2010 -0400

    net: Remove unnecessary net action assertion
    
    The extra assertion to allow packet munging only when there are
    no other ptypes listening which may have worked around an old bug
    is unnecessary. It is sufficient to check if the skb is cloned before
    trampling on it. Thanks to Herbert Xu for being persistent and patient
    in getting this across.
    [Note that cloning checks and assertions are the general rule used
    by tc actions (documentation/networking/tc-actions-env-rules.txt)].
    
    Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>

diff --git a/net/core/dev.c b/net/core/dev.c
index ec01a59..b272752 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2662,9 +2662,6 @@ static inline struct sk_buff *handle_ing(struct sk_buff *skb,
 	if (*pt_prev) {
 		*ret = deliver_skb(skb, *pt_prev, orig_dev);
 		*pt_prev = NULL;
-	} else {
-		/* Huh? Why does turning on AF_PACKET affect this? */
-		skb->tc_verd = SET_TC_OK2MUNGE(skb->tc_verd);
 	}
 
 	switch (ing_filter(skb)) {

^ permalink raw reply related

* Re: [PATCH v2] net: check for refcount if pop a stacked dst_entry
From: Eric Dumazet @ 2010-06-04 12:06 UTC (permalink / raw)
  To: Steffen Klassert; +Cc: David Miller, netdev
In-Reply-To: <20100604115737.GA5570@secunet.com>

Le vendredi 04 juin 2010 à 13:57 +0200, Steffen Klassert a écrit :
> xfrm triggers a warning if dst_pop() drops a refcount
> on a noref dst. This patch changes dst_pop() to
> skb_dst_pop(). skb_dst_pop() drops the refcnt only
> on a refcounted dst. Also we don't clone the child
> dst_entry, so it is not refcounted and we can use
> skb_dst_set_noref() in xfrm_output_one().
> 
> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
> ---

Thanks a lot Steffen !

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>




^ permalink raw reply

* still having r8169 woes with XID 18000000
From: Timo Teräs @ 2010-06-04 12:10 UTC (permalink / raw)
  To: françois romieu; +Cc: netdev

Hi,

After fixing the MAC issues earlier, I'm still seeing some weird trouble
with my RTL8169sc/8110sc / XID 18000000 boards.

The box(es) were originally running 2.6.30.x kernel and everything
worked without major problems. But after upgrading to 2.6.32.x (and even
with most of the newer fixes included too), it seems that the sometimes
(not too often) some of the interfaces just won't work after reboot
(cold or hard). It's a 3-in-1 board, and usually when this happens one
of the interfaces won't work but the other two do work.

Whenever an interface is "broken", the following conditions are true:
 - forcing it to 10mbit/s and disabling autoneg will make it work
 - when it's not working ethtool -S reports rx_errors and align_errors
increasing
 - when autoneg is on, ethtool says that "Link Detected: no"

I do see that between the above two kernel versions PHY init code was
introduced for this XID (RTL_GIGA_MAC_VER_05) in commit 2e955856ff. I
wonder if it makes sense trying to revert this. Should the NIC still
work? Do you see any other suspicious commits?

Unrelated, I notice that if I have 1GB links autonegotiated, 'ip link'
will show them as "state UNKNOWN". Forced manual links get "state UP" as
expected. This is for the operstate field.

- Timo

^ permalink raw reply

* Re: [RFC nf-next-2.6] conntrack: per cpu nf_conntrack_untracked
From: Changli Gao @ 2010-06-04 12:10 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: Eric Dumazet, Netfilter Developers, netdev
In-Reply-To: <4C08E62A.9020607@trash.net>

On Fri, Jun 4, 2010 at 7:40 PM, Patrick McHardy <kaber@trash.net> wrote:
> Eric Dumazet wrote:
>> Obviously, an IPS_UNTRACKED bit would be much easier to implement.
>> Would it be acceptable ?
>
> That also would be fine. However the main idea behind using a nfctinfo
> bit was that we wouldn't need the untracked conntrack anymore at all.
> But I guess a per-cpu untrack conntrack would already be an improvement
> over the current situation.

I think Eric didn't mean ip_conntrack_info but ip_conntrack_status
bit. Since we have had a IPS_TEMPLATE bit, I think another
IPS_UNTRACKED bit is also acceptable.

-- 
Regards,
Changli Gao(xiaosuo@gmail.com)

^ permalink raw reply

* Re: [RFC nf-next-2.6] conntrack: per cpu nf_conntrack_untracked
From: Patrick McHardy @ 2010-06-04 12:29 UTC (permalink / raw)
  To: Changli Gao; +Cc: Eric Dumazet, Netfilter Developers, netdev
In-Reply-To: <AANLkTilKEpFTg5bH8d9UE3a3DVJNAGYz10Jgkt6lXtJ0@mail.gmail.com>

Changli Gao wrote:
> On Fri, Jun 4, 2010 at 7:40 PM, Patrick McHardy <kaber@trash.net> wrote:
>> Eric Dumazet wrote:
>>> Obviously, an IPS_UNTRACKED bit would be much easier to implement.
>>> Would it be acceptable ?
>> That also would be fine. However the main idea behind using a nfctinfo
>> bit was that we wouldn't need the untracked conntrack anymore at all.
>> But I guess a per-cpu untrack conntrack would already be an improvement
>> over the current situation.
> 
> I think Eric didn't mean ip_conntrack_info but ip_conntrack_status
> bit. Since we have had a IPS_TEMPLATE bit, I think another
> IPS_UNTRACKED bit is also acceptable.

Yes, of course. But using one of these bits implies that we'd still
have the untracked conntrack.

^ permalink raw reply

* Re: [RFC nf-next-2.6] conntrack: per cpu nf_conntrack_untracked
From: Eric Dumazet @ 2010-06-04 12:36 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: Changli Gao, Netfilter Developers, netdev
In-Reply-To: <4C08F1A4.2050906@trash.net>

Le vendredi 04 juin 2010 à 14:29 +0200, Patrick McHardy a écrit :
> Changli Gao wrote:
> > On Fri, Jun 4, 2010 at 7:40 PM, Patrick McHardy <kaber@trash.net> wrote:
> >> Eric Dumazet wrote:
> >>> Obviously, an IPS_UNTRACKED bit would be much easier to implement.
> >>> Would it be acceptable ?
> >> That also would be fine. However the main idea behind using a nfctinfo
> >> bit was that we wouldn't need the untracked conntrack anymore at all.
> >> But I guess a per-cpu untrack conntrack would already be an improvement
> >> over the current situation.
> > 
> > I think Eric didn't mean ip_conntrack_info but ip_conntrack_status
> > bit. Since we have had a IPS_TEMPLATE bit, I think another
> > IPS_UNTRACKED bit is also acceptable.
> 
> Yes, of course. But using one of these bits implies that we'd still
> have the untracked conntrack.

Yes, it was my idea, with a per_cpu untracked conntrack.

I'll submit a patch, thanks.



--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: still having r8169 woes with XID 18000000
From: Phil Sutter @ 2010-06-04 12:36 UTC (permalink / raw)
  To: Timo Teräs; +Cc: françois romieu, netdev
In-Reply-To: <4C08ED47.1030800@iki.fi>

Hi,

On Fri, Jun 04, 2010 at 03:10:47PM +0300, Timo Teräs wrote:
> After fixing the MAC issues earlier, I'm still seeing some weird trouble
> with my RTL8169sc/8110sc / XID 18000000 boards.
> 
> The box(es) were originally running 2.6.30.x kernel and everything
> worked without major problems. But after upgrading to 2.6.32.x (and even
> with most of the newer fixes included too), it seems that the sometimes
> (not too often) some of the interfaces just won't work after reboot
> (cold or hard). It's a 3-in-1 board, and usually when this happens one
> of the interfaces won't work but the other two do work.
> 
> Whenever an interface is "broken", the following conditions are true:
>  - forcing it to 10mbit/s and disabling autoneg will make it work
>  - when it's not working ethtool -S reports rx_errors and align_errors
> increasing
>  - when autoneg is on, ethtool says that "Link Detected: no"

This (your last point) is about what we were experiencing at work using
PCI-based Gigabit Realtek NICs. Our solution to the problem was to
switch to a different NIC (Intel e1000), which obviously solves any
problems. ;)

But I've done some tests before, mainly being inspired by these mails:
http://permalink.gmane.org/gmane.linux.network/160136
http://permalink.gmane.org/gmane.linux.network/160280
and after some feedback from the mainboard manufacturer I've tested the
out-of-tree driver Realtek provides (version 6.013.00), which seems to
not have this issue. Very interesting results show up when comparing
6.013 with 6.012 (citing myself):

Comparing r8169-6.013 with it's predecessor 6.012, you'll find a newly
enabled function rtl8169_phy_power_up() as well as some more invocations
of rtl8169_phy_power_down().

This is probably the solution to these (at least in our case) very
sporadic, but highly annoying, problems. In fact, when our NIC didn't
detect any link, it needed a full power-cycle (no success with
reset-button), so almost not workaroundable.

HTH, Phil

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox