* Re: Packetlost when "tc qdisc del dev eth0 root"
From: Jarek Poplawski @ 2008-01-16 8:02 UTC (permalink / raw)
To: Patrick McHardy; +Cc: Badalian Vyacheslav, netdev
In-Reply-To: <478D8FC8.9000107@trash.net>
On Wed, Jan 16, 2008 at 06:02:00AM +0100, Patrick McHardy wrote:
...
> This would need support from the qdiscs to do it properly. Looks
> non-trivial for HTB/HFSC/CBQ, but the others shouldn't be that hard.
Yes. At first I thought this would need quite a lot of work, but it
seems something very simple could probably be used too, e.g. a
'dummy' sched switcher which, after replacing the old qdisc as root
and knowing the pointer to the new one, could simply dequeue from the
old one and use the new one for everything else. Then, once the old
qdisc is completely dequeued, it would call destroy for it and
probably switch itself with the new one as root. If the new one were
created temporarily, e.g. on a dummyX dev, and the switch qdisc added
to dummyY (as a temporary holder) with ethX and dummyX as parameters,
it seems this could be done without any API changes.
(But, of course, something more sophisticated would be even better.)
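The switcher idea above can be sketched in userspace as a toy model. Everything here (toy_qdisc, toy_switch, the function names, the fixed queue length) is hypothetical and only illustrates the ordering: new packets go to the new queue, dequeue drains the old queue first and lets go of it once it is empty. It is not kernel code and does not use the real Qdisc API.

```c
#include <assert.h>
#include <stddef.h>

/* Toy FIFO standing in for a qdisc; NOT the kernel's struct Qdisc. */
#define TOY_QLEN 8

struct toy_qdisc {
    int pkts[TOY_QLEN];
    size_t head, count;
};

static int toy_enqueue(struct toy_qdisc *q, int pkt)
{
    if (q->count == TOY_QLEN)
        return -1;                  /* queue full: packet dropped */
    q->pkts[(q->head + q->count++) % TOY_QLEN] = pkt;
    return 0;
}

static int toy_dequeue(struct toy_qdisc *q, int *pkt)
{
    if (q->count == 0)
        return -1;                  /* nothing queued */
    *pkt = q->pkts[q->head];
    q->head = (q->head + 1) % TOY_QLEN;
    q->count--;
    return 0;
}

/* Hypothetical switcher: keeps the old qdisc until it has drained. */
struct toy_switch {
    struct toy_qdisc *old_q;        /* NULL once drained and released */
    struct toy_qdisc *new_q;
};

/* New traffic always goes to the new qdisc. */
static int switch_enqueue(struct toy_switch *s, int pkt)
{
    assert(s->new_q != NULL);
    return toy_enqueue(s->new_q, pkt);
}

/* Drain the old qdisc first; drop the reference once it is empty. */
static int switch_dequeue(struct toy_switch *s, int *pkt)
{
    if (s->old_q) {
        if (toy_dequeue(s->old_q, pkt) == 0)
            return 0;
        s->old_q = NULL;            /* stand-in for destroying the old qdisc */
    }
    return toy_dequeue(s->new_q, pkt);
}
```

In this model no packet already queued in the old qdisc is lost, at the cost of the old qdisc's packets being sent before anything queued in the new one.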
Regards,
Jarek P.
^ permalink raw reply
* Re: Packetlost when "tc qdisc del dev eth0 root"
From: Patrick McHardy @ 2008-01-16 8:05 UTC (permalink / raw)
To: Jarek Poplawski; +Cc: Badalian Vyacheslav, netdev
In-Reply-To: <20080116080259.GB1638@ff.dom.local>
Jarek Poplawski wrote:
> On Wed, Jan 16, 2008 at 06:02:00AM +0100, Patrick McHardy wrote:
> ...
>> This would need support from the qdiscs to do it properly. Looks
>> non-trivial for HTB/HFSC/CBQ, but the others shouldn't be that hard.
>
> Yes. At first I thought this would need quite a lot of work, but it
> seems something very simple could probably be used too, e.g. a
> 'dummy' sched switcher which, after replacing the old qdisc as root
> and knowing the pointer to the new one, could simply dequeue from the
> old one and use the new one for everything else. Then, once the old
> qdisc is completely dequeued, it would call destroy for it and
> probably switch itself with the new one as root. If the new one were
> created temporarily, e.g. on a dummyX dev, and the switch qdisc added
> to dummyY (as a temporary holder) with ethX and dummyX as parameters,
> it seems this could be done without any API changes.
> (But, of course, something more sophisticated would be even better.)
Yes, that's one possibility (without the dummy device though please).
But I wonder what this would actually be useful for. I don't think
replacing the root qdisc by a different type is a common scenario,
for the same type you can simply use "tc qdisc change", "tc class
change" and "tc class replace".
Badalian, what are you actually doing?
^ permalink raw reply
* Re: Packetlost when "tc qdisc del dev eth0 root"
From: Badalian Vyacheslav @ 2008-01-16 8:35 UTC (permalink / raw)
To: Patrick McHardy; +Cc: Jarek Poplawski, netdev
In-Reply-To: <478DBADB.1080200@trash.net>
>
>
> Yes, that's one possibility (without the dummy device though please).
> But I wonder what this would actually be useful for. I don't think
> replacing the root qdisc by a different type is a common scenario,
> for the same type you can simply use "tc qdisc change", "tc class
> change" and "tc class replace".
>
> Badalian, what are you actually doing?
>
Sorry, resending to all.
I simply recreate all the rules. I moved away from doing many
add/change/delete operations because I got many kernel panics on many
2.6.x kernels. First I had panics on the "delete filter" operation; that
was fixed, great. Then I had panics on the "delete htb" operation, and I
waited a long time for a fix (I think the tc maintainer did not have
time for it), with 1-5 panics a day. I think my clients hate me =) So I
rewrote the script to simply recreate all the rules. With this method I
get a small packet loss once an hour (when the root qdisc is recreated),
but no panics. Now I use the new script logic =)
Maybe I need to go back to the more accurate script logic, but I am
afraid of kernel panics =)
P.S. Feature request:
tc class show [ dev STRING ] [ root | parent CLASSID ]
Could a classid filter be added to show?
I need to do something like this
/sbin/tc -s class show dev eth0 | grep -A2 "htb $HTB "
to get the stats of one class, but every run really walks the whole
table, which is not good.
Thanks for your interest in the history of my problems =)
Slavon.
^ permalink raw reply
* Re: Packetlost when "tc qdisc del dev eth0 root"
From: Jarek Poplawski @ 2008-01-16 8:52 UTC (permalink / raw)
To: Patrick McHardy; +Cc: Badalian Vyacheslav, netdev
In-Reply-To: <478DBADB.1080200@trash.net>
On Wed, Jan 16, 2008 at 09:05:47AM +0100, Patrick McHardy wrote:
...
> Yes, that's one possibility (without the dummy device though please).
> But I wonder what this would actually be useful for. I don't think
> replacing the root qdisc by a different type is a common scenario,
> for the same type you can simply use "tc qdisc change", "tc class
> change" and "tc class replace".
>
> Badalian, what are you actually doing?
I'm not sure Vyacheslav needs just this, but I've thought about the
possibility to recreate the 'shadow' copy of currently used qdisc
tree (with some updates of course) while it's running. So, the
possibility of using all the same handles and classids, and even
dev names if possible, and doing such a switch without any visible
break.
Jarek P.
^ permalink raw reply
* Re: Packetlost when "tc qdisc del dev eth0 root"
From: Badalian Vyacheslav @ 2008-01-16 8:54 UTC (permalink / raw)
To: Jarek Poplawski; +Cc: Patrick McHardy, netdev
In-Reply-To: <20080116085254.GB2307@ff.dom.local>
Jarek Poplawski wrote:
> On Wed, Jan 16, 2008 at 09:05:47AM +0100, Patrick McHardy wrote:
> ...
>
>> Yes, that's one possibility (without the dummy device though please).
>> But I wonder what this would actually be useful for. I don't think
>> replacing the root qdisc by a different type is a common scenario,
>> for the same type you can simply use "tc qdisc change", "tc class
>> change" and "tc class replace".
>>
>> Badalian, what are you actually doing?
>>
>
> I'm not sure Vyacheslav needs just this, but I've thought about the
> possibility to recreate the 'shadow' copy of currently used qdisc
> tree (with some updates of course) while it's running. So, the
> possibility of using all the same handles and classids, and even
> dev names if possible, and doing such a switch without any visible
> break.
>
> Jarek P.
>
>
I also think the system must forward all packets it receives unless
they are dropped deliberately (by iptables or the shaper).
Maybe someone needs to test deleting a big TREE: a simple delete, not a
recreate. Linux is unavailable for some time (if it is a really big
table this may be 10-20 sec on a 2xXeon).
I think a helper is needed to do these operations more accurately. Now
consider a Linux PPTP server handling 2000k connections: try deleting
the qdisc on eth0 (incoming from WAN to the PPTP clients)... I think many
sessions will drop.
Thanks!
^ permalink raw reply
* Re: [REGRESSION] 2.6.24-rc7: e1000: Detected Tx Unit Hang
From: Frans Pop @ 2008-01-16 8:56 UTC (permalink / raw)
To: David Miller; +Cc: jesse.brandeburg, slavon, netdev, linux-kernel
In-Reply-To: <20080115.210214.170759690.davem@davemloft.net>
On Wednesday 16 January 2008, David Miller wrote:
> Ok, here is the patch I'll propose to fix this. The goal is to make
> it as simple as possible without regressing the thing we were trying
> to fix.
Looks good to me. Tested with -rc8.
Cheers,
FJP
^ permalink raw reply
* Re: [REGRESSION] 2.6.24-rc7: e1000: Detected Tx Unit Hang
From: Badalian Vyacheslav @ 2008-01-16 9:02 UTC (permalink / raw)
To: David Miller; +Cc: elendil, netdev, linux-kernel
In-Reply-To: <20080114.215317.38045859.davem@davemloft.net>
Applied to 2.6.24-rc7-git2.
I have the messages.
I also see a regression after applying the patch.
Before the patch the system could do above 800mbs of traffic before
hitting "exit polling mode?" (4 CPUs: one CPU at 100% si (process
ksoftirqd/0), 3 CPUs idle).
After the patch the system goes to "exit polling mode" at above 600mbs.
Thanks.
> From: Frans Pop <elendil@planet.nl>
> Date: Tue, 15 Jan 2008 06:25:10 +0100
>
>
>> kernel: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
>>
>
> Does this make the problem go away?
>
> (Note this isn't the final correct patch we should apply. There
> is no reason why this revert back to the older ->poll() logic
> here should have any effect on the TX hang triggering...)
>
> diff --git a/drivers/net/e1000/e1000_main.c b/drivers/net/e1000/e1000_main.c
> index 13d57b0..cada32c 100644
> --- a/drivers/net/e1000/e1000_main.c
> +++ b/drivers/net/e1000/e1000_main.c
> @@ -3919,7 +3919,7 @@ e1000_clean(struct napi_struct *napi, int budget)
> {
> struct e1000_adapter *adapter = container_of(napi, struct e1000_adapter, napi);
> struct net_device *poll_dev = adapter->netdev;
> - int work_done = 0;
> + int tx_work = 0, work_done = 0;
>
> /* Must NOT use netdev_priv macro here. */
> adapter = poll_dev->priv;
> @@ -3929,8 +3929,8 @@ e1000_clean(struct napi_struct *napi, int budget)
> * simultaneously. A failure obtaining the lock means
> * tx_ring[0] is currently being cleaned anyway. */
> if (spin_trylock(&adapter->tx_queue_lock)) {
> - e1000_clean_tx_irq(adapter,
> - &adapter->tx_ring[0]);
> + tx_work = e1000_clean_tx_irq(adapter,
> + &adapter->tx_ring[0]);
> spin_unlock(&adapter->tx_queue_lock);
> }
>
> @@ -3938,7 +3938,7 @@ e1000_clean(struct napi_struct *napi, int budget)
> &work_done, budget);
>
> /* If budget not fully consumed, exit the polling mode */
> - if (work_done < budget) {
> + if (!tx_work && (work_done < budget)) {
> if (likely(adapter->itr_setting & 3))
> e1000_set_itr(adapter);
> netif_rx_complete(poll_dev, napi);
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
>
^ permalink raw reply
* Re: Packetlost when "tc qdisc del dev eth0 root"
From: Jarek Poplawski @ 2008-01-16 9:42 UTC (permalink / raw)
To: Badalian Vyacheslav; +Cc: Patrick McHardy, netdev
In-Reply-To: <478DC1D7.90401@bigtelecom.ru>
On Wed, Jan 16, 2008 at 11:35:35AM +0300, Badalian Vyacheslav wrote:
...
> I simply recreate all the rules. I moved away from doing many
> add/change/delete operations because I got many kernel panics on many
> 2.6.x kernels. First I had panics on the "delete filter" operation; that
> was fixed, great. Then I had panics on the "delete htb" operation, and I
> waited a long time for a fix (I think the tc maintainer did not have
> time for it), with 1-5 panics a day. I think my clients hate me =) So I
> rewrote the script to simply recreate all the rules. With this method I
> get a small packet loss once an hour (when the root qdisc is recreated),
> but no panics. Now I use the new script logic =)
> Maybe I need to go back to the more accurate script logic, but I am
> afraid of kernel panics =)
BTW, I don't know about others, but it seems bugzilla #9632 is waiting
for your testing to find the cause of this bug... IMHO, if you can't
try this now, it's better to close it and/or at least add a
comment. And if there are any other unfixed bugs which are
reproducible and for which you can test (or have tested) some patches,
please resend them as new threads to the list. Alas, avoiding the
panics can't help in removing these bugs.
Jarek P.
^ permalink raw reply
* [PATCH] net: NEWEMAC: Remove "rgmii-interface" from rgmii matching table
From: Stefan Roese @ 2008-01-16 9:37 UTC (permalink / raw)
To: linuxppc-dev, netdev; +Cc: benh
With the removal of the "rgmii-interface" device_type property from the
dts files, the newemac driver needs an update to only rely on compatible
property.
Signed-off-by: Stefan Roese <sr@denx.de>
---
drivers/net/ibm_newemac/rgmii.c | 1 -
1 files changed, 0 insertions(+), 1 deletions(-)
diff --git a/drivers/net/ibm_newemac/rgmii.c b/drivers/net/ibm_newemac/rgmii.c
index 9bc1132..5757788 100644
--- a/drivers/net/ibm_newemac/rgmii.c
+++ b/drivers/net/ibm_newemac/rgmii.c
@@ -302,7 +302,6 @@ static int __devexit rgmii_remove(struct of_device *ofdev)
static struct of_device_id rgmii_match[] =
{
{
- .type = "rgmii-interface",
.compatible = "ibm,rgmii",
},
{
--
1.5.4.rc3
^ permalink raw reply related
* Re: [PATCH] net: NEWEMAC: Remove "rgmii-interface" from rgmii matching table
From: David Gibson @ 2008-01-16 9:39 UTC (permalink / raw)
To: Stefan Roese; +Cc: linuxppc-dev, netdev
In-Reply-To: <1200476230-14026-1-git-send-email-sr@denx.de>
On Wed, Jan 16, 2008 at 10:37:10AM +0100, Stefan Roese wrote:
> With the removal of the "rgmii-interface" device_type property from the
> dts files, the newemac driver needs an update to only rely on compatible
> property.
In fact, this patch should go in before the one changing the dts
files.
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
^ permalink raw reply
* Re: SO_RCVBUF doesn't change receiver advertised window
From: Bill Fink @ 2008-01-16 9:50 UTC (permalink / raw)
To: Ritesh Kumar; +Cc: netdev
In-Reply-To: <f47983b00801151236l3b06f6dci35898fee71f6a942@mail.gmail.com>
On Tue, 15 Jan 2008, Ritesh Kumar wrote:
> Hi,
> I am using linux 2.6.20 and am trying to limit the receiver window
> size for a TCP connection. However, it seems that auto tuning is not
> turning itself off even after I use the syscall
>
> rwin=65536
> setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &rwin, sizeof(rwin));
>
> and verify using
>
> getsockopt(sock, SOL_SOCKET, SO_RCVBUF, &rwin, &rwin_size);
>
> that RCVBUF indeed is getting set (the value returned from getsockopt
> is double that, 131072).
Linux doubles what you requested, and then uses (by default) 1/4
of the socket space for overhead, so you effectively get 1.5 times
what you requested as an actual advertised receiver window, which
means since you specified 64 KB, you actually get 96 KB.
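The arithmetic in the paragraph above can be written down as a small helper. This is only a sketch of the sizing as described in this message (double the request, then keep 1/4 of the result for overhead); expected_rcv_window() is a made-up name for illustration, not a kernel or libc function.

```c
#include <assert.h>

/*
 * Rough model of the sizing described above: the kernel doubles the
 * SO_RCVBUF request, then (by default) keeps 1/4 of that for overhead.
 * expected_rcv_window() is a hypothetical helper, not a real API.
 */
static long expected_rcv_window(long requested)
{
    long buf = 2 * requested;   /* the doubled value getsockopt() reports */
    long overhead = buf / 4;    /* default share kept for overhead */

    assert(requested >= 0);
    return buf - overhead;      /* i.e. 1.5 * requested */
}
```

For the 65536-byte request in the original report this gives 98304 bytes, which is 96 KB.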
> The above calls are made before connect() on the client side and
> before bind(), accept() on the server side. Bulk data is being sent
> from the client to the server. The client and the server machines also
> have tcp_moderate_rcvbuf set to 0 (though I don't think that's really
> needed; setting a value to SO_RCVBUF should automatically turnoff auto
> tuning.).
>
> However the tcp trace shows the SYN, SYN/ACK and the first few packets as:
> 14:34:18.831703 IP 192.168.1.153.45038 > 192.168.2.204.9999: S
> 3947298186:3947298186(0) win 5840 <mss 1460,sackOK,timestamp 2842625
> 0,nop,wscale 5>
> 14:34:18.836000 IP 192.168.2.204.9999 > 192.168.1.153.45038: S
> 3955381015:3955381015(0) ack 3947298187 win 5792 <mss
> 1460,sackOK,timestamp 2843649 2842625,nop,wscale 2>
> 14:34:18.837654 IP 192.168.1.153.45038 > 192.168.2.204.9999: . ack 1
> win 183 <nop,nop,timestamp 2842634 2843649>
> 14:34:18.837849 IP 192.168.1.153.45038 > 192.168.2.204.9999: .
> 1:1449(1448) ack 1 win 183 <nop,nop,timestamp 2842634 2843649>
> 14:34:18.837851 IP 192.168.1.153.45038 > 192.168.2.204.9999: P
> 1449:1461(12) ack 1 win 183 <nop,nop,timestamp 2842634 2843649>
> 14:34:18.839001 IP 192.168.2.204.9999 > 192.168.1.153.45038: . ack
> 1449 win 2172 <nop,nop,timestamp 2843652 2842634>
> 14:34:18.839011 IP 192.168.2.204.9999 > 192.168.1.153.45038: . ack
> 1461 win 2172 <nop,nop,timestamp 2843652 2842634>
> 14:34:18.840875 IP 192.168.1.153.45038 > 192.168.2.204.9999: .
> 1461:2909(1448) ack 1 win 183 <nop,nop,timestamp 2842637 2843652>
> 14:34:18.840997 IP 192.168.1.153.45038 > 192.168.2.204.9999: .
> 2909:4357(1448) ack 1 win 183 <nop,nop,timestamp 2842637 2843652>
> 14:34:18.841120 IP 192.168.1.153.45038 > 192.168.2.204.9999: .
> 4357:5805(1448) ack 1 win 183 <nop,nop,timestamp 2842637 2843652>
> 14:34:18.841244 IP 192.168.1.153.45038 > 192.168.2.204.9999: .
> 5805:7253(1448) ack 1 win 183 <nop,nop,timestamp 2842637 2843652>
> 14:34:18.841388 IP 192.168.2.204.9999 > 192.168.1.153.45038: . ack
> 2909 win 2896 <nop,nop,timestamp 2843655 2842637>
> 14:34:18.841399 IP 192.168.2.204.9999 > 192.168.1.153.45038: . ack
> 4357 win 3620 <nop,nop,timestamp 2843655 2842637>
> 14:34:18.841413 IP 192.168.2.204.9999 > 192.168.1.153.45038: . ack
> 5805 win 4344 <nop,nop,timestamp 2843655 2842637>
>
> As you can see, the syn and syn ack show rcv windows to be 5840 and
> 5792 and it automatically increases for the receiver to values 2172
> till 4344 and more in the later part of the trace till 24214.
Since the window scale was 2, the final advertised receiver window
you indicate of 24214 gives 2^2*24214 or right around 96 KB, which
is what is expected given the way Linux works.
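As a small sketch of that calculation: the window field seen in the trace has to be shifted left by the window scale the receiver announced in its SYN/ACK (wscale 2 here). scaled_window() is a hypothetical helper name used only for this example.

```c
#include <assert.h>

/* Window field from the TCP header, shifted by the negotiated scale. */
static unsigned long scaled_window(unsigned int raw_win, unsigned int wscale)
{
    assert(wscale <= 14);       /* the window scale option caps the shift at 14 */
    return (unsigned long)raw_win << wscale;
}
```

With raw_win = 24214 and wscale = 2 this yields 96856 bytes, slightly under the 98304 bytes (96 KB) expected from the buffer sizing.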
-Bill
> The values for the tcp sysctl variables are given below:
> /proc/sys/net/ipv4/tcp_moderate_rcvbuf 0
> /proc/sys/net/ipv4/tcp_mem 32768 43690 65536
> /proc/sys/net/ipv4/tcp_rmem 4096 87380 1398080
> /proc/sys/net/ipv4/tcp_wmem 4096 16384 1398080
> /proc/sys/net/core/rmem_max 131071
> /proc/sys/net/core/wmem_max 131071
> /proc/sys/net/core/wmem_default 109568
> /proc/sys/net/core/rmem_default 109568
>
> I will really appreciate your help,
>
> Ritesh
^ permalink raw reply
* Re: [PATCH] net: NEWEMAC: Remove "rgmii-interface" from rgmii matching table
From: Benjamin Herrenschmidt @ 2008-01-16 9:53 UTC (permalink / raw)
To: Stefan Roese; +Cc: linuxppc-dev, netdev
In-Reply-To: <1200476230-14026-1-git-send-email-sr@denx.de>
On Wed, 2008-01-16 at 10:37 +0100, Stefan Roese wrote:
> With the removal of the "rgmii-interface" device_type property from the
> dts files, the newemac driver needs an update to only rely on compatible
> property.
>
> Signed-off-by: Stefan Roese <sr@denx.de>
I need to test if it works on CAB, can't change the DT on those. I'll
let you know tomorrow.
> ---
> drivers/net/ibm_newemac/rgmii.c | 1 -
> 1 files changed, 0 insertions(+), 1 deletions(-)
>
> diff --git a/drivers/net/ibm_newemac/rgmii.c b/drivers/net/ibm_newemac/rgmii.c
> index 9bc1132..5757788 100644
> --- a/drivers/net/ibm_newemac/rgmii.c
> +++ b/drivers/net/ibm_newemac/rgmii.c
> @@ -302,7 +302,6 @@ static int __devexit rgmii_remove(struct of_device *ofdev)
> static struct of_device_id rgmii_match[] =
> {
> {
> - .type = "rgmii-interface",
> .compatible = "ibm,rgmii",
> },
> {
^ permalink raw reply
* [PATCH] ICMP: ICMP_MIB_OUTMSGS increment duplicated
From: Wang Chen @ 2008-01-16 9:59 UTC (permalink / raw)
To: David S. Miller, David L Stevens; +Cc: netdev
In tree net-2.6.25, commit "96793b482540f3a26e2188eaf75cb56b7829d3e3"
made a mistake.
In that patch, David L added an icmp_out_count() call to
ip_push_pending_frames() and removed icmp_out_count() from icmp_reply(),
but he forgot to remove icmp_out_count() from icmp_send() too.
Since icmp_send and icmp_reply both call icmp_push_reply, which calls
ip_push_pending_frames, the increment is duplicated in icmp_send.
This patch removes the icmp_out_count from icmp_send too.
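The call chain described above can be modelled in a few lines of userspace C. This is a toy model only: the model_* functions and the out_msgs counter are hypothetical stand-ins that mirror the names in the description, not the kernel functions themselves.

```c
#include <assert.h>

static unsigned long out_msgs;  /* stands in for ICMP_MIB_OUTMSGS */

static void model_icmp_out_count(void)
{
    out_msgs++;
}

/* After the earlier commit, the counting happens here. */
static void model_ip_push_pending_frames(void)
{
    model_icmp_out_count();
}

static void model_icmp_push_reply(void)
{
    model_ip_push_pending_frames();
}

static void model_icmp_reply(void)
{
    model_icmp_push_reply();
}

/* count_in_send = 1 models the leftover call that this patch removes. */
static void model_icmp_send(int count_in_send)
{
    assert(count_in_send == 0 || count_in_send == 1);
    if (count_in_send)
        model_icmp_out_count();
    model_icmp_push_reply();
}
```

In the model, one model_icmp_send(1) call bumps the counter twice, while model_icmp_send(0) and model_icmp_reply() each bump it once.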
Signed-off-by: Wang Chen <wangchen@cn.fujitsu.com>
---
diff -Nurp linux-2.6.24.rc8.org/net/ipv4/icmp.c linux-2.6.24.rc8/net/ipv4/icmp.c
--- linux-2.6.24.rc8.org/net/ipv4/icmp.c 2008-01-16 17:45:02.000000000 +0800
+++ linux-2.6.24.rc8/net/ipv4/icmp.c 2008-01-16 17:52:13.000000000 +0800
@@ -540,7 +540,6 @@ void icmp_send(struct sk_buff *skb_in, i
icmp_param.data.icmph.checksum = 0;
icmp_param.skb = skb_in;
icmp_param.offset = skb_network_offset(skb_in);
- icmp_out_count(icmp_param.data.icmph.type);
inet_sk(icmp_socket->sk)->tos = tos;
ipc.addr = iph->saddr;
ipc.opt = &icmp_param.replyopts;
--
WCN
^ permalink raw reply
* Re: [REGRESSION] 2.6.24-rc7: e1000: Detected Tx Unit Hang
From: David Miller @ 2008-01-16 10:29 UTC (permalink / raw)
To: elendil; +Cc: jesse.brandeburg, slavon, netdev, linux-kernel
In-Reply-To: <200801160956.09053.elendil@planet.nl>
From: Frans Pop <elendil@planet.nl>
Date: Wed, 16 Jan 2008 09:56:08 +0100
> On Wednesday 16 January 2008, David Miller wrote:
> > Ok, here is the patch I'll propose to fix this. The goal is to make
> > it as simple as possible without regressing the thing we were trying
> > to fix.
>
> Looks good to me. Tested with -rc8.
Thanks for testing.
^ permalink raw reply
* [PATCH 1/2] [IPV4] fib_hash: fix duplicated route issue
From: Joonwoo Park @ 2008-01-16 11:13 UTC (permalink / raw)
To: David Miller
Cc: Joonwoo Park, linux-netdev, Stephen Hemminger, Alexander Zubkov
http://bugzilla.kernel.org/show_bug.cgi?id=9493
The fib allows making identical routes with 'ip route replace'.
This patch makes the fib return -EEXIST if replacement would cause duplication.
Signed-off-by: Joonwoo Park <joonwpark81@gmail.com>
---
net/ipv4/fib_hash.c | 3 +++
1 files changed, 3 insertions(+), 0 deletions(-)
diff --git a/net/ipv4/fib_hash.c b/net/ipv4/fib_hash.c
index 527a6e0..99071d7 100644
--- a/net/ipv4/fib_hash.c
+++ b/net/ipv4/fib_hash.c
@@ -444,6 +444,9 @@ static int fn_hash_insert(struct fib_table *tb, struct fib_config *cfg)
struct fib_info *fi_drop;
u8 state;
+ if (fi->fib_treeref > 1)
+ goto out;
+
write_lock_bh(&fib_hash_lock);
fi_drop = fa->fa_info;
fa->fa_info = fi;
--
1.5.3.rc5
^ permalink raw reply related
* [PATCH 2/2] [IPV4] fib_trie: fix duplicated route issue
From: Joonwoo Park @ 2008-01-16 11:13 UTC (permalink / raw)
To: David Miller
Cc: Joonwoo Park, linux-netdev, Stephen Hemminger, Alexander Zubkov
http://bugzilla.kernel.org/show_bug.cgi?id=9493
The fib allows making identical routes with 'ip route replace'.
This patch makes the fib return -EEXIST if replacement would cause duplication.
Signed-off-by: Joonwoo Park <joonwpark81@gmail.com>
---
net/ipv4/fib_trie.c | 3 +++
1 files changed, 3 insertions(+), 0 deletions(-)
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 8d8c291..1010b46 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -1214,6 +1214,9 @@ static int fn_trie_insert(struct fib_table *tb, struct fib_config *cfg)
struct fib_info *fi_drop;
u8 state;
+ if (fi->fib_treeref > 1)
+ goto out;
+
err = -ENOBUFS;
new_fa = kmem_cache_alloc(fn_alias_kmem, GFP_KERNEL);
if (new_fa == NULL)
--
1.5.3.rc5
^ permalink raw reply related
* RE: [PATCH 0/3] UCC TDM driver for MPC83xx platforms
From: Kalra Ashish @ 2008-01-16 11:47 UTC (permalink / raw)
To: Kumar Gala, Andrew Morton
Cc: Phillips Kim, Aggrwal Poonam, sfr, rubini, linux-ppcdev, netdev,
linux-kernel, Barkowski Michael, Cutler Richard, Kalra Ashish,
Suresh PV
In-Reply-To: <5927F7A4-D42A-40F6-AE9C-EDA34738A752@kernel.crashing.org>
Hello All,
I am sure that the TDM bus driver model/framework would make us put in a
lot more programming effort without any assurance of the code being
accepted by the Linux community, especially as there are many
Telephony/VoIP stack implementations in Linux, such as the Sangoma
WANPIPE Kernel suite, which have their own Zaptel TDM (channelized
zero-copy) interface layer. There are other High Speed Serial (HSS) API
interfaces, again supporting channelized and/or prioritized API
interfaces. All these implementations are proprietary and have their own
tightly coupled upper layers and hardware abstraction layers. It is
difficult to predict that these stacks will move towards a generic TDM
bus driver interface. Therefore, I think we can have our own tightly
coupled interface with our VoIP framework and keep the driver as it is,
i.e., as a misc driver.
Ashish
-----Original Message-----
From: Kumar Gala [mailto:galak@kernel.crashing.org]
Sent: Tuesday, January 15, 2008 9:01 AM
To: Andrew Morton
Cc: Phillips Kim; Aggrwal Poonam; sfr@canb.auug.org.au;
rubini@vision.unipv.it; linux-ppcdev@ozlabs.kernel.org;
netdev@vger.kernel.org; linux-kernel@vger.kernel.org; Barkowski Michael;
Kalra Ashish; Cutler Richard
Subject: Re: [PATCH 0/3] UCC TDM driver for MPC83xx platforms
On Jan 14, 2008, at 3:15 PM, Andrew Morton wrote:
> On Mon, 14 Jan 2008 12:00:51 -0600
> Kim Phillips <kim.phillips@freescale.com> wrote:
>
>> On Thu, 10 Jan 2008 21:41:20 -0700
>> "Aggrwal Poonam" <Poonam.Aggrwal@freescale.com> wrote:
>>
>>> Hello All
>>>
>>> I am waiting for more feedback on the patches.
>>>
>>> If there are no objections please consider them for 2.6.25.
>>>
>> if this isn't going to go through Alessandro Rubini/misc drivers, can
>> it go through the akpm/mm tree?
>>
>
> That would work. But it might be more appropriate to go
> Kumar->paulus->Linus.
I'm ok w/taking the arch/powerpc bits, but I'm a bit concerned about
the driver itself. I'm wondering if we need a TDM framework in the
kernel.
I guess if Poonam could possibly describe how this driver is actually
used that would be helpful. I see we have 8315 with a discrete TDM
block and I'm guessing 82xx/85xx based CPM parts have some form of TDM
as well.
- k
^ permalink raw reply
* Re: [PATCH] ICMP: ICMP_MIB_OUTMSGS increment duplicated
From: Herbert Xu @ 2008-01-16 11:49 UTC (permalink / raw)
To: Wang Chen; +Cc: davem, dlstevens, netdev
In-Reply-To: <478DD57B.6020503@cn.fujitsu.com>
Wang Chen <wangchen@cn.fujitsu.com> wrote:
> In tree net-2.6.25, commit "96793b482540f3a26e2188eaf75cb56b7829d3e3"
> made a mistake.
>
> In that patch, David L added an icmp_out_count() call to
> ip_push_pending_frames() and removed icmp_out_count() from icmp_reply(),
> but he forgot to remove icmp_out_count() from icmp_send() too.
> Since icmp_send and icmp_reply both call icmp_push_reply, which calls
> ip_push_pending_frames, the increment is duplicated in icmp_send.
Actually having the icmp_out_count call in ip_push_pending_frames seems
inconsistent. Having it there means that we count raw socket ICMP packets
too. But we don't do that for any other protocol, e.g., raw UDP packets
don't get counted.
So was the inclusion of raw ICMP packets intentional?
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply
* [PATCH] TUN/TAP GSO/partial csum support
From: Rusty Russell @ 2008-01-16 12:06 UTC (permalink / raw)
To: netdev; +Cc: Herbert Xu, Max Krasnyansky
[-- Attachment #1: Type: text/plain, Size: 9879 bytes --]
OK, revised with help from Herbert. Also, I have attached a test program and
a script to run it (it short-circuits two tun devices, so you can run it with
the patch applied and see big packets flowing).
This implements partial checksum and GSO support for tun/tap.
We use the virtio_net_hdr: it is an ABI already and designed to
encapsulate such metadata as GSO and partial checksums.
lguest performance (160MB sendfile, worst/best/avg, 20 runs):
Before: 5.06/3.39/3.82
After: 4.69/0.84/2.84
Note that there is no easy way to detect if GSO is supported: see next
patch.
Questions:
1) Should we rename/move virtio_net_hdr to something more generic?
2) Is this the right way to build a paged skb from user pages?
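On question 2, the page-count arithmetic used by get_user_skb_frags() in the patch below can be checked in isolation. The sketch copies only that one expression; TOY_PAGE_SIZE (assumed to be 4096) and pages_spanned() are names made up for the example.

```c
#include <assert.h>

#define TOY_PAGE_SIZE 4096UL    /* assumed page size for the example */

/* Pages touched by the byte range [base, base + len), for len > 0. */
static unsigned long pages_spanned(unsigned long base, unsigned long len)
{
    assert(len > 0);
    return 1 + (base + len - 1) / TOY_PAGE_SIZE - base / TOY_PAGE_SIZE;
}
```

A 2-byte buffer starting at offset 4095 straddles a page boundary and needs two pages, while a page-aligned 4096-byte buffer needs exactly one.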
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
---
drivers/net/tun.c | 250 +++++++++++++++++++++++++++++++++++++++++++------
include/linux/if_tun.h | 2
2 files changed, 225 insertions(+), 27 deletions(-)
diff -r ba3c0eb8741a drivers/net/tun.c
--- a/drivers/net/tun.c Wed Jan 16 17:35:25 2008 +1100
+++ b/drivers/net/tun.c Wed Jan 16 22:11:11 2008 +1100
@@ -62,6 +62,7 @@
#include <linux/if_ether.h>
#include <linux/if_tun.h>
#include <linux/crc32.h>
+#include <linux/virtio_net.h>
#include <net/net_namespace.h>
#include <asm/system.h>
@@ -238,35 +239,189 @@ static unsigned int tun_chr_poll(struct
return mask;
}
+static struct sk_buff *copy_user_skb(size_t align, struct iovec *iv, size_t len)
+{
+ struct sk_buff *skb;
+
+ if (!(skb = alloc_skb(len + align, GFP_KERNEL)))
+ return ERR_PTR(-ENOMEM);
+
+ if (align)
+ skb_reserve(skb, align);
+
+ if (memcpy_fromiovec(skb_put(skb, len), iv, len)) {
+ kfree_skb(skb);
+ return ERR_PTR(-EFAULT);
+ }
+ return skb;
+}
+
+/* This will fail if they give us a crazy iovec, but that's their own fault. */
+static int get_user_skb_frags(const struct iovec *iv, size_t count,
+ struct skb_frag_struct *f)
+{
+ unsigned int i, j, num_pg = 0;
+ int err;
+ struct page *pages[MAX_SKB_FRAGS];
+
+ down_read(&current->mm->mmap_sem);
+ for (i = 0; i < count; i++) {
+ int n, npages;
+ unsigned long base, len;
+ base = (unsigned long)iv[i].iov_base;
+ len = (unsigned long)iv[i].iov_len;
+
+ if (len == 0)
+ continue;
+
+ /* How many pages will this take? */
+ npages = 1 + (base + len - 1)/PAGE_SIZE - base/PAGE_SIZE;
+ if (unlikely(num_pg + npages > MAX_SKB_FRAGS)) {
+ err = -ENOSPC;
+ goto fail;
+ }
+ n = get_user_pages(current, current->mm, base, npages,
+ 0, 0, pages, NULL);
+ if (unlikely(n < 0)) {
+ err = n;
+ goto fail;
+ }
+
+ /* Transfer pages to the frag array */
+ for (j = 0; j < n; j++) {
+ f[num_pg].page = pages[j];
+ if (j == 0) {
+ f[num_pg].page_offset = offset_in_page(base);
+ f[num_pg].size = min(len, PAGE_SIZE -
+ f[num_pg].page_offset);
+ } else {
+ f[num_pg].page_offset = 0;
+ f[num_pg].size = min(len, PAGE_SIZE);
+ }
+ len -= f[num_pg].size;
+ base += f[num_pg].size;
+ num_pg++;
+ }
+
+ if (unlikely(n != npages)) {
+ err = -EFAULT;
+ goto fail;
+ }
+ }
+ up_read(&current->mm->mmap_sem);
+ return num_pg;
+
+fail:
+ for (i = 0; i < num_pg; i++)
+ put_page(f[i].page);
+ up_read(&current->mm->mmap_sem);
+ return err;
+}
+
+
+static struct sk_buff *map_user_skb(const struct virtio_net_hdr *gso,
+ size_t align, struct iovec *iv,
+ size_t count, size_t len)
+{
+ struct sk_buff *skb;
+ struct skb_shared_info *sinfo;
+ int err;
+
+ if (!(skb = alloc_skb(gso->gso_hdr_len + align, GFP_KERNEL)))
+ return ERR_PTR(-ENOMEM);
+
+ if (align)
+ skb_reserve(skb, align);
+
+ sinfo = skb_shinfo(skb);
+ sinfo->gso_size = gso->gso_size;
+ sinfo->gso_type = SKB_GSO_DODGY;
+ switch (gso->gso_type) {
+ case VIRTIO_NET_HDR_GSO_TCPV4_ECN:
+ sinfo->gso_type |= SKB_GSO_TCP_ECN;
+ /* fall through */
+ case VIRTIO_NET_HDR_GSO_TCPV4:
+ sinfo->gso_type |= SKB_GSO_TCPV4;
+ break;
+ case VIRTIO_NET_HDR_GSO_TCPV6:
+ sinfo->gso_type |= SKB_GSO_TCPV6;
+ break;
+ case VIRTIO_NET_HDR_GSO_UDP:
+ sinfo->gso_type |= SKB_GSO_UDP;
+ break;
+ default:
+ err = -EINVAL;
+ goto fail;
+ }
+
+ /* Copy in the header. */
+ if (memcpy_fromiovec(skb_put(skb, gso->gso_hdr_len), iv,
+ gso->gso_hdr_len)) {
+ err = -EFAULT;
+ goto fail;
+ }
+
+ err = get_user_skb_frags(iv, count, sinfo->frags);
+ if (err < 0)
+ goto fail;
+
+ sinfo->nr_frags = err;
+ skb->len += len;
+ skb->data_len += len;
+
+ return skb;
+
+fail:
+ kfree_skb(skb);
+ return ERR_PTR(err);
+}
+
+static inline size_t iov_total(const struct iovec *iv, unsigned long count)
+{
+ unsigned long i;
+ size_t len;
+
+ for (i = 0, len = 0; i < count; i++)
+ len += iv[i].iov_len;
+
+ return len;
+}
+
/* Get packet from user space buffer */
-static __inline__ ssize_t tun_get_user(struct tun_struct *tun, struct iovec *iv, size_t count)
+static __inline__ ssize_t tun_get_user(struct tun_struct *tun, struct iovec *iv, size_t num)
{
struct tun_pi pi = { 0, __constant_htons(ETH_P_IP) };
+ struct virtio_net_hdr gso = { 0, VIRTIO_NET_HDR_GSO_NONE };
struct sk_buff *skb;
- size_t len = count, align = 0;
+ size_t tot_len = iov_total(iv, num);
+ size_t len = tot_len, align = 0;
if (!(tun->flags & TUN_NO_PI)) {
- if ((len -= sizeof(pi)) > count)
+ if ((len -= sizeof(pi)) > tot_len)
return -EINVAL;
if(memcpy_fromiovec((void *)&pi, iv, sizeof(pi)))
+ return -EFAULT;
+ }
+ if (tun->flags & TUN_GSO_HDR) {
+ if ((len -= sizeof(gso)) > tot_len)
+ return -EINVAL;
+
+ if (memcpy_fromiovec((void *)&gso, iv, sizeof(gso)))
return -EFAULT;
}
if ((tun->flags & TUN_TYPE_MASK) == TUN_TAP_DEV)
align = NET_IP_ALIGN;
- if (!(skb = alloc_skb(len + align, GFP_KERNEL))) {
+ if (gso.gso_type != VIRTIO_NET_HDR_GSO_NONE)
+ skb = map_user_skb(&gso, align, iv, num, len);
+ else
+ skb = copy_user_skb(align, iv, len);
+
+ if (IS_ERR(skb)) {
tun->dev->stats.rx_dropped++;
- return -ENOMEM;
- }
-
- if (align)
- skb_reserve(skb, align);
- if (memcpy_fromiovec(skb_put(skb, len), iv, len)) {
- tun->dev->stats.rx_dropped++;
- kfree_skb(skb);
- return -EFAULT;
+ return PTR_ERR(skb);
}
switch (tun->flags & TUN_TYPE_MASK) {
@@ -280,7 +435,13 @@ static __inline__ ssize_t tun_get_user(s
break;
};
- if (tun->flags & TUN_NOCHECKSUM)
+ if (gso.flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) {
+ if (!skb_partial_csum_set(skb,gso.csum_start,gso.csum_offset)) {
+ tun->dev->stats.rx_dropped++;
+ kfree_skb(skb);
+ return -EINVAL;
+ }
+ } else if (tun->flags & TUN_NOCHECKSUM)
skb->ip_summed = CHECKSUM_UNNECESSARY;
netif_rx_ni(skb);
@@ -289,18 +450,7 @@ static __inline__ ssize_t tun_get_user(s
tun->dev->stats.rx_packets++;
tun->dev->stats.rx_bytes += len;
- return count;
-}
-
-static inline size_t iov_total(const struct iovec *iv, unsigned long count)
-{
- unsigned long i;
- size_t len;
-
- for (i = 0, len = 0; i < count; i++)
- len += iv[i].iov_len;
-
- return len;
+ return tot_len;
}
static ssize_t tun_chr_aio_write(struct kiocb *iocb, const struct iovec *iv,
@@ -313,7 +463,7 @@ static ssize_t tun_chr_aio_write(struct
DBG(KERN_INFO "%s: tun_chr_write %ld\n", tun->dev->name, count);
- return tun_get_user(tun, (struct iovec *) iv, iov_total(iv, count));
+ return tun_get_user(tun, (struct iovec *) iv, count);
}
/* Put packet to the user space buffer */
@@ -336,6 +486,42 @@ static __inline__ ssize_t tun_put_user(s
if (memcpy_toiovec(iv, (void *) &pi, sizeof(pi)))
return -EFAULT;
total += sizeof(pi);
+ }
+ if (tun->flags & TUN_GSO_HDR) {
+ struct virtio_net_hdr gso;
+ struct skb_shared_info *sinfo = skb_shinfo(skb);
+
+ if (skb_is_gso(skb)) {
+ gso.gso_hdr_len = skb_transport_header(skb) - skb->data;
+ gso.gso_size = sinfo->gso_size;
+ if (sinfo->gso_type & SKB_GSO_TCP_ECN)
+ gso.gso_type = VIRTIO_NET_HDR_GSO_TCPV4_ECN;
+ else if (sinfo->gso_type & SKB_GSO_TCPV4)
+ gso.gso_type = VIRTIO_NET_HDR_GSO_TCPV4;
+ else if (sinfo->gso_type & SKB_GSO_TCPV6)
+ gso.gso_type = VIRTIO_NET_HDR_GSO_TCPV6;
+ else if (sinfo->gso_type & SKB_GSO_UDP)
+ gso.gso_type = VIRTIO_NET_HDR_GSO_UDP;
+ else
+ BUG();
+ } else
+ gso.gso_type = VIRTIO_NET_HDR_GSO_NONE;
+
+ if (skb->ip_summed == CHECKSUM_PARTIAL) {
+ gso.flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
+ gso.csum_start = skb->csum_start - skb_headroom(skb);
+ gso.csum_offset = skb->csum_offset;
+ } else {
+ gso.flags = 0;
+ gso.csum_offset = gso.csum_start = 0;
+ }
+
+ if ((len -= sizeof(gso)) < 0)
+ return -EINVAL;
+
+ if (memcpy_toiovec(iv, (void *)&gso, sizeof(gso)))
+ return -EFAULT;
+ total += sizeof(gso);
}
len = min_t(int, skb->len, len);
@@ -523,6 +709,13 @@ static int tun_set_iff(struct file *file
tun_net_init(dev);
+ /* GSO? One of everything, please. */
+ if (ifr->ifr_flags & IFF_GSO_HDR)
+ dev->features = (NETIF_F_SG | NETIF_F_HW_CSUM
+ | NETIF_F_HIGHDMA | NETIF_F_FRAGLIST
+ | NETIF_F_TSO | NETIF_F_UFO
+ | NETIF_F_TSO_ECN | NETIF_F_TSO6);
+
if (strchr(dev->name, '%')) {
err = dev_alloc_name(dev, dev->name);
if (err < 0)
@@ -543,6 +736,9 @@ static int tun_set_iff(struct file *file
if (ifr->ifr_flags & IFF_ONE_QUEUE)
tun->flags |= TUN_ONE_QUEUE;
+
+ if (ifr->ifr_flags & IFF_GSO_HDR)
+ tun->flags |= TUN_GSO_HDR;
file->private_data = tun;
tun->attached = 1;
diff -r ba3c0eb8741a include/linux/if_tun.h
--- a/include/linux/if_tun.h Wed Jan 16 17:35:25 2008 +1100
+++ b/include/linux/if_tun.h Wed Jan 16 22:11:11 2008 +1100
@@ -70,6 +70,7 @@ struct tun_struct {
#define TUN_NO_PI 0x0040
#define TUN_ONE_QUEUE 0x0080
#define TUN_PERSIST 0x0100
+#define TUN_GSO_HDR 0x0200
/* Ioctl defines */
#define TUNSETNOCSUM _IOW('T', 200, int)
@@ -79,6 +80,7 @@ struct tun_struct {
#define IFF_TAP 0x0002
#define IFF_NO_PI 0x1000
#define IFF_ONE_QUEUE 0x2000
+#define IFF_GSO_HDR 0x4000
struct tun_pi {
unsigned short flags;
[-- Attachment #2: tun_gso_pipe.c --]
[-- Type: text/x-csrc, Size: 8976 bytes --]
#include <signal.h>
#include <stddef.h>
#include <errno.h>
#include <sys/socket.h>
#include <sys/ioctl.h>
#include <netinet/ip.h>
#include <netinet/ip_icmp.h>
#include <netinet/udp.h>
#include <netinet/tcp.h>
#include <net/if.h>
#include <net/ethernet.h>
#include <stdio.h>
#include <string.h>
#include <err.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdlib.h>
#include <sys/uio.h>
#include <linux/sockios.h>
#include <linux/if_tun.h>
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>
typedef uint32_t u32;
typedef uint16_t u16;
typedef uint8_t u8;
#ifndef TUNGETFEATURES
#define TUNGETFEATURES _IOR('T', 207, unsigned int)
#endif
#ifndef IFF_GSO_HDR
#define IFF_GSO_HDR 0x4000
#endif
static bool use_gso = true;
static bool write_all(int fd, const void *data, unsigned long size)
{
while (size) {
int done;
done = write(fd, data, size);
if (done < 0 && errno == EINTR)
continue;
if (done <= 0)
return false;
data += done;
size -= done;
}
return true;
}
static bool read_all(int fd, void *data, unsigned long size)
{
while (size) {
int done;
done = read(fd, data, size);
if (done < 0 && errno == EINTR)
continue;
if (done <= 0)
return false;
data += done;
size -= done;
}
return true;
}
static uint32_t str2ip(const char *ipaddr)
{
unsigned int byte[4];
sscanf(ipaddr, "%u.%u.%u.%u", &byte[0], &byte[1], &byte[2], &byte[3]);
return (byte[0] << 24) | (byte[1] << 16) | (byte[2] << 8) | byte[3];
}
static void configure_device(int fd, const char *devname, uint32_t ipaddr)
{
struct ifreq ifr;
struct sockaddr_in *sin = (struct sockaddr_in *)&ifr.ifr_addr;
/* Don't read these incantations. Just cut & paste them like I did! */
memset(&ifr, 0, sizeof(ifr));
strcpy(ifr.ifr_name, devname);
sin->sin_family = AF_INET;
sin->sin_addr.s_addr = htonl(ipaddr);
if (ioctl(fd, SIOCSIFADDR, &ifr) != 0)
err(1, "Setting %s interface address", devname);
ifr.ifr_flags = IFF_UP;
if (ioctl(fd, SIOCSIFFLAGS, &ifr) != 0)
err(1, "Bringing interface %s up", devname);
}
static int setup_tun_net(uint32_t ip)
{
struct ifreq ifr;
int netfd, ipfd;
unsigned int features;
/* We open the /dev/net/tun device and tell it we want a tap device. A
* tap device is like a tun device, only somehow different. To tell
* the truth, I completely blundered my way through this code, but it
* works now! */
netfd = open("/dev/net/tun", O_RDWR);
if (netfd < 0)
err(1, "Opening /dev/net/tun");
if (use_gso &&
(ioctl(netfd, TUNGETFEATURES, &features) != 0
|| !(features & IFF_GSO_HDR))) {
fprintf(stderr, "No GSO support!\n");
use_gso = false;
}
memset(&ifr, 0, sizeof(ifr));
ifr.ifr_flags = IFF_TAP | IFF_NO_PI | (use_gso ? IFF_GSO_HDR : 0);
strcpy(ifr.ifr_name, "tap%d");
if (ioctl(netfd, TUNSETIFF, &ifr) != 0)
err(1, "configuring /dev/net/tun");
/* We need a socket to perform the magic network ioctls to bring up the
* tap interface, connect to the bridge etc. Any socket will do! */
ipfd = socket(PF_INET, SOCK_DGRAM, IPPROTO_IP);
if (ipfd < 0)
err(1, "opening IP socket");
/* Assign the IP address to the new tap interface and bring it up. */
configure_device(ipfd, ifr.ifr_name, ip);
close(ipfd);
return netfd;
}
static void two_way_popen(char *const argv[])
{
int pid;
int pipe1[2], pipe2[2];
if (pipe(pipe1) != 0 || pipe(pipe2) != 0)
err(1, "creating pipe");
pid = fork();
if (pid == -1)
err(1, "forking");
if (pid == 0) {
/* We are the child. */
close(pipe1[1]);
close(pipe2[0]);
dup2(pipe1[0], STDIN_FILENO);
dup2(pipe2[1], STDOUT_FILENO);
execvp(argv[0], argv);
fprintf(stderr, "Failed to exec '%s': %m\n", argv[0]);
kill(getppid(), SIGKILL);
_exit(1); /* never fall through into the parent's code path */
}
/* We are parent. */
close(pipe1[0]);
close(pipe2[1]);
dup2(pipe1[1], STDOUT_FILENO);
dup2(pipe2[0], STDIN_FILENO);
}
struct virtio_net_hdr
{
#define VIRTIO_NET_HDR_F_NEEDS_CSUM 1 // Use csum_start, csum_offset
__u8 flags;
#define VIRTIO_NET_HDR_GSO_NONE 0 // Not a GSO frame
#define VIRTIO_NET_HDR_GSO_TCPV4 1 // GSO frame, IPv4 TCP (TSO)
/* FIXME: Do we need this? If they said they can handle ECN, do they care? */
#define VIRTIO_NET_HDR_GSO_TCPV4_ECN 2 // GSO frame, IPv4 TCP w/ ECN
#define VIRTIO_NET_HDR_GSO_UDP 3 // GSO frame, IPv4 UDP (UFO)
#define VIRTIO_NET_HDR_GSO_TCPV6 4 // GSO frame, IPv6 TCP
__u8 gso_type;
__u16 gso_hdr_len; /* Ethernet + IP + tcp/udp hdrs */
__u16 gso_size; /* Bytes to append to gso_hdr_len per frame */
__u16 csum_start; /* Position to start checksumming from */
__u16 csum_offset; /* Offset after that to place checksum */
};
struct packet
{
struct virtio_net_hdr gso;
struct ether_header mac;
struct iphdr ip;
union {
struct icmphdr icmp;
struct tcphdr tcp;
struct udphdr udp;
char pad[65535 - 34];
};
} __attribute__((packed));
static inline unsigned short from32to16(unsigned long x)
{
/* add up 16-bit and 16-bit for 16+c bit */
x = (x & 0xffff) + (x >> 16);
/* add up carry.. */
x = (x & 0xffff) + (x >> 16);
return x;
}
static unsigned int csum_fold(unsigned int sum)
{
return ~from32to16(sum);
}
static unsigned long do_csum(const unsigned char * buff, int len)
{
int odd, count;
unsigned long result = 0;
if (len <= 0)
return 0;
odd = 1 & (unsigned long) buff;
if (odd) {
result = *buff;
len--;
buff++;
}
count = len >> 1; /* nr of 16-bit words.. */
if (count) {
if (2 & (unsigned long) buff) {
result += *(unsigned short *) buff;
count--;
len -= 2;
buff += 2;
}
count >>= 1; /* nr of 32-bit words.. */
if (count) {
unsigned long carry = 0;
do {
unsigned int w = *(unsigned int *) buff;
count--;
buff += 4;
result += carry;
result += w;
carry = (w > result);
} while (count);
result += carry;
result = (result & 0xffff) + (result >> 16);
}
if (len & 2) {
result += *(unsigned short *) buff;
buff += 2;
}
}
if (len & 1)
result += (*buff << 8);
result = from32to16(result);
if (odd)
result = ((result >> 8) & 0xff) | ((result & 0xff) << 8);
return result;
}
static unsigned int csum_partial(const void * buff, int len, unsigned int sum)
{
unsigned int result = do_csum(buff, len);
/* add in old sum, and carry.. */
result += sum;
if (sum > result)
result += 1;
return result;
}
static void csum_replace(__u16 *sum, u32 from, u32 to)
{
u32 diff[] = { ~from, to };
*sum = csum_fold(csum_partial(diff, sizeof(diff), *sum ^ 0xFFFF));
}
#define NIPQUAD(addr) \
((unsigned char *)&addr)[0], \
((unsigned char *)&addr)[1], \
((unsigned char *)&addr)[2], \
((unsigned char *)&addr)[3]
/* Rewrite source and destination IP addresses and fix up the checksums */
static void nat_packet(struct packet *packet, u32 src, u32 dst)
{
u32 oldsrc, olddst;
if (packet->mac.ether_type != htons(ETHERTYPE_IP))
return;
oldsrc = packet->ip.saddr;
olddst = packet->ip.daddr;
packet->ip.saddr = src;
packet->ip.daddr = dst;
csum_replace(&packet->ip.check, oldsrc, src);
csum_replace(&packet->ip.check, olddst, dst);
switch (packet->ip.protocol) {
case IPPROTO_TCP:
csum_replace(&packet->tcp.check, oldsrc, src);
csum_replace(&packet->tcp.check, olddst, dst);
break;
case IPPROTO_UDP:
csum_replace(&packet->udp.check, oldsrc, src);
csum_replace(&packet->udp.check, olddst, dst);
break;
}
}
int main(int argc, char *argv[])
{
int netfd;
__u32 natdst, natsrc;
int size;
struct packet packet;
void *buf;
if (argv[1] && strcmp(argv[1], "--no-gso") == 0) {
argv++;
argc--;
use_gso = false;
}
if (argc < 4)
errx(1, "Usage: %s [--no-gso] ip-addr src-nat-addr dst-nat-addr [command-to-open...]", argv[0]);
netfd = setup_tun_net(str2ip(argv[1]));
natsrc = htonl(str2ip(argv[2]));
natdst = htonl(str2ip(argv[3]));
/* Eg. ssh othermachine /root/tun_gso_pipe 192.168.1.2 192.168.5.2 192.168.5.1 */
if (argc > 4)
two_way_popen(argv+4);
if (use_gso)
buf = &packet;
else
buf = &packet.mac;
for (;;) {
fd_set fds;
FD_ZERO(&fds);
FD_SET(netfd, &fds);
FD_SET(STDIN_FILENO, &fds);
select(netfd+1, &fds, NULL, NULL, NULL);
if (FD_ISSET(netfd, &fds)) {
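/* buf may start past the gso header, so only the remaining space is usable */
size = read(netfd, buf, sizeof(packet) - ((char *)buf - (char *)&packet));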
if (size <= 0)
err(1, "Reading netfd");
if (use_gso)
fprintf(stderr, "Read %u, gso = %u/%u\n", size,
packet.gso.gso_type,
packet.gso.gso_size);
nat_packet(&packet, natsrc, natdst);
if (!write_all(STDOUT_FILENO, &size, sizeof(size))
|| !write_all(STDOUT_FILENO, buf, size))
err(1, "Writing data to stdout");
}
if (FD_ISSET(STDIN_FILENO, &fds)) {
int ret;
if (!read_all(STDIN_FILENO, &size, sizeof(size)))
err(1, "Reading stdin");
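if (size <= 0 || (size_t)size > sizeof(packet) - ((char *)buf - (char *)&packet))
errx(1, "Bad packet size %i on stdin", size);
if (!read_all(STDIN_FILENO, buf, size))
err(1, "Reading %u byte packet", size);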
fprintf(stderr, "Writing %u, gso = %u/%u\n", size,
packet.gso.gso_type,
packet.gso.gso_size);
ret = write(netfd, buf, size);
if (ret != size)
err(1, "Writing data to netfd gave %i/%i",
ret, size);
}
}
}
[-- Attachment #3: tun_gso_pipe-setup.sh --]
[-- Type: application/x-shellscript, Size: 794 bytes --]
^ permalink raw reply
* [PATCH] Interface to query tun/tap features.
From: Rusty Russell @ 2008-01-16 12:07 UTC (permalink / raw)
To: netdev; +Cc: Herbert Xu, Max Krasnyansky
In-Reply-To: <200801162306.27767.rusty@rustcorp.com.au>
The problem with introducing IFF_GSO_HDR is that it needs to set dev->features
(to enable GSO, checksumming, etc), which is supposed to be done before
register_netdevice(), ie. as part of TUNSETIFF.
Unfortunately, TUNSETIFF has always just ignored flags it doesn't understand,
so there's no good way of detecting whether the kernel supports IFF_GSO_HDR.
This patch implements a TUNGETFEATURES ioctl which returns all the valid IFF
flags. It could be extended later to include other features.
Here's an example program which uses it:
#include <linux/if_tun.h>
#include <sys/types.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <err.h>
#include <stdio.h>
static struct {
unsigned int flag;
const char *name;
} known_flags[] = {
{ IFF_TUN, "TUN" },
{ IFF_TAP, "TAP" },
{ IFF_NO_PI, "NO_PI" },
{ IFF_ONE_QUEUE, "ONE_QUEUE" },
{ IFF_GSO_HDR, "GSO_HDR" },
};
int main()
{
unsigned int features, i;
int netfd = open("/dev/net/tun", O_RDWR);
if (netfd < 0)
err(1, "Opening /dev/net/tun");
if (ioctl(netfd, TUNGETFEATURES, &features) != 0) {
printf("Kernel does not support TUNGETFEATURES, guessing\n");
features = (IFF_TUN|IFF_TAP|IFF_NO_PI|IFF_ONE_QUEUE);
}
printf("Available features are: ");
for (i = 0; i < sizeof(known_flags)/sizeof(known_flags[0]); i++) {
if (features & known_flags[i].flag) {
features &= ~known_flags[i].flag;
printf("%s ", known_flags[i].name);
}
}
if (features)
printf("(UNKNOWN %#x)", features);
printf("\n");
return 0;
}
---
drivers/net/tun.c | 9 +++++++++
include/linux/if_tun.h | 2 ++
2 files changed, 11 insertions(+)
diff -r ba3c0eb8741a drivers/net/tun.c
--- a/drivers/net/tun.c Wed Jan 16 17:35:25 2008 +1100
+++ b/drivers/net/tun.c Wed Jan 16 22:11:11 2008 +1100
@@ -583,6 +779,15 @@ static int tun_chr_ioctl(struct inode *i
if (copy_to_user(argp, &ifr, sizeof(ifr)))
return -EFAULT;
return 0;
+ }
+
+ if (cmd == TUNGETFEATURES) {
+ /* Currently this just means: "what IFF flags are valid?".
+ * This is needed because we never checked for invalid flags on
+ * TUNSETIFF. This was introduced with IFF_GSO_HDR, so if a
+ * kernel doesn't have this ioctl, it doesn't have GSO header
+ * support. */
+ return put_user(IFF_ALL_FLAGS, (unsigned int __user*)argp);
}
if (!tun)
diff -r ba3c0eb8741a include/linux/if_tun.h
--- a/include/linux/if_tun.h Wed Jan 16 17:35:25 2008 +1100
+++ b/include/linux/if_tun.h Wed Jan 16 22:11:11 2008 +1100
@@ -79,13 +80,15 @@ struct tun_struct {
#define TUNSETOWNER _IOW('T', 204, int)
#define TUNSETLINK _IOW('T', 205, int)
#define TUNSETGROUP _IOW('T', 206, int)
+#define TUNGETFEATURES _IOR('T', 207, unsigned int)
/* TUNSETIFF ifr flags */
#define IFF_TUN 0x0001
#define IFF_TAP 0x0002
#define IFF_NO_PI 0x1000
#define IFF_ONE_QUEUE 0x2000
#define IFF_GSO_HDR 0x4000
+#define IFF_ALL_FLAGS (IFF_TUN|IFF_TAP|IFF_NO_PI|IFF_ONE_QUEUE|IFF_GSO_HDR)
struct tun_pi {
unsigned short flags;
^ permalink raw reply
* Re: [REGRESSION] 2.6.24-rc7: e1000: Detected Tx Unit Hang
From: David Miller @ 2008-01-16 12:25 UTC (permalink / raw)
To: slavon; +Cc: elendil, netdev, linux-kernel
In-Reply-To: <478DC824.4030600@bigtelecom.ru>
From: Badalian Vyacheslav <slavon@bigtelecom.ru>
Date: Wed, 16 Jan 2008 12:02:28 +0300
> applied to 2.6.24-rc7-git2
> Have messages
> Also have regression after apply patch.
> System may do above 800mbs traffic before patch. After its "exit polling
> mode?" (4 CPU, 1 cpu get 100% si (process ksoftirqd/0), 3 CPU is IDLE)
> After patch system was go to "exit polling mode" at above 600mbs.
What do you mean by 'system was go to "exit polling mode"'?
Please be more clear about your situation, in particular
provide every detail about what happens so that we can
properly debug this.
Thanks.
^ permalink raw reply
* Re: [REGRESSION] 2.6.24-rc7: e1000: Detected Tx Unit Hang
From: David Miller @ 2008-01-16 12:28 UTC (permalink / raw)
To: slavon; +Cc: elendil, netdev, linux-kernel
In-Reply-To: <478DC824.4030600@bigtelecom.ru>
From: Badalian Vyacheslav <slavon@bigtelecom.ru>
Date: Wed, 16 Jan 2008 12:02:28 +0300
> Also have regression after apply patch.
BTW, if you are using the e1000e driver then this initial
patch will not work.
My more recent patch for this problem will.
I include it again below for you:
[NET]: Fix TX timeout regression in Intel drivers.
This fixes a regression added by changeset
53e52c729cc169db82a6105fac7a166e10c2ec36 ("[NET]: Make ->poll()
breakout consistent in Intel ethernet drivers.")
As pointed out by Jesse Brandeburg, for three of the drivers edited
above there is breakout logic in the *_clean_tx_irq() code to prevent
running TX reclaim forever. If this occurs, we have to elide NAPI
poll completion or else those TX events will never be serviced.
Signed-off-by: David S. Miller <davem@davemloft.net>
diff --git a/drivers/net/e1000/e1000_main.c b/drivers/net/e1000/e1000_main.c
index 13d57b0..0c9a6f7 100644
--- a/drivers/net/e1000/e1000_main.c
+++ b/drivers/net/e1000/e1000_main.c
@@ -3919,7 +3919,7 @@ e1000_clean(struct napi_struct *napi, int budget)
{
struct e1000_adapter *adapter = container_of(napi, struct e1000_adapter, napi);
struct net_device *poll_dev = adapter->netdev;
- int work_done = 0;
+ int tx_cleaned = 0, work_done = 0;
/* Must NOT use netdev_priv macro here. */
adapter = poll_dev->priv;
@@ -3929,14 +3929,17 @@ e1000_clean(struct napi_struct *napi, int budget)
* simultaneously. A failure obtaining the lock means
* tx_ring[0] is currently being cleaned anyway. */
if (spin_trylock(&adapter->tx_queue_lock)) {
- e1000_clean_tx_irq(adapter,
- &adapter->tx_ring[0]);
+ tx_cleaned = e1000_clean_tx_irq(adapter,
+ &adapter->tx_ring[0]);
spin_unlock(&adapter->tx_queue_lock);
}
adapter->clean_rx(adapter, &adapter->rx_ring[0],
&work_done, budget);
+ if (tx_cleaned)
+ work_done = budget;
+
/* If budget not fully consumed, exit the polling mode */
if (work_done < budget) {
if (likely(adapter->itr_setting & 3))
diff --git a/drivers/net/e1000e/netdev.c b/drivers/net/e1000e/netdev.c
index 4a6fc74..2ab3bfb 100644
--- a/drivers/net/e1000e/netdev.c
+++ b/drivers/net/e1000e/netdev.c
@@ -1384,7 +1384,7 @@ static int e1000_clean(struct napi_struct *napi, int budget)
{
struct e1000_adapter *adapter = container_of(napi, struct e1000_adapter, napi);
struct net_device *poll_dev = adapter->netdev;
- int work_done = 0;
+ int tx_cleaned = 0, work_done = 0;
/* Must NOT use netdev_priv macro here. */
adapter = poll_dev->priv;
@@ -1394,12 +1394,15 @@ static int e1000_clean(struct napi_struct *napi, int budget)
* simultaneously. A failure obtaining the lock means
* tx_ring is currently being cleaned anyway. */
if (spin_trylock(&adapter->tx_queue_lock)) {
- e1000_clean_tx_irq(adapter);
+ tx_cleaned = e1000_clean_tx_irq(adapter);
spin_unlock(&adapter->tx_queue_lock);
}
adapter->clean_rx(adapter, &work_done, budget);
+ if (tx_cleaned)
+ work_done = budget;
+
/* If budget not fully consumed, exit the polling mode */
if (work_done < budget) {
if (adapter->itr_setting & 3)
diff --git a/drivers/net/ixgbe/ixgbe_main.c b/drivers/net/ixgbe/ixgbe_main.c
index a564916..de3f45e 100644
--- a/drivers/net/ixgbe/ixgbe_main.c
+++ b/drivers/net/ixgbe/ixgbe_main.c
@@ -1468,13 +1468,16 @@ static int ixgbe_clean(struct napi_struct *napi, int budget)
struct ixgbe_adapter *adapter = container_of(napi,
struct ixgbe_adapter, napi);
struct net_device *netdev = adapter->netdev;
- int work_done = 0;
+ int tx_cleaned = 0, work_done = 0;
/* In non-MSIX case, there is no multi-Tx/Rx queue */
- ixgbe_clean_tx_irq(adapter, adapter->tx_ring);
+ tx_cleaned = ixgbe_clean_tx_irq(adapter, adapter->tx_ring);
ixgbe_clean_rx_irq(adapter, &adapter->rx_ring[0], &work_done,
budget);
+ if (tx_cleaned)
+ work_done = budget;
+
/* If budget not fully consumed, exit the polling mode */
if (work_done < budget) {
netif_rx_complete(netdev, napi);
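
The rule this patch adds to each poll routine can be looked at in isolation.
What follows is only a minimal userspace model, not driver code:
poll_work_done() and poll_should_complete() are hypothetical names, and
tx_cleaned simply mirrors the patch's variable (nonzero means the TX clean
routine reported work, so NAPI completion is skipped for this round).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the patched poll logic; not the driver code itself. */

/* Return the work_done value the poll routine would report to NAPI. */
static int poll_work_done(int tx_cleaned, int rx_work_done, int budget)
{
	assert(budget > 0 && rx_work_done >= 0 && rx_work_done <= budget);

	/* As in the patch: TX activity forces the "budget consumed" result. */
	if (tx_cleaned)
		return budget;
	return rx_work_done;
}

/* True when the poll routine would exit polling mode (complete NAPI). */
static bool poll_should_complete(int tx_cleaned, int rx_work_done, int budget)
{
	return poll_work_done(tx_cleaned, rx_work_done, budget) < budget;
}
```

With a budget of 64 and 3 RX packets handled, the model completes the poll
only when tx_cleaned is 0; with tx_cleaned set it stays in polling mode even
though RX alone did not consume the budget.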
^ permalink raw reply related
* Re: memory leakage in bridge(kernel-2.6.23.14)
From: David Miller @ 2008-01-16 12:31 UTC (permalink / raw)
To: wyb; +Cc: linux-net, netdev
In-Reply-To: <200801161006.AZZ53966@topsec.com.cn>
From: <wyb@topsec.com.cn>
Date: Wed, 16 Jan 2008 18:04:53 +0800
> In SMP, if a bridge fdb is being created when another CPU at the same time
> delete the bridge, this newly created fdb may incur a leakage:
netdev@vger.kernel.org (CC:'d) is the proper place to report
things like this.
'linux-net' is only for general user questions about networking,
not for bug reports, patch postings, or developer discussion. The
'netdev' list is for that.
Thank you.
>
> CPU0:
>
> static void del_nbp(struct net_bridge_port *p)
> {
> /*
> * CPU1 enter br_fdb_update(), bridge port is still valid.
> */
> ......
> spin_lock_bh(&br->lock);
> br_stp_disable_port(p);
> spin_unlock_bh(&br->lock);
>
> br_ifinfo_notify(RTM_DELLINK, p);
>
> br_fdb_delete_by_port(br, p, 1);
>
> /*
> * CPU1 call fdb_create() for the being deleted bridge,
> * a fdb would be add to bridge's fdb hash table, and will never
> * be freed. because when deleting a bridge, linux flush fdb for each
> * bridge port, but this newly created fdb belong to no bridge port
> */
> ......
> }
>
> To fix this, fdb_create() should be changed to:
> {
> struct net_bridge_fdb_entry *fdb;
>
> /*
> * if the bridge port is deleted, then return.
> */
> if (!(source->state == BR_STATE_LEARNING ||
> source->state == BR_STATE_FORWARDING))
> return;
>
> fdb = kmem_cache_alloc(br_fdb_cache, GFP_ATOMIC);
> if (fdb) {
> memcpy(fdb->addr.addr, addr, ETH_ALEN);
> atomic_set(&fdb->use_count, 1);
> hlist_add_head_rcu(&fdb->hlist, head);
>
> fdb->dst = source;
> fdb->is_local = is_local;
> fdb->is_static = is_local;
> fdb->ageing_timer = jiffies;
> }
> return fdb;
> }
^ permalink raw reply
* Re: [PATCH 4/4] bonding: Fix some RTNL taking
From: Jarek Poplawski @ 2008-01-16 12:44 UTC (permalink / raw)
To: Makito SHIOKAWA; +Cc: netdev
In-Reply-To: <20080115063650.236883000@miraclelinux.com>
On 15-01-2008 07:36, Makito SHIOKAWA wrote:
> Fix some RTNL lock taking:
>
> * RTNL (mutex; may sleep) must not be taken under read_lock (spinlock; must be
> atomic). However, RTNL is taken under read_lock in bond_loadbalance_arp_mon()
> and bond_activebackup_arp_mon(). So change code to take RTNL outside of read_lock.
>
> * rtnl_unlock() calls netdev_run_todo() which takes net_todo_run_mutex, and
> rtnl_unlock() is called under read_lock in bond_mii_monitor(). So for the same
> reason as above, change code to call rtnl_unlock() outside of read_lock.
>
> Signed-off-by: Makito SHIOKAWA <mshiokawa@miraclelinux.com>
> ---
> drivers/net/bonding/bond_main.c | 24 ++++++++++--------------
> 1 file changed, 10 insertions(+), 14 deletions(-)
>
> --- a/drivers/net/bonding/bond_main.c
> +++ b/drivers/net/bonding/bond_main.c
> @@ -2372,6 +2372,7 @@ void bond_mii_monitor(struct work_struct
> struct bonding *bond = container_of(work, struct bonding,
> mii_work.work);
> unsigned long delay;
> + int need_unlock = 0;
>
> read_lock(&bond->lock);
> if (bond->kill_timers) {
> @@ -2383,13 +2384,16 @@ void bond_mii_monitor(struct work_struct
> rtnl_lock();
> read_lock(&bond->lock);
> __bond_mii_monitor(bond, 1);
> - rtnl_unlock();
> + need_unlock = 1;
Maybe I'm wrong, but since this read_lock() is released and retaken anyway,
this seems a bit better to me (why hold the RTNL longer than
needed?):
read_unlock(&bond->lock);
rtnl_unlock();
read_lock(&bond->lock);
On the other hand, the 'if (bond->kill_timers)' check should probably be
repeated after retaking this read_lock().
> }
>
> delay = ((bond->params.miimon * HZ) / 1000) ? : 1;
> - read_unlock(&bond->lock);
> if (bond->params.miimon)
> queue_delayed_work(bond->wq, &bond->mii_work, delay);
If this if () is really necessary here, it would be better placed
before "delay = ...", with both statements in one block.
> + read_unlock(&bond->lock);
> + /* rtnl_unlock() may sleep, so call it after read_unlock() */
> + if (need_unlock)
> + rtnl_unlock();
> }
Regards,
Jarek P.
^ permalink raw reply
* Re: [PATCH 2.6.23+] ingress classify to [nf]mark
From: jamal @ 2008-01-16 12:45 UTC (permalink / raw)
To: mahatma; +Cc: netdev
In-Reply-To: <478BE021.1070408@bspu.unibel.by>
On Mon, 2008-14-01 at 20:20 -0200, Dzianis Kahanovich wrote:
> jamal wrote:
[..]
> > Did that make sense?
>
> After current "#endif" - may be.
I am afraid that would be counter to expected behavior.
Default is meant to apply when no value has been defined. A mark of 0, for
example, doesn't mean "default". Let me demonstrate with the ifdefs again,
using an arbitrary example:
-----------------
#ifdef CONFIG_NET_CLS_ACT
..classify ...
.. action 1 sets mark to 0x11111
.. action 2 checks some state and conditionally let action 3 execute
.. action 3 sets mark to 0
if OK is returned set tc_index based on classid
#else // no actions compiled
..classify
.... jamal suggests: set default mark and tc_index for ingress here
#endif
// mahatma wants to set default for mark and tcindex here
// so it works for both actions and none-action code
------------------------
Let's look at the case of actions compiled in:
I have defined my policies (in user space) so that the mark can be set
to either 0 or 0x11111 depending on some runtime state.
Your default (kernel) code is now going to override my policy, which is
bad. Even in the case of OK being returned, it is wrong to set tc_index;
unfortunately, we don't have an action that can set tc_index today; if we
did, we would need to remove that setting.
Your other intent was to set the value of mark based on the value of
classid. You _can do that today already_ with no changes via a policy in
user space. You suggested adding an ifdef so you won't have to type in the
line which says how to mark, and I said that was a bad idea (we need
fewer ifdefs, not more).
For the case of no actions compiled in:
nothing can write into the values of either tc_index or mark after
classification (on ingress), so it is ok to override. If you did this
for egress as well, that would be wrong because it is expected that some
qdiscs may set or use this metadata.
I am not sure if it made more sense this time?
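
To make the "0 is not default" point concrete, here is a small sketch. It is
purely illustrative: struct fake_pkt, its mark_set flag, policy_set_mark()
and apply_default_mark() are all hypothetical and only model the reasoning
above, they are not kernel structures or functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: mark_set records whether policy assigned a mark at all. */
struct fake_pkt {
	uint32_t mark;
	bool mark_set;
};

/* What an action in the user's policy does: any value, including 0, counts. */
static void policy_set_mark(struct fake_pkt *pkt, uint32_t mark)
{
	assert(pkt != NULL);
	pkt->mark = mark;
	pkt->mark_set = true;
}

/* A default applies only when nothing was set; it must not test mark == 0. */
static void apply_default_mark(struct fake_pkt *pkt, uint32_t def)
{
	assert(pkt != NULL);
	if (!pkt->mark_set)
		pkt->mark = def;
}
```

A packet whose policy explicitly set mark 0 keeps 0 after
apply_default_mark(), while an untouched packet picks up the default. Code
that tested "mark == 0" instead could not tell those two cases apart, which
is the override problem described above.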
> What "result" are with:
> 1) no filters?
> 2) 1 filter only, with "action continue"?
Please refer to above verbosity and see if it all makes better sense.
cheers,
jamal
^ permalink raw reply