* [PROBLEM] linux-2.6.36-rc5 crash with gianfar ethernet at full line rate traffic
From: emin ak @ 2010-10-03 6:20 UTC (permalink / raw)
To: netdev; +Cc: David Miller, Kumar Gala
In-Reply-To: <AANLkTi=Kvi3u5bRp5DtRH-Pr6ALew60cPgeVEZ8V-Dnu@mail.gmail.com>
Hi all,
I am seeing a kernel crash under full line rate IP traffic with random
packet lengths.
I am using an unmodified kernel with the default SMP kernel
configuration on an MPC8572DS development board, together with a
hardware packet generator.
The test is IP forwarding between eth0 and eth1. The hardware packet
generator produces full-duplex, full line rate traffic with random
packet lengths and random payloads. After a few million packets have
passed, the kernel produces one of the two crash messages below. I
have retried this scenario many times: the crash sometimes occurs in
skb_put, but mostly in the ip_rcv function. I ran the same test on the
latest stable Linux 2.6.35.6 kernel and got the same errors.
Any comments and help are appreciated.
The crash logs are below.
Thanks.
Emin
First type of crash:
root@mpc8572ds:~# skb_over_panic: text:c0226280 len:1171 put:1171
head:eed6d000 data:eed63040 tail:0xeed6d4d3 end:0xeed63660 dev:<NULL>
------------[ cut here ]------------
kernel BUG at net/core/skbuff.c:127!
Oops: Exception in kernel mode, sig: 5 [#1]
SMP NR_CPUS=2 MPC8572 DS
last sysfs file: /sys/devices/pci0002:03/0002:03:00.0/subsystem_device
Modules linked in:
NIP: c023bdcc LR: c023bdcc CTR: c01f3ff8
REGS: effe7d70 TRAP: 0700 Not tainted (2.6.36-rc5)
MSR: 00029000 <EE,ME,CE> CR: 22028024 XER: 20000000
TASK = ef83e9a0[9] 'ksoftirqd/1' THREAD: ef856000 CPU: 1
GPR00: c023bdcc effe7e20 ef83e9a0 0000007c 00021000 ffffffff c01f7b98 c03ccf1c
GPR08: c03c69d4 c03f94b4 00c4e000 00000004 20028048 1001a108 ef211000 efb52d90
GPR16: efb52e38 efb52870 00000000 ef211800 00000008 00000009 efb52800 00000037
GPR24: ef24e180 ef2be040 00000000 ef211948 efb52b80 00000493 ef015940 ef386600
NIP [c023bdcc] skb_put+0x8c/0x94
LR [c023bdcc] skb_put+0x8c/0x94
Call Trace:
[effe7e20] [c023bdcc] skb_put+0x8c/0x94 (unreliable)
[effe7e30] [c0226280] gfar_clean_rx_ring+0x104/0x4b8
[effe7e90] [c02269dc] gfar_poll+0x3a8/0x60c
[effe7f60] [c024928c] net_rx_action+0xf8/0x1a4
[effe7fa0] [c0042524] __do_softirq+0xe0/0x178
[effe7ff0] [c000e59c] call_do_softirq+0x14/0x24
[ef857f50] [c0004840] do_softirq+0x90/0xa0
[ef857f70] [c00430e4] run_ksoftirqd+0xb4/0x164
[ef857fb0] [c00586b4] kthread+0x7c/0x80
[ef857ff0] [c000e9a8] kernel_thread+0x4c/0x68
Instruction dump:
81030098 2f800000 409e000c 3d20c037 3809a19c 3c60c037 7c8802a6 7d695b78
3863b010 90010008 4cc63182 4be016c5 <0fe00000> 48000000 9421fff0 7c0802a6
Kernel panic - not syncing: Fatal exception in interrupt
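For readers decoding the trace above: skb_over_panic is reported when
skb_put() has advanced the tail pointer past the end of the buffer. The
following is a minimal userspace sketch of that check, not the kernel
source. struct skb_model, logged_skb() and skb_put_overflows() are
hypothetical names introduced for illustration; the addresses are copied
from the log. In the log, head and tail lie in one region (0xeed6d...)
while data and end lie in another (0xeed63...); the sketch takes those
values as given.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the four pointers printed by skb_over_panic. */
struct skb_model {
	uint32_t head;
	uint32_t data;
	uint32_t tail;
	uint32_t end;
};

/* Advance tail by len, as skb_put() does, and report whether it passed end. */
bool skb_put_overflows(struct skb_model *skb, uint32_t len)
{
	skb->tail += len;
	return skb->tail > skb->end;
}

/*
 * Values from the first crash log. The logged tail already includes the
 * put length (1171), so it is rewound here to the value before the put.
 */
struct skb_model logged_skb(void)
{
	struct skb_model skb = {
		.head = 0xeed6d000u,
		.data = 0xeed63040u,
		.tail = 0xeed6d4d3u - 1171u,
		.end  = 0xeed63660u,
	};
	return skb;
}
```

Calling skb_put_overflows() on logged_skb() with len 1171 reproduces the
logged tail value and reports an overflow against the logged end.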
---------------
second type of crash:
Faulting instruction address: 0xc026c1dc
Oops: Kernel access of bad area, sig: 11 [#1]
SMP NR_CPUS=2 MPC8572 DS
last sysfs file: /sys/devices/pci0002:03/0002:03:00.0/subsystem_device
Modules linked in:
NIP: c026c1dc LR: c026bfac CTR: 00000000
REGS: effebd00 TRAP: 0300 Not tainted (2.6.36-rc5)
MSR: 00029000 <EE,ME,CE> CR: 42028042 XER: 00000000
DEAR: 0000cad8, ESR: 00000000
TASK = ef83cde0[3] 'ksoftirqd/0' THREAD: ef84a000 CPU: 0
GPR00: 00000005 effebdb0 ef83cde0 00000000 000001b9 00000000 c1008060 00000000
GPR08: 02c3f605 0000ca00 000005b9 0000ca00 b653a6c7 7af823f0 ef217000 efbab590
GPR16: efbab638 efbab070 00000000 ef217800 00000008 00000018 efbab000 00000028
GPR24: c03f971c c0410000 c0400000 c03f94b4 effea000 ef316e40 00000000 eecb685e
NIP [c026c1dc] ip_rcv+0x3f8/0x808
LR [c026bfac] ip_rcv+0x1c8/0x808
Call Trace:
[effebdb0] [c026c204] ip_rcv+0x420/0x808 (unreliable)
[effebde0] [c02482dc] __netif_receive_skb+0x2f8/0x324
[effebe10] [c02483a4] netif_receive_skb+0x9c/0xb0
[effebe30] [c0226308] gfar_clean_rx_ring+0x18c/0x4b8
[effebe90] [c02269dc] gfar_poll+0x3a8/0x60c
[effebf60] [c024928c] net_rx_action+0xf8/0x1a4
[effebfa0] [c0042524] __do_softirq+0xe0/0x178
[effebff0] [c000e59c] call_do_softirq+0x14/0x24
[ef84bf50] [c0004840] do_softirq+0x90/0xa0
[ef84bf70] [c00430e4] run_ksoftirqd+0xb4/0x164
[ef84bfb0] [c00586b4] kthread+0x7c/0x80
[ef84bff0] [c000e9a8] kernel_thread+0x4c/0x68
Instruction dump:
8148003c 318a0001 7d690194 91680038 9188003c 4bfffd78 7fa3eb78 48002a29
2f830000 40beff50 817d0048 5569003c <a00900d8> 2f800005 419e0034 2f800003
Kernel panic - not syncing: Fatal exception in interrupt
^ permalink raw reply
* Re: To GRO or not to GRO...
From: "Oleg A. Arkhangelsky" @ 2010-10-03 7:18 UTC (permalink / raw)
To: Richard Scobie; +Cc: netdev
In-Reply-To: <4CA7BD60.3010501@sauce.co.nz>
03.10.2010, 03:17, "Richard Scobie" <richard@sauce.co.nz>:
> In the README for ixgbe-2.1.4, it says:
>
> " Disable GRO when routing/bridging
> ---------------------------------
> Due to a known kernel issue, GRO must be turned off when
> routing/bridging.
> GRO can be turned off via ethtool."
>
This information is true for LRO, not GRO.
I believe that this appeared when ixgbe was converted from LRO
to GRO and all occurrences of LRO were blindly replaced by GRO
everywhere.
--
wbr, Oleg.
^ permalink raw reply
* Re: To GRO or not to GRO...
From: David Miller @ 2010-10-03 7:31 UTC (permalink / raw)
To: sysoleg; +Cc: richard, netdev
In-Reply-To: <152341286090326@web67.yandex.ru>
From: "\"Oleg A. Arkhangelsky\"" <sysoleg@yandex.ru>
Date: Sun, 03 Oct 2010 11:18:46 +0400
>
>
> 03.10.2010, 03:17, "Richard Scobie" <richard@sauce.co.nz>:
>
>> In the README for ixgbe-2.1.4, it says:
>>
>> " Disable GRO when routing/bridging
>> ---------------------------------
>> Due to a known kernel issue, GRO must be turned off when
>> routing/bridging.
>> GRO can be turned off via ethtool."
>>
>
> This information is true for LRO, not GRO.
Right.
> I believe that this appeared when ixgbe was converted from LRO
> to GRO and all occurrences of LRO were blindly replaced by GRO
> everywhere.
Someone please submit a patch to fix this, thanks.
^ permalink raw reply
* Re: sysctl_{tcp,udp,sctp}_mem overflow on 16TB system.
From: Maciej Żenczykowski @ 2010-10-03 8:20 UTC (permalink / raw)
To: Willy Tarreau
Cc: Robin Holt, David S. Miller, Alexey Kuznetsov,
Pekka Savola (ipv6), James Morris, Hideaki YOSHIFUJI,
Patrick McHardy, Vlad Yasevich, Sridhar Samudrala, linux-kernel,
netdev, linux-decnet-user, linux-sctp
In-Reply-To: <20101001203022.GA28486@1wt.eu>
Isn't INT_MAX/2 just 1GB, which is only ~0.9 seconds at 10 Gbps?
^ permalink raw reply
* Fwd: [multipathtcp] Call for contribution to middlebox survey
From: Alexander Zimmermann @ 2010-10-03 9:28 UTC (permalink / raw)
To: Netdev
In-Reply-To: <985BFFF5-B9DB-4F68-8837-24E434FD08AD@sfc.wide.ad.jp>
[-- Attachment #1: Type: text/plain, Size: 5194 bytes --]
Hi folks,
Michio Honda from the IETF Multipath TCP WG needs some help...
Alex
Begin forwarded message:
> From: Michio Honda <micchie@sfc.wide.ad.jp>
> Date: 3 October 2010 01:30:57 CEST
> To: Multipath TCP Mailing List <multipathtcp@ietf.org>, <tcpm@ietf.org>
> Cc: Mark Handley <m.handley@cs.ucl.ac.uk>
> Subject: [multipathtcp] Call for contribution to middlebox survey
>
> Hi,
>
> We are surveying middleboxes that affect TCP on the Internet, and we'd like you to contribute by running one Python script on the networks available to you, because we want data from as many paths as possible.
> The script generates test TCP traffic to a server node and detects various middlebox behaviors; for example, it detects how unknown TCP options are treated and whether sequence numbers are rewritten.
>
> - Overview of script
> The script generates test TCP traffic using raw sockets or pcap.
> The destinations of the test traffic are ports 80, 443 and 34343 on vinson3.sfc.wide.ad.jp, which is located in Japan.
> The total amount of test traffic is approximately 90 connections (not parallel), each of which uses at most approximately 2048 bytes.
>
> - System requirement
> Our script works on Mac OSX 10.5 or 10.6, Linux (kernel 2.6) and FreeBSD (7.0 or higher). It also requires Python 2.5 or higher, and libpcap.
> NOTE: if you run it in a virtual machine on Windows, please connect the guest OS via a bridge rather than NAT.
>
> How to run the experiment is described below for each OS.
>
> After the experiment, you will find 3 log files (logxxxxxxxxx.txt) in the directory where you ran the experiment.
> Please send them to us (micchie@sfc.wide.ad.jp) and tell us as much as you know about your network (e.g., product name of the broadband router, ISP name, product name of the firewall appliance, etc.).
> In addition, let us know if you are hesitant to make this information public.
> This experiment doesn't collect any traffic information other than that generated by our script.
>
> ***** How to run the experiment (Mac OSX) *****
>
> 1. Filtering RST TCP segments from the OS
> Execute the following command as root:
> ipfw add 101 deny tcp from any to vinson3.sfc.wide.ad.jp dst-port 34343,80,443 tcpflags rst
>
> NOTE: if you are already running ipfw, please add equivalent rules.
> After the experiment, you can revert with "ipfw delete 101".
>
> 2. Executing the script
> Download the script from http://www.micchie.net/software/tcpexposure/for_distrib.tar.gz, and decompress it anywhere you like (e.g., tar xzf for_distrib.tar.gz on the command line).
>
> In the for_distrib directory, execute the following command as root:
> sh run-bsd2.sh
> (This will take approximately 30 min.)
>
>
> ***** How to run the experiment (Linux) *****
>
> 1. Filtering RST TCP segments from the OS
> Execute the following command as root:
> /sbin/iptables -A OUTPUT -p tcp -d vinson3.sfc.wide.ad.jp --tcp-flags RST RST -m multiport --dports 34343,80,443 -j DROP
>
> NOTE: if you are already running iptables, please add equivalent rules.
> After the experiment, you can revert with the opposite command, using -D instead of -A.
>
> 2. Executing the script
> Download the script from http://www.micchie.net/software/tcpexposure/for_distrib.tar.gz, and decompress it anywhere you like (e.g., tar xzf for_distrib.tar.gz).
>
> In the for_distrib directory, execute the following command as root:
> sh run-linux2.sh
> (This will take approximately 30 min.)
>
>
> ***** How to run the script (FreeBSD) *****
>
> 1. Filtering RST TCP segments from the OS
> If you are using neither ipfw nor pf:
> Load the pf kernel module with the following command as root:
> kldload /boot/kernel/pf.ko
>
> Add the following 2 lines to /etc/pf.conf (please replace IFNAME with your outgoing interface name, e.g., em0):
> pass out all
> block out quick on IFNAME proto tcp to vinson3.sfc.wide.ad.jp port {34343,80,443} flags R/R
>
> Execute the following command as root:
> pfctl -e -f /etc/pf.conf
>
> If you are already running pf, please add equivalent rules.
> After the experiment, you can revert the settings by cleaning up /etc/pf.conf and executing "pfctl -d" as root.
>
> If you are already using ipfw:
> Please add the following rule to the ipfw configuration:
> deny tcp from any to vinson3.sfc.wide.ad.jp dst-port 34343,80,443 tcpflags rst
>
> 2. Executing the script
> Download the script from http://www.micchie.net/software/tcpexposure/for_distrib.tar.gz, and decompress it anywhere you like (e.g., tar xzf for_distrib.tar.gz).
>
> In the for_distrib directory, execute the following command as root:
> sh run-bsd2.sh
> (This will take approximately 30 min.)
>
>
> Best regards,
> - Michio
>
> _______________________________________________
> multipathtcp mailing list
> multipathtcp@ietf.org
> https://www.ietf.org/mailman/listinfo/multipathtcp
//
// Dipl.-Inform. Alexander Zimmermann
// Department of Computer Science, Informatik 4
// RWTH Aachen University
// Ahornstr. 55, 52056 Aachen, Germany
// phone: (49-241) 80-21422, fax: (49-241) 80-22222
// email: zimmermann@cs.rwth-aachen.de
// web: http://www.umic-mesh.net
//
[-- Attachment #2: Signed part of the message --]
[-- Type: application/pgp-signature, Size: 195 bytes --]
^ permalink raw reply
* Re: [PATCH net-next V3] net: dynamic ingress_queue allocation
From: Jarek Poplawski @ 2010-10-03 9:42 UTC (permalink / raw)
To: Eric Dumazet; +Cc: hadi, David Miller, netdev
In-Reply-To: <1286035915.2582.2472.camel@edumazet-laptop>
On Sat, Oct 02, 2010 at 06:11:55PM +0200, Eric Dumazet wrote:
> On Saturday, 2 October 2010 at 11:32 +0200, Jarek Poplawski wrote:
> > On Fri, Oct 01, 2010 at 03:56:28PM +0200, Eric Dumazet wrote:
...
> > > diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
> > > index b802078..8635110 100644
> > > --- a/net/sched/sch_api.c
> > > +++ b/net/sched/sch_api.c
...
> > > @@ -690,6 +693,8 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
> > > (new && new->flags & TCQ_F_INGRESS)) {
> > > num_q = 1;
> > > ingress = 1;
> > > + if (!dev_ingress_queue(dev))
> > > + return -ENOENT;
> >
> > Is this test really needed here?
>
> To avoid a NULL dereference some lines later.
> Do I have a guarantee it's not NULL here?
Do you have any scenario for NULL here? ;-)
Of course, it's your patch and responsibility, and I'll not guarantee,
but you could at least add a TODO comment, to check it later.
> > > @@ -1044,7 +1050,8 @@ replay:
> > > return -ENOENT;
> > > q = qdisc_leaf(p, clid);
> > > } else { /*ingress */
> > > - q = dev->ingress_queue.qdisc_sleeping;
> > > + if (dev_ingress_queue_create(dev))
> > > + q = dev_ingress_queue(dev)->qdisc_sleeping;
> >
> > I wonder if doing dev_ingress_queue_create() just before qdisc_create()
> > (and the test here) isn't more readable.
>
> Sorry, I don't understand. I want to create ingress_queue only if the user
> wants it. If we set up (egress) traffic shaping, there is no need to set up
> ingress_queue.
I mean doing both creates in one place:
> @@ -1123,11 +1130,14 @@ replay:
> create_n_graft:
...
> + if (clid == TC_H_INGRESS) {
+ if (dev_ingress_queue_create(dev))
> + q = qdisc_create(dev, dev_ingress_queue(dev), p,
> + tcm->tcm_parent, tcm->tcm_parent,
> + tca, &err);
> + else
> + err = -ENOENT;
> + } else {
> struct netdev_queue *dev_queue;
...
> Here is the V3 then.
>
> [PATCH net-next V3] net: dynamic ingress_queue allocation
>
> ingress being not used very much, and net_device->ingress_queue being
> quite a big object (128 or 256 bytes), use a dynamic allocation if
> needed (tc qdisc add dev eth0 ingress ...)
>
> dev_ingress_queue(dev) helper should be used only with RTNL taken.
>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> ---
> V3: add rcu notations & address Jarek comments
> include/linux/netdevice.h | 2 -
> include/linux/rtnetlink.h | 8 ++++++
> net/core/dev.c | 34 ++++++++++++++++++++++-------
> net/sched/sch_api.c | 42 ++++++++++++++++++++++++------------
> net/sched/sch_generic.c | 12 ++++++----
> 5 files changed, 71 insertions(+), 27 deletions(-)
>
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index ceed347..92d81ed 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -986,7 +986,7 @@ struct net_device {
> rx_handler_func_t *rx_handler;
> void *rx_handler_data;
>
> - struct netdev_queue ingress_queue; /* use two cache lines */
> + struct netdev_queue __rcu *ingress_queue;
>
> /*
> * Cache lines mostly used on transmit path
> diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h
> index 68c436b..0bb7b48 100644
> --- a/include/linux/rtnetlink.h
> +++ b/include/linux/rtnetlink.h
> @@ -6,6 +6,7 @@
> #include <linux/if_link.h>
> #include <linux/if_addr.h>
> #include <linux/neighbour.h>
> +#include <linux/netdevice.h>
>
> /* rtnetlink families. Values up to 127 are reserved for real address
> * families, values above 128 may be used arbitrarily.
> @@ -769,6 +770,13 @@ extern int lockdep_rtnl_is_held(void);
> #define rtnl_dereference(p) \
> rcu_dereference_check(p, lockdep_rtnl_is_held())
>
> +static inline struct netdev_queue *dev_ingress_queue(struct net_device *dev)
> +{
> + return rtnl_dereference(dev->ingress_queue);
I'd consider rcu_dereference_rtnl(). Btw, technically qdisc_lookup()
doesn't require rtnl, and there was a time when it was used without it
(on the xmit path).
I think you should also add a comment here explaining why rcu is used, and
that it changes only once in the dev's lifetime.
Jarek P.
PS: checkpatched or not checkpatched, that is the question... ;-)
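
The allocate-on-first-use idea discussed in this thread can be modelled
in userspace as follows. This is only a sketch under stated assumptions,
not the patch itself: struct dev_model, struct queue_model,
model_ingress_queue() and model_ingress_queue_create() are hypothetical
names, calloc() stands in for the kernel allocator, and the RTNL/RCU
protection the patch relies on is left out. It shows the property noted
above: the pointer goes from NULL to a valid queue at most once.

```c
#include <stdlib.h>

/* Hypothetical stand-in for struct netdev_queue. */
struct queue_model {
	int qdisc_attached;
};

/* Hypothetical stand-in for struct net_device: the queue is now a pointer. */
struct dev_model {
	struct queue_model *ingress_queue;
};

/* Read accessor: returns NULL until the queue has been created. */
struct queue_model *model_ingress_queue(const struct dev_model *dev)
{
	return dev->ingress_queue;
}

/* Allocate on first use; later calls return the existing queue. */
struct queue_model *model_ingress_queue_create(struct dev_model *dev)
{
	if (dev->ingress_queue)
		return dev->ingress_queue;

	/* May return NULL on allocation failure; callers must check. */
	dev->ingress_queue = calloc(1, sizeof(*dev->ingress_queue));
	return dev->ingress_queue;
}
```

A caller that only sets up egress shaping never calls the create helper,
so no memory is spent on the ingress queue for that device.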
^ permalink raw reply
* [PATCH] net: Fix the condition passed to sk_wait_event()
From: Nagendra Tomar @ 2010-10-03 9:45 UTC (permalink / raw)
To: netdev; +Cc: linux-kernel, davem
Resending, since this is the only patch now. Thanks.
---
This patch fixes the condition (3rd arg) passed to sk_wait_event() in
sk_stream_wait_memory(). The incorrect check in sk_stream_wait_memory()
causes the following soft lockup in tcp_sendmsg() when the global tcp
memory pool is exhausted.
>>> snip <<<
localhost kernel: BUG: soft lockup - CPU#3 stuck for 11s! [sshd:6429]
localhost kernel: CPU 3:
localhost kernel: RIP: 0010:[sk_stream_wait_memory+0xcd/0x200] [sk_stream_wait_memory+0xcd/0x200] sk_stream_wait_memory+0xcd/0x200
localhost kernel:
localhost kernel: Call Trace:
localhost kernel: [sk_stream_wait_memory+0x1b1/0x200] sk_stream_wait_memory+0x1b1/0x200
localhost kernel: [<ffffffff802557c0>] autoremove_wake_function+0x0/0x40
localhost kernel: [ipv6:tcp_sendmsg+0x6e6/0xe90] tcp_sendmsg+0x6e6/0xce0
localhost kernel: [sock_aio_write+0x126/0x140] sock_aio_write+0x126/0x140
localhost kernel: [xfs:do_sync_write+0xf1/0x130] do_sync_write+0xf1/0x130
localhost kernel: [<ffffffff802557c0>] autoremove_wake_function+0x0/0x40
localhost kernel: [hrtimer_start+0xe3/0x170] hrtimer_start+0xe3/0x170
localhost kernel: [vfs_write+0x185/0x190] vfs_write+0x185/0x190
localhost kernel: [sys_write+0x50/0x90] sys_write+0x50/0x90
localhost kernel: [system_call+0x7e/0x83] system_call+0x7e/0x83
>>> snip <<<
What happens is that the sk_wait_event() condition passed from
sk_stream_wait_memory() evaluates to true in the case of global tcp memory
exhaustion. This is because both sk_stream_memory_free() and vm_wait are true,
which causes sk_wait_event() to *not* call schedule_timeout().
Hence sk_stream_wait_memory() returns immediately to the caller without sleeping.
The caller then retries the allocation, which fails again, calls
sk_stream_wait_memory() again, and so on.
Signed-off-by: Nagendra Singh Tomar <tomer_iisc@yahoo.com>
---
--- linux-2.6.35.7/net/core/stream.c.orig 2010-03-25 07:37:58.000000000 +0530
+++ linux-2.6.35.7/net/core/stream.c 2010-03-25 07:42:16.000000000 +0530
@@ -144,10 +144,10 @@ int sk_stream_wait_memory(struct sock *s
set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
sk->sk_write_pending++;
- sk_wait_event(sk, &current_timeo, !sk->sk_err &&
- !(sk->sk_shutdown & SEND_SHUTDOWN) &&
- sk_stream_memory_free(sk) &&
- vm_wait);
+ sk_wait_event(sk, &current_timeo, sk->sk_err ||
+ (sk->sk_shutdown & SEND_SHUTDOWN) ||
+ (sk_stream_memory_free(sk) &&
+ !vm_wait));
sk->sk_write_pending--;
if (vm_wait) {
---
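
The effect of the changed condition can be checked with a small
userspace model. This is a sketch for illustration only: struct
wait_state and the three functions are hypothetical names, and each
kernel expression from the diff is reduced to a boolean. The one fact
it relies on is the one stated in the description above, namely that
sk_wait_event() calls schedule_timeout() only when the condition is
false.

```c
#include <stdbool.h>

/* Hypothetical snapshot of what sk_stream_wait_memory() tests. */
struct wait_state {
	bool sk_err;        /* sk->sk_err is set */
	bool send_shutdown; /* sk->sk_shutdown & SEND_SHUTDOWN */
	bool memory_free;   /* sk_stream_memory_free(sk) */
	bool vm_wait;       /* vm_wait is nonzero */
};

/* Condition passed to sk_wait_event() before the patch. */
bool old_condition(const struct wait_state *s)
{
	return !s->sk_err && !s->send_shutdown && s->memory_free && s->vm_wait;
}

/* Condition passed to sk_wait_event() after the patch. */
bool new_condition(const struct wait_state *s)
{
	return s->sk_err || s->send_shutdown ||
	       (s->memory_free && !s->vm_wait);
}

/* sk_wait_event() sleeps only when its condition evaluates to false. */
bool would_sleep(bool condition)
{
	return !condition;
}
```

With memory_free and vm_wait both true and no error or shutdown (the
global exhaustion case described above), the old condition is true and
the caller does not sleep, while the new condition is false and the
caller sleeps. With an error or shutdown pending, the new condition is
true and the wait ends immediately.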
^ permalink raw reply
* Re: [Patch] Limit sysctl_tcp_mem and sysctl_udp_mem initializers to prevent integer overflows.
From: Robin Holt @ 2010-10-03 11:16 UTC (permalink / raw)
To: Eric Dumazet
Cc: Robin Holt, Andrew Morton, Willy Tarreau, linux-kernel, netdev,
David S. Miller, Alexey Kuznetsov, Pekka Savola (ipv6),
James Morris, Hideaki YOSHIFUJI, Patrick McHardy
In-Reply-To: <1286025736.2582.1827.camel@edumazet-laptop>
On Sat, Oct 02, 2010 at 03:22:16PM +0200, Eric Dumazet wrote:
> On Saturday, 2 October 2010 at 06:24 -0500, Robin Holt wrote:
...
> Strange, you mention sctp in the changelog but I can't see the patch.
After looking at the patch, I realized it really belonged in a separate
change and sent that to the sctp mailing list without noticing I forgot
to Cc: lkml.
> We can switch the infrastructure to use "long" instead of "int", now that
> atomic_long_t primitives are available for free.
>
> Reported-by: Robin Holt <holt@sgi.com>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Reviewed-by: Robin Holt <holt@sgi.com>
^ permalink raw reply
* Re: sysctl_{tcp,udp,sctp}_mem overflow on 16TB system.
From: Robin Holt @ 2010-10-03 11:54 UTC (permalink / raw)
To: Maciej Żenczykowski
Cc: Willy Tarreau, Robin Holt, David S. Miller, Alexey Kuznetsov,
Pekka Savola (ipv6), James Morris, Hideaki YOSHIFUJI,
Patrick McHardy, Vlad Yasevich, Sridhar Samudrala, linux-kernel,
netdev, linux-decnet-user, linux-sctp
In-Reply-To: <AANLkTin5wPvFQDFrupqGs_Jbh1rgrTjMupbPUFnvrBrv@mail.gmail.com>
On Sun, Oct 03, 2010 at 01:20:32AM -0700, Maciej Żenczykowski wrote:
> Isn't INT_MAX/2 just 1GB, which is only ~0.9 seconds at 10 Gbps?
Units matter. The limits are counted in pages, so INT_MAX/2 is 1G pages,
not 1GB. We can limit to 2G pages, or 8TB.
Robin
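
The two readings of INT_MAX/2 in this exchange can be worked through
with a little arithmetic. The sketch below is illustrative only: the
4 KiB page size is an assumption (the thread does not state it, though
it is consistent with "2G pages or 8TB"), and pages_to_bytes() and
drain_time_ms() are hypothetical helper names.

```c
#include <limits.h>

/* Assumed page size; 4 KiB is not stated in the thread. */
#define ASSUMED_PAGE_SIZE 4096ULL

/* Bytes covered by a limit that is expressed in pages. */
unsigned long long pages_to_bytes(unsigned long long pages)
{
	return pages * ASSUMED_PAGE_SIZE;
}

/* Whole milliseconds needed to move `bytes` at `bits_per_sec`. */
unsigned long long drain_time_ms(unsigned long long bytes,
				 unsigned long long bits_per_sec)
{
	return bytes * 8ULL * 1000ULL / bits_per_sec;
}
```

Read as bytes, 1 GiB drains in roughly 0.86 s at 10 Gbps, which matches
the "~0.9 seconds" figure. Read as pages, INT_MAX/2 pages is just under
4 TiB and INT_MAX pages is just under 8 TiB under the assumed page size.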
^ permalink raw reply
* Re: [PATCH net-next V3] net: dynamic ingress_queue allocation
From: jamal @ 2010-10-03 13:10 UTC (permalink / raw)
To: Jarek Poplawski; +Cc: Eric Dumazet, David Miller, netdev
In-Reply-To: <20101003094221.GA2028@del.dom.local>
On Sun, 2010-10-03 at 11:42 +0200, Jarek Poplawski wrote:
> >
> > To avoid a NULL dereference some lines later.
> > Do I have a guarantee it's not NULL here?
>
> Do you have any scenario for NULL here? ;-)
This is why i called this part clever earlier ;-> It is
clever. There are several scenarios (i attempted to represent them
in the tests that Eric ran):
1) ingress qdisc has been compiled in
flags & TCQ_F_INGRESS is true
a) user trying to add ingress qdisc first time
then q is null, new is not null and this would work
b) user trying to delete already added qdisc
then q is not null, new is null
2) ingress qdisc not compiled in
Repeat #1a above, and Eric's check will bail out ..
The one thing that may have been useful is to also try
a "replace" after #1a and maybe after #2
cheers,
jamal
^ permalink raw reply
* Re: [PATCH v12 12/17] Add mp(mediate passthru) device.
From: Michael S. Tsirkin @ 2010-10-03 13:12 UTC (permalink / raw)
To: xiaohui.xin; +Cc: netdev, kvm, linux-kernel, mingo, davem, herbert, jdike
In-Reply-To: <c898c79a9a73f531d790d9983bf01b9aa05752b1.1285853725.git.xiaohui.xin@intel.com>
On Thu, Sep 30, 2010 at 10:04:30PM +0800, xiaohui.xin@intel.com wrote:
> From: Xin Xiaohui <xiaohui.xin@intel.com>
>
> The patch adds the mp (mediate passthru) device, which is now
> based on the vhost-net backend driver and provides proto_ops
> to send/receive guest buffer data from/to the guest virtio-net
> driver.
>
> Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
> Signed-off-by: Zhao Yu <yzhao81new@gmail.com>
> Reviewed-by: Jeff Dike <jdike@linux.intel.com>
So you plan to rewrite all this to make this code part of macvtap?
> ---
> drivers/vhost/mpassthru.c | 1380 +++++++++++++++++++++++++++++++++++++++++++++
> 1 files changed, 1380 insertions(+), 0 deletions(-)
> create mode 100644 drivers/vhost/mpassthru.c
>
> diff --git a/drivers/vhost/mpassthru.c b/drivers/vhost/mpassthru.c
> new file mode 100644
> index 0000000..1a114d1
> --- /dev/null
> +++ b/drivers/vhost/mpassthru.c
> @@ -0,0 +1,1380 @@
> +/*
> + * MPASSTHRU - Mediate passthrough device.
> + * Copyright (C) 2009 ZhaoYu, XinXiaohui, Dike, Jeffery G
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> + * GNU General Public License for more details.
> + *
> + */
> +
> +#define DRV_NAME "mpassthru"
> +#define DRV_DESCRIPTION "Mediate passthru device driver"
> +#define DRV_COPYRIGHT "(C) 2009 ZhaoYu, XinXiaohui, Dike, Jeffery G"
> +
> +#include <linux/compat.h>
> +#include <linux/module.h>
> +#include <linux/errno.h>
> +#include <linux/kernel.h>
> +#include <linux/major.h>
> +#include <linux/slab.h>
> +#include <linux/smp_lock.h>
> +#include <linux/poll.h>
> +#include <linux/fcntl.h>
> +#include <linux/init.h>
> +#include <linux/aio.h>
> +
> +#include <linux/skbuff.h>
> +#include <linux/netdevice.h>
> +#include <linux/etherdevice.h>
> +#include <linux/miscdevice.h>
> +#include <linux/ethtool.h>
> +#include <linux/rtnetlink.h>
> +#include <linux/if.h>
> +#include <linux/if_arp.h>
> +#include <linux/if_ether.h>
> +#include <linux/crc32.h>
> +#include <linux/nsproxy.h>
> +#include <linux/uaccess.h>
> +#include <linux/virtio_net.h>
> +#include <linux/mpassthru.h>
> +#include <net/net_namespace.h>
> +#include <net/netns/generic.h>
> +#include <net/rtnetlink.h>
> +#include <net/sock.h>
> +
> +#include <asm/system.h>
> +
> +#define COPY_THRESHOLD (L1_CACHE_BYTES * 4)
> +#define COPY_HDR_LEN (L1_CACHE_BYTES < 64 ? 64 : L1_CACHE_BYTES)
> +
> +struct frag {
> + u16 offset;
> + u16 size;
> +};
> +
> +#define HASH_BUCKETS (8192*2)
> +
> +struct page_info {
> + struct list_head list;
> + struct page_info *next;
> + struct page_info *prev;
> + struct page *pages[MAX_SKB_FRAGS];
> + struct sk_buff *skb;
> + struct page_pool *pool;
> +
> + /* The pointer relayed to skb, to indicate
> + * it's a external allocated skb or kernel
> + */
> + struct skb_ext_page ext_page;
> + /* flag to indicate read or write */
> +#define INFO_READ 0
> +#define INFO_WRITE 1
> + unsigned flags;
> + /* exact number of locked pages */
> + unsigned pnum;
> +
> + /* The fields after that is for backend
> + * driver, now for vhost-net.
> + */
> + /* the kiocb structure related to */
> + struct kiocb *iocb;
> + /* the ring descriptor index */
> + unsigned int desc_pos;
> + /* the iovec coming from backend, we only
> + * need few of them */
> + struct iovec hdr[2];
> + struct iovec iov[2];
> +};
> +
> +static struct kmem_cache *ext_page_info_cache;
> +
> +struct page_pool {
> + /* the queue for rx side */
> + struct list_head readq;
> + /* the lock to protect readq */
> + spinlock_t read_lock;
> + /* record the orignal rlimit */
> + struct rlimit o_rlim;
> + /* record the locked pages */
> + int lock_pages;
> + /* the device according to */
> + struct net_device *dev;
> + /* the mp_port according to dev */
> + struct mp_port port;
> + /* the hash_table list to find each locked page */
> + struct page_info **hash_table;
> +};
> +
> +struct mp_struct {
> + struct mp_file *mfile;
> + struct net_device *dev;
> + struct page_pool *pool;
> + struct socket socket;
> +};
> +
> +struct mp_file {
> + atomic_t count;
> + struct mp_struct *mp;
> + struct net *net;
> +};
> +
> +struct mp_sock {
> + struct sock sk;
> + struct mp_struct *mp;
> +};
> +
> +/* The main function to allocate external buffers */
> +static struct skb_ext_page *page_ctor(struct mp_port *port,
> + struct sk_buff *skb,
> + int npages)
> +{
> + int i;
> + unsigned long flags;
> + struct page_pool *pool;
> + struct page_info *info = NULL;
> +
> + if (npages != 1)
> + BUG();
> + pool = container_of(port, struct page_pool, port);
> +
> + spin_lock_irqsave(&pool->read_lock, flags);
> + if (!list_empty(&pool->readq)) {
> + info = list_first_entry(&pool->readq, struct page_info, list);
> + list_del(&info->list);
> + }
> + spin_unlock_irqrestore(&pool->read_lock, flags);
> + if (!info)
> + return NULL;
> +
> + for (i = 0; i < info->pnum; i++)
> + get_page(info->pages[i]);
> + info->skb = skb;
> + return &info->ext_page;
> +}
> +
> +static struct page_info *mp_hash_lookup(struct page_pool *pool,
> + struct page *page);
> +static struct page_info *mp_hash_delete(struct page_pool *pool,
> + struct page_info *info);
> +
> +static struct skb_ext_page *mp_lookup(struct net_device *dev,
> + struct page *page)
> +{
> + struct mp_struct *mp =
> + container_of(dev->mp_port->sock->sk, struct mp_sock, sk)->mp;
> + struct page_pool *pool = mp->pool;
> + struct page_info *info;
> +
> + info = mp_hash_lookup(pool, page);
> + if (!info)
> + return NULL;
> + return &info->ext_page;
> +}
> +
> +static int page_pool_attach(struct mp_struct *mp)
> +{
> + int rc;
> + struct page_pool *pool;
> + struct net_device *dev = mp->dev;
> +
> + /* locked by mp_mutex */
> + if (mp->pool)
> + return -EBUSY;
> +
> + pool = kzalloc(sizeof(*pool), GFP_KERNEL);
> + if (!pool)
> + return -ENOMEM;
> + rc = netdev_mp_port_prep(dev, &pool->port);
> + if (rc)
> + goto fail;
> +
> + INIT_LIST_HEAD(&pool->readq);
> + spin_lock_init(&pool->read_lock);
> + pool->hash_table = kzalloc(sizeof(struct page_info *) * HASH_BUCKETS,
> + GFP_KERNEL);
> + if (!pool->hash_table)
> + goto fail;
> +
> + dev_hold(dev);
> + pool->dev = dev;
> + pool->port.ctor = page_ctor;
> + pool->port.sock = &mp->socket;
> + pool->port.hash = mp_lookup;
> + pool->lock_pages = 0;
> +
> + /* locked by mp_mutex */
> + dev->mp_port = &pool->port;
> + mp->pool = pool;
> +
> + return 0;
> +
> +fail:
> + kfree(pool);
> + dev_put(dev);
> +
> + return rc;
> +}
> +
> +struct page_info *info_dequeue(struct page_pool *pool)
> +{
> + unsigned long flags;
> + struct page_info *info = NULL;
> + spin_lock_irqsave(&pool->read_lock, flags);
> + if (!list_empty(&pool->readq)) {
> + info = list_first_entry(&pool->readq,
> + struct page_info, list);
> + list_del(&info->list);
> + }
> + spin_unlock_irqrestore(&pool->read_lock, flags);
> + return info;
> +}
> +
> +static int set_memlock_rlimit(struct page_pool *pool, int resource,
> + unsigned long cur, unsigned long max)
> +{
> + struct rlimit new_rlim, *old_rlim;
> + int retval;
> +
> + if (resource != RLIMIT_MEMLOCK)
> + return -EINVAL;
> + new_rlim.rlim_cur = cur;
> + new_rlim.rlim_max = max;
> +
> + old_rlim = current->signal->rlim + resource;
> +
> + /* remember the old rlimit value when backend enabled */
> + pool->o_rlim.rlim_cur = old_rlim->rlim_cur;
> + pool->o_rlim.rlim_max = old_rlim->rlim_max;
> +
> + if ((new_rlim.rlim_max > old_rlim->rlim_max) &&
> + !capable(CAP_SYS_RESOURCE))
> + return -EPERM;
> +
> + retval = security_task_setrlimit(resource, &new_rlim);
> + if (retval)
> + return retval;
> +
> + task_lock(current->group_leader);
> + *old_rlim = new_rlim;
> + task_unlock(current->group_leader);
> + return 0;
> +}
> +
> +static void mp_ki_dtor(struct kiocb *iocb)
> +{
> + struct page_info *info = (struct page_info *)(iocb->private);
> + int i;
> +
> + if (info->flags == INFO_READ) {
> + for (i = 0; i < info->pnum; i++) {
> + if (info->pages[i]) {
> + set_page_dirty_lock(info->pages[i]);
> + put_page(info->pages[i]);
> + }
> + }
> + mp_hash_delete(info->pool, info);
> + if (info->skb) {
> + info->skb->destructor = NULL;
> + kfree_skb(info->skb);
> + }
> + }
> + /* Decrement the number of locked pages */
> + info->pool->lock_pages -= info->pnum;
> + kmem_cache_free(ext_page_info_cache, info);
> +
> + return;
> +}
> +
> +static struct kiocb *create_iocb(struct page_info *info, int size)
> +{
> + struct kiocb *iocb = NULL;
> +
> + iocb = info->iocb;
> + if (!iocb)
> + return iocb;
> + iocb->ki_flags = 0;
> + iocb->ki_users = 1;
> + iocb->ki_key = 0;
> + iocb->ki_ctx = NULL;
> + iocb->ki_cancel = NULL;
> + iocb->ki_retry = NULL;
> + iocb->ki_eventfd = NULL;
> + iocb->ki_pos = info->desc_pos;
> + iocb->ki_nbytes = size;
> + iocb->ki_dtor(iocb);
> + iocb->private = (void *)info;
> + iocb->ki_dtor = mp_ki_dtor;
> +
> + return iocb;
> +}
> +
> +static int page_pool_detach(struct mp_struct *mp)
> +{
> + struct page_pool *pool;
> + struct page_info *info;
> + int i;
> +
> + /* locked by mp_mutex */
> + pool = mp->pool;
> + if (!pool)
> + return -ENODEV;
> +
> + while ((info = info_dequeue(pool))) {
> + for (i = 0; i < info->pnum; i++)
> + if (info->pages[i])
> + put_page(info->pages[i]);
> + create_iocb(info, 0);
> + kmem_cache_free(ext_page_info_cache, info);
> + }
> +
> + set_memlock_rlimit(pool, RLIMIT_MEMLOCK,
> + pool->o_rlim.rlim_cur,
> + pool->o_rlim.rlim_max);
> +
> + /* locked by mp_mutex */
> + pool->dev->mp_port = NULL;
> + dev_put(pool->dev);
> +
> + mp->pool = NULL;
> + kfree(pool->hash_table);
> + kfree(pool);
> + return 0;
> +}
> +
> +static void __mp_detach(struct mp_struct *mp)
> +{
> + mp->mfile = NULL;
> +
> + dev_change_flags(mp->dev, mp->dev->flags & ~IFF_UP);
> + page_pool_detach(mp);
> + dev_change_flags(mp->dev, mp->dev->flags | IFF_UP);
> +
> + /* Drop the extra count on the net device */
> + dev_put(mp->dev);
> +}
> +
> +static DEFINE_MUTEX(mp_mutex);
> +
> +static void mp_detach(struct mp_struct *mp)
> +{
> + mutex_lock(&mp_mutex);
> + __mp_detach(mp);
> + mutex_unlock(&mp_mutex);
> +}
> +
> +static struct mp_struct *mp_get(struct mp_file *mfile)
> +{
> + struct mp_struct *mp = NULL;
> + if (atomic_inc_not_zero(&mfile->count))
> + mp = mfile->mp;
> +
> + return mp;
> +}
> +
> +static void mp_put(struct mp_file *mfile)
> +{
> + if (atomic_dec_and_test(&mfile->count)) {
> + if (!rtnl_is_locked()) {
> + rtnl_lock();
> + mp_detach(mfile->mp);
> + rtnl_unlock();
> + } else
> + mp_detach(mfile->mp);
> + }
> +}
> +
> +static void iocb_tag(struct kiocb *iocb)
> +{
> + iocb->ki_flags = 1;
> +}
> +
> +/* The callback to destruct the external buffers or skb */
> +static void page_dtor(struct skb_ext_page *ext_page)
> +{
> + struct page_info *info;
> + struct page_pool *pool;
> + struct sock *sk;
> + struct sk_buff *skb;
> +
> + if (!ext_page)
> + return;
> + info = container_of(ext_page, struct page_info, ext_page);
> + if (!info)
> + return;
> + pool = info->pool;
> + skb = info->skb;
> +
> + if (info->flags == INFO_READ) {
> + create_iocb(info, 0);
> + return;
> + }
> +
> + /* For transmit, we should wait for the hardware to finish the DMA.
> + * Queue the notifier to wake up the backend driver.
> + */
> +
> + iocb_tag(info->iocb);
> + sk = pool->port.sock->sk;
> + sk->sk_write_space(sk);
> +
> + return;
> +}
> +
> +/* For small external buffer transmits, we don't need to call
> + * get_user_pages().
> + */
> +static struct page_info *alloc_small_page_info(struct page_pool *pool,
> + struct kiocb *iocb, int total)
> +{
> + struct page_info *info =
> + kmem_cache_alloc(ext_page_info_cache, GFP_KERNEL);
> +
> + if (!info)
> + return NULL;
> + info->ext_page.dtor = page_dtor;
> + info->pool = pool;
> + info->flags = INFO_WRITE;
> + info->iocb = iocb;
> + info->pnum = 0;
> + return info;
> +}
> +
> +typedef u32 key_mp_t;
> +static inline key_mp_t mp_hash(struct page *page, int buckets)
> +{
> + key_mp_t k;
> +#if BITS_PER_LONG == 64
> + k = ((((unsigned long)page << 32UL) >> 32UL) /
> + sizeof(struct page)) % buckets ;
> +#elif BITS_PER_LONG == 32
> + k = ((unsigned long)page / sizeof(struct page)) % buckets;
> +#endif
> +
> + return k;
> +}
> +
> +static void mp_hash_insert(struct page_pool *pool,
> + struct page *page, struct page_info *page_info)
> +{
> + struct page_info *tmp;
> + key_mp_t key = mp_hash(page, HASH_BUCKETS);
> + if (!pool->hash_table[key]) {
> + pool->hash_table[key] = page_info;
> + return;
> + }
> +
> + tmp = pool->hash_table[key];
> + while (tmp->next)
> + tmp = tmp->next;
> +
> + tmp->next = page_info;
> + page_info->prev = tmp;
> + return;
> +}
> +
> +static struct page_info *mp_hash_delete(struct page_pool *pool,
> + struct page_info *info)
> +{
> + key_mp_t key = mp_hash(info->pages[0], HASH_BUCKETS);
> + struct page_info *tmp = NULL;
> +
> + tmp = pool->hash_table[key];
> + while (tmp) {
> + if (tmp == info) {
> + if (!tmp->prev) {
> + pool->hash_table[key] = tmp->next;
> + if (tmp->next)
> + tmp->next->prev = NULL;
> + } else {
> + tmp->prev->next = tmp->next;
> + if (tmp->next)
> + tmp->next->prev = tmp->prev;
> + }
> + return tmp;
> + }
> + tmp = tmp->next;
> + }
> + return tmp;
> +}
> +
> +static struct page_info *mp_hash_lookup(struct page_pool *pool,
> + struct page *page)
> +{
> + key_mp_t key = mp_hash(page, HASH_BUCKETS);
> + struct page_info *tmp = NULL;
> +
> + int i;
> + tmp = pool->hash_table[key];
> + while (tmp) {
> + for (i = 0; i < tmp->pnum; i++) {
> + if (tmp->pages[i] == page)
> + return tmp;
> + }
> + tmp = tmp->next;
> + }
> + return tmp;
> +}
> +
> +/* The main function to transform the guest user space address
> + * to host kernel address via get_user_pages(). Thus the hardware
> + * can do DMA directly to the external buffer address.
> + */
> +static struct page_info *alloc_page_info(struct page_pool *pool,
> + struct kiocb *iocb, struct iovec *iov,
> + int count, struct frag *frags,
> + int npages, int total)
> +{
> + int rc;
> + int i, j, n = 0;
> + int len;
> + unsigned long base, lock_limit;
> + struct page_info *info = NULL;
> +
> + lock_limit = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur;
> + lock_limit >>= PAGE_SHIFT;
> +
> + if (pool->lock_pages + count > lock_limit && npages) {
> + printk(KERN_INFO "exceed the locked memory rlimit.");
> + return NULL;
> + }
> +
> + info = kmem_cache_alloc(ext_page_info_cache, GFP_KERNEL);
> +
> + if (!info)
> + return NULL;
> + info->skb = NULL;
> + info->next = info->prev = NULL;
> +
> + for (i = j = 0; i < count; i++) {
> + base = (unsigned long)iov[i].iov_base;
> + len = iov[i].iov_len;
> +
> + if (!len)
> + continue;
> + n = ((base & ~PAGE_MASK) + len + ~PAGE_MASK) >> PAGE_SHIFT;
> +
> + rc = get_user_pages_fast(base, n, npages ? 1 : 0,
> + &info->pages[j]);
> + if (rc != n)
> + goto failed;
> +
> + while (n--) {
> + frags[j].offset = base & ~PAGE_MASK;
> + frags[j].size = min_t(int, len,
> + PAGE_SIZE - frags[j].offset);
> + len -= frags[j].size;
> + base += frags[j].size;
> + j++;
> + }
> + }
> +
> +#ifdef CONFIG_HIGHMEM
> + if (npages && !(dev->features & NETIF_F_HIGHDMA)) {
> + for (i = 0; i < j; i++) {
> + if (PageHighMem(info->pages[i]))
> + goto failed;
> + }
> + }
> +#endif
> +
> + info->ext_page.dtor = page_dtor;
> + info->ext_page.page = info->pages[0];
> + info->pool = pool;
> + info->pnum = j;
> + info->iocb = iocb;
> + if (!npages)
> + info->flags = INFO_WRITE;
> + else
> + info->flags = INFO_READ;
> +
> + if (info->flags == INFO_READ) {
> + if (frags[0].offset == 0 && iocb->ki_iovec[0].iov_len) {
> + frags[0].offset = iocb->ki_iovec[0].iov_len;
> + pool->port.vnet_hlen = iocb->ki_iovec[0].iov_len;
> + }
> + for (i = 0; i < j; i++)
> + mp_hash_insert(pool, info->pages[i], info);
> + }
> + /* increment the number of locked pages */
> + pool->lock_pages += j;
> + return info;
> +
> +failed:
> + for (i = 0; i < j; i++)
> + put_page(info->pages[i]);
> +
> + kmem_cache_free(ext_page_info_cache, info);
> +
> + return NULL;
> +}
> +
> +static void mp_sock_destruct(struct sock *sk)
> +{
> + struct mp_struct *mp = container_of(sk, struct mp_sock, sk)->mp;
> + kfree(mp);
> +}
> +
> +static void mp_sock_state_change(struct sock *sk)
> +{
> + if (sk_has_sleeper(sk))
> + wake_up_interruptible_sync_poll(sk->sk_sleep, POLLIN);
> +}
> +
> +static void mp_sock_write_space(struct sock *sk)
> +{
> + if (sk_has_sleeper(sk))
> + wake_up_interruptible_sync_poll(sk->sk_sleep, POLLOUT);
> +}
> +
> +static void mp_sock_data_ready(struct sock *sk, int coming)
> +{
> + struct mp_struct *mp = container_of(sk, struct mp_sock, sk)->mp;
> + struct page_pool *pool = NULL;
> + struct sk_buff *skb = NULL;
> + struct page_info *info = NULL;
> + int len;
> +
> + pool = mp->pool;
> + if (!pool)
> + return;
> +
> + while ((skb = skb_dequeue(&sk->sk_receive_queue)) != NULL) {
> + struct page *page;
> + int off;
> + int size = 0, i = 0;
> + struct skb_shared_info *shinfo = skb_shinfo(skb);
> + struct skb_ext_page *ext_page =
> + (struct skb_ext_page *)(shinfo->destructor_arg);
> + struct virtio_net_hdr_mrg_rxbuf hdr = {
> + .hdr.flags = 0,
> + .hdr.gso_type = VIRTIO_NET_HDR_GSO_NONE
> + };
> +
> + if (skb->ip_summed == CHECKSUM_COMPLETE)
> + printk(KERN_INFO "Complete checksum occurs\n");
> +
> + if (shinfo->frags[0].page == ext_page->page) {
> + info = container_of(ext_page,
> + struct page_info,
> + ext_page);
> + if (shinfo->nr_frags)
> + hdr.num_buffers = shinfo->nr_frags;
> + else
> + hdr.num_buffers = shinfo->nr_frags + 1;
> + } else {
> + info = container_of(ext_page,
> + struct page_info,
> + ext_page);
> + hdr.num_buffers = shinfo->nr_frags + 1;
> + }
> + skb_push(skb, ETH_HLEN);
> +
> + if (skb_is_gso(skb)) {
> + hdr.hdr.hdr_len = skb_headlen(skb);
> + hdr.hdr.gso_size = shinfo->gso_size;
> + if (shinfo->gso_type & SKB_GSO_TCPV4)
> + hdr.hdr.gso_type = VIRTIO_NET_HDR_GSO_TCPV4;
> + else if (shinfo->gso_type & SKB_GSO_TCPV6)
> + hdr.hdr.gso_type = VIRTIO_NET_HDR_GSO_TCPV6;
> + else if (shinfo->gso_type & SKB_GSO_UDP)
> + hdr.hdr.gso_type = VIRTIO_NET_HDR_GSO_UDP;
> + else
> + BUG();
> + if (shinfo->gso_type & SKB_GSO_TCP_ECN)
> + hdr.hdr.gso_type |= VIRTIO_NET_HDR_GSO_ECN;
> +
> + } else
> + hdr.hdr.gso_type = VIRTIO_NET_HDR_GSO_NONE;
> +
> + if (skb->ip_summed == CHECKSUM_PARTIAL) {
> + hdr.hdr.flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
> + hdr.hdr.csum_start =
> + skb->csum_start - skb_headroom(skb);
> + hdr.hdr.csum_offset = skb->csum_offset;
> + }
> +
> + off = info->hdr[0].iov_len;
> + len = memcpy_toiovec(info->iov, (unsigned char *)&hdr, off);
> + if (len) {
> + pr_debug("Unable to write vnet_hdr at addr '%p': '%d'\n",
> + info->iov, len);
> + goto clean;
> + }
> +
> + memcpy_toiovec(info->iov, skb->data, skb_headlen(skb));
> +
> + info->iocb->ki_left = hdr.num_buffers;
> + if (shinfo->frags[0].page == ext_page->page) {
> + size = shinfo->frags[0].size +
> + shinfo->frags[0].page_offset - off;
> + i = 1;
> + } else {
> + size = skb_headlen(skb);
> + i = 0;
> + }
> + create_iocb(info, off + size);
> + for (i = i; i < shinfo->nr_frags; i++) {
> + page = shinfo->frags[i].page;
> + info = mp_hash_lookup(pool, shinfo->frags[i].page);
> + create_iocb(info, shinfo->frags[i].size);
> + }
> + info->skb = skb;
> + shinfo->nr_frags = 0;
> + shinfo->destructor_arg = NULL;
> + continue;
> +clean:
> + kfree_skb(skb);
> + for (i = 0; i < info->pnum; i++)
> + put_page(info->pages[i]);
> + kmem_cache_free(ext_page_info_cache, info);
> + }
> + return;
> +}
> +
> +static inline struct sk_buff *mp_alloc_skb(struct sock *sk, size_t prepad,
> + size_t len, size_t linear,
> + int noblock, int *err)
> +{
> + struct sk_buff *skb;
> +
> + /* Under a page? Don't bother with paged skb. */
> + if (prepad + len < PAGE_SIZE || !linear)
> + linear = len;
> +
> + skb = sock_alloc_send_pskb(sk, prepad + linear, len - linear, noblock,
> + err);
> + if (!skb)
> + return NULL;
> +
> + skb_reserve(skb, prepad);
> + skb_put(skb, linear);
> + skb->data_len = len - linear;
> + skb->len += len - linear;
> +
> + return skb;
> +}
> +
> +static int mp_skb_from_vnet_hdr(struct sk_buff *skb,
> + struct virtio_net_hdr *vnet_hdr)
> +{
> + unsigned short gso_type = 0;
> + if (vnet_hdr->gso_type != VIRTIO_NET_HDR_GSO_NONE) {
> + switch (vnet_hdr->gso_type & ~VIRTIO_NET_HDR_GSO_ECN) {
> + case VIRTIO_NET_HDR_GSO_TCPV4:
> + gso_type = SKB_GSO_TCPV4;
> + break;
> + case VIRTIO_NET_HDR_GSO_TCPV6:
> + gso_type = SKB_GSO_TCPV6;
> + break;
> + case VIRTIO_NET_HDR_GSO_UDP:
> + gso_type = SKB_GSO_UDP;
> + break;
> + default:
> + return -EINVAL;
> + }
> +
> + if (vnet_hdr->gso_type & VIRTIO_NET_HDR_GSO_ECN)
> + gso_type |= SKB_GSO_TCP_ECN;
> +
> + if (vnet_hdr->gso_size == 0)
> + return -EINVAL;
> + }
> +
> + if (vnet_hdr->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) {
> + if (!skb_partial_csum_set(skb, vnet_hdr->csum_start,
> + vnet_hdr->csum_offset))
> + return -EINVAL;
> + }
> +
> + if (vnet_hdr->gso_type != VIRTIO_NET_HDR_GSO_NONE) {
> + skb_shinfo(skb)->gso_size = vnet_hdr->gso_size;
> + skb_shinfo(skb)->gso_type = gso_type;
> +
> + /* Header must be checked, and gso_segs computed. */
> + skb_shinfo(skb)->gso_type |= SKB_GSO_DODGY;
> + skb_shinfo(skb)->gso_segs = 0;
> + }
> + return 0;
> +}
> +
> +static int mp_sendmsg(struct kiocb *iocb, struct socket *sock,
> + struct msghdr *m, size_t total_len)
> +{
> + struct mp_struct *mp = container_of(sock->sk, struct mp_sock, sk)->mp;
> + struct virtio_net_hdr vnet_hdr = {0};
> + int hdr_len = 0;
> + struct page_pool *pool;
> + struct iovec *iov = m->msg_iov;
> + struct page_info *info = NULL;
> + struct frag frags[MAX_SKB_FRAGS];
> + struct sk_buff *skb;
> + int count = m->msg_iovlen;
> + int total = 0, header, n, i, len, rc;
> + unsigned long base;
> +
> + pool = mp->pool;
> + if (!pool)
> + return -ENODEV;
> +
> + total = iov_length(iov, count);
> +
> + if (total < ETH_HLEN)
> + return -EINVAL;
> +
> + if (total <= COPY_THRESHOLD)
> + goto copy;
> +
> + n = 0;
> + for (i = 0; i < count; i++) {
> + base = (unsigned long)iov[i].iov_base;
> + len = iov[i].iov_len;
> + if (!len)
> + continue;
> + n += ((base & ~PAGE_MASK) + len + ~PAGE_MASK) >> PAGE_SHIFT;
> + if (n > MAX_SKB_FRAGS)
> + return -EINVAL;
> + }
> +
> +copy:
> + hdr_len = sizeof(vnet_hdr);
> + if ((total - iocb->ki_iovec[0].iov_len) < 0)
> + return -EINVAL;
> +
> + rc = memcpy_fromiovecend((void *)&vnet_hdr, iocb->ki_iovec, 0, hdr_len);
> + if (rc < 0)
> + return -EINVAL;
> +
> + if ((vnet_hdr.flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) &&
> + vnet_hdr.csum_start + vnet_hdr.csum_offset + 2 >
> + vnet_hdr.hdr_len)
> + vnet_hdr.hdr_len = vnet_hdr.csum_start +
> + vnet_hdr.csum_offset + 2;
> +
> + if (vnet_hdr.hdr_len > total)
> + return -EINVAL;
> +
> + header = total > COPY_THRESHOLD ? COPY_HDR_LEN : total;
> +
> + skb = mp_alloc_skb(sock->sk, NET_IP_ALIGN, header,
> + iocb->ki_iovec[0].iov_len, 1, &rc);
> +
> + if (!skb)
> + goto drop;
> +
> + skb_set_network_header(skb, ETH_HLEN);
> + memcpy_fromiovec(skb->data, iov, header);
> +
> + skb_reset_mac_header(skb);
> + skb->protocol = eth_hdr(skb)->h_proto;
> +
> + rc = mp_skb_from_vnet_hdr(skb, &vnet_hdr);
> + if (rc)
> + goto drop;
> +
> + if (header == total) {
> + rc = total;
> + info = alloc_small_page_info(pool, iocb, total);
> + } else {
> + info = alloc_page_info(pool, iocb, iov, count, frags, 0, total);
> + if (info)
> + for (i = 0; i < info->pnum; i++) {
> + skb_add_rx_frag(skb, i, info->pages[i],
> + frags[i].offset, frags[i].size);
> + info->pages[i] = NULL;
> + }
> + }
> + if (!pool->lock_pages)
> + sock->sk->sk_state_change(sock->sk);
> +
> + if (info != NULL) {
> + info->desc_pos = iocb->ki_pos;
> + info->skb = skb;
> + skb_shinfo(skb)->destructor_arg = &info->ext_page;
> + skb->dev = mp->dev;
> + create_iocb(info, total);
> + dev_queue_xmit(skb);
> + return 0;
> + }
> +drop:
> + kfree_skb(skb);
> + if (info) {
> + for (i = 0; i < info->pnum; i++)
> + put_page(info->pages[i]);
> + kmem_cache_free(ext_page_info_cache, info);
> + }
> + mp->dev->stats.tx_dropped++;
> + return -ENOMEM;
> +}
> +
> +static int mp_recvmsg(struct kiocb *iocb, struct socket *sock,
> + struct msghdr *m, size_t total_len,
> + int flags)
> +{
> + struct mp_struct *mp = container_of(sock->sk, struct mp_sock, sk)->mp;
> + struct page_pool *pool;
> + struct iovec *iov = m->msg_iov;
> + int count = m->msg_iovlen;
> + int npages, payload;
> + struct page_info *info;
> + struct frag frags[MAX_SKB_FRAGS];
> + unsigned long base;
> + int i, len;
> + unsigned long flag;
> +
> + if (!(flags & MSG_DONTWAIT))
> + return -EINVAL;
> +
> + pool = mp->pool;
> + if (!pool)
> + return -EINVAL;
> +
> + /* Error detection in case of an invalid external buffer */
> + if (count > 2 && iov[1].iov_len < pool->port.hdr_len &&
> + mp->dev->features & NETIF_F_SG) {
> + return -EINVAL;
> + }
> +
> + npages = pool->port.npages;
> + payload = pool->port.data_len;
> +
> + /* If the KVM guest virtio-net FE driver uses the SG feature */
> + if (count > 2) {
> + for (i = 2; i < count; i++) {
> + base = (unsigned long)iov[i].iov_base & ~PAGE_MASK;
> + len = iov[i].iov_len;
> + if (npages == 1)
> + len = min_t(int, len, PAGE_SIZE - base);
> + else if (base)
> + break;
> + payload -= len;
> + if (payload <= 0)
> + goto proceed;
> + if (npages == 1 || (len & ~PAGE_MASK))
> + break;
> + }
> + }
> +
> + if ((((unsigned long)iov[1].iov_base & ~PAGE_MASK)
> + - NET_SKB_PAD - NET_IP_ALIGN) >= 0)
> + goto proceed;
> +
> + return -EINVAL;
> +
> +proceed:
> + /* skip the virtnet head */
> + if (count > 1) {
> + iov++;
> + count--;
> + }
> +
> + if (!pool->lock_pages) {
> + set_memlock_rlimit(pool, RLIMIT_MEMLOCK,
> + iocb->ki_user_data * 4096 * 2,
> + iocb->ki_user_data * 4096 * 2);
> + }
> +
> + /* Translate address to kernel */
> + info = alloc_page_info(pool, iocb, iov, count, frags, npages, 0);
> + if (!info)
> + return -ENOMEM;
> + info->hdr[0].iov_base = iocb->ki_iovec[0].iov_base;
> + info->hdr[0].iov_len = iocb->ki_iovec[0].iov_len;
> + iocb->ki_iovec[0].iov_len = 0;
> + iocb->ki_left = 0;
> + info->desc_pos = iocb->ki_pos;
> +
> + if (count > 1) {
> + iov--;
> + count++;
> + }
> +
> + memcpy(info->iov, iov, sizeof(struct iovec) * count);
> +
> + spin_lock_irqsave(&pool->read_lock, flag);
> + list_add_tail(&info->list, &pool->readq);
> + spin_unlock_irqrestore(&pool->read_lock, flag);
> +
> + return 0;
> +}
> +
> +/* Ops structure to mimic raw sockets with mp device */
> +static const struct proto_ops mp_socket_ops = {
> + .sendmsg = mp_sendmsg,
> + .recvmsg = mp_recvmsg,
> +};
> +
> +static struct proto mp_proto = {
> + .name = "mp",
> + .owner = THIS_MODULE,
> + .obj_size = sizeof(struct mp_sock),
> +};
> +
> +static int mp_chr_open(struct inode *inode, struct file * file)
> +{
> + struct mp_file *mfile;
> + cycle_kernel_lock();
> +
> + pr_debug("mp: mp_chr_open\n");
> + mfile = kzalloc(sizeof(*mfile), GFP_KERNEL);
> + if (!mfile)
> + return -ENOMEM;
> + atomic_set(&mfile->count, 0);
> + mfile->mp = NULL;
> + mfile->net = get_net(current->nsproxy->net_ns);
> + file->private_data = mfile;
> + return 0;
> +}
> +
> +static int mp_attach(struct mp_struct *mp, struct file *file)
> +{
> + struct mp_file *mfile = file->private_data;
> + int err;
> +
> + netif_tx_lock_bh(mp->dev);
> +
> + err = -EINVAL;
> +
> + if (mfile->mp)
> + goto out;
> +
> + err = -EBUSY;
> + if (mp->mfile)
> + goto out;
> +
> + err = 0;
> + mfile->mp = mp;
> + mp->mfile = mfile;
> + mp->socket.file = file;
> + dev_hold(mp->dev);
> + sock_hold(mp->socket.sk);
> + atomic_inc(&mfile->count);
> +
> +out:
> + netif_tx_unlock_bh(mp->dev);
> + return err;
> +}
> +
> +static int do_unbind(struct mp_file *mfile)
> +{
> + struct mp_struct *mp = mp_get(mfile);
> +
> + if (!mp)
> + return -EINVAL;
> +
> + mp_detach(mp);
> + sock_put(mp->socket.sk);
> + mp_put(mfile);
> + return 0;
> +}
> +
> +static long mp_chr_ioctl(struct file *file, unsigned int cmd,
> + unsigned long arg)
> +{
> + struct mp_file *mfile = file->private_data;
> + struct mp_struct *mp;
> + struct net_device *dev;
> + void __user* argp = (void __user *)arg;
> + struct ifreq ifr;
> + struct sock *sk;
> + int ret;
> +
> + ret = -EINVAL;
> +
> + switch (cmd) {
> + case MPASSTHRU_BINDDEV:
> + ret = -EFAULT;
> + if (copy_from_user(&ifr, argp, sizeof ifr))
> + break;
> +
> + ifr.ifr_name[IFNAMSIZ-1] = '\0';
> +
> + ret = -ENODEV;
> +
> + rtnl_lock();
> + dev = dev_get_by_name(mfile->net, ifr.ifr_name);
> + if (!dev) {
> + rtnl_unlock();
> + break;
> + }
> +
> + mutex_lock(&mp_mutex);
> +
> + ret = -EBUSY;
> +
> + /* the device can only be bound once */
> + if (dev_is_mpassthru(dev))
> + goto err_dev_put;
> +
> + mp = mfile->mp;
> + if (mp)
> + goto err_dev_put;
> +
> + mp = kzalloc(sizeof(*mp), GFP_KERNEL);
> + if (!mp) {
> + ret = -ENOMEM;
> + goto err_dev_put;
> + }
> + mp->dev = dev;
> + ret = -ENOMEM;
> +
> + sk = sk_alloc(mfile->net, AF_UNSPEC, GFP_KERNEL, &mp_proto);
> + if (!sk)
> + goto err_free_mp;
> +
> + init_waitqueue_head(&mp->socket.wait);
> + mp->socket.ops = &mp_socket_ops;
> + sock_init_data(&mp->socket, sk);
> + sk->sk_sndbuf = INT_MAX;
> + container_of(sk, struct mp_sock, sk)->mp = mp;
> +
> + sk->sk_destruct = mp_sock_destruct;
> + sk->sk_data_ready = mp_sock_data_ready;
> + sk->sk_write_space = mp_sock_write_space;
> + sk->sk_state_change = mp_sock_state_change;
> + ret = mp_attach(mp, file);
> + if (ret < 0)
> + goto err_free_sk;
> +
> + ret = page_pool_attach(mp);
> + if (ret < 0)
> + goto err_free_sk;
> + dev_change_flags(mp->dev, mp->dev->flags & (~IFF_UP));
> + dev_change_flags(mp->dev, mp->dev->flags | IFF_UP);
> + sk->sk_state_change(sk);
> +out:
> + mutex_unlock(&mp_mutex);
> + rtnl_unlock();
> + break;
> +err_free_sk:
> + sk_free(sk);
> +err_free_mp:
> + kfree(mp);
> +err_dev_put:
> + dev_put(dev);
> + goto out;
> +
> + case MPASSTHRU_UNBINDDEV:
> + rtnl_lock();
> + ret = do_unbind(mfile);
> + rtnl_unlock();
> + break;
> +
> + default:
> + break;
> + }
> + return ret;
> +}
> +
> +static unsigned int mp_chr_poll(struct file *file, poll_table * wait)
> +{
> + struct mp_file *mfile = file->private_data;
> + struct mp_struct *mp = mp_get(mfile);
> + struct sock *sk;
> + unsigned int mask = 0;
> +
> + if (!mp)
> + return POLLERR;
> +
> + sk = mp->socket.sk;
> +
> + poll_wait(file, &mp->socket.wait, wait);
> +
> + if (!skb_queue_empty(&sk->sk_receive_queue))
> + mask |= POLLIN | POLLRDNORM;
> +
> + if (sock_writeable(sk) ||
> + (!test_and_set_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags) &&
> + sock_writeable(sk)))
> + mask |= POLLOUT | POLLWRNORM;
> +
> + if (mp->dev->reg_state != NETREG_REGISTERED)
> + mask = POLLERR;
> +
> + mp_put(mfile);
> + return mask;
> +}
> +
> +static ssize_t mp_chr_aio_write(struct kiocb *iocb, const struct iovec *iov,
> + unsigned long count, loff_t pos)
> +{
> + struct file *file = iocb->ki_filp;
> + struct mp_struct *mp = mp_get(file->private_data);
> + struct sock *sk;
> + struct sk_buff *skb;
> + int len, err;
> + ssize_t result = 0;
> +
> + if (!mp)
> + return -EBADFD;
> +
> + sk = mp->socket.sk;
> +
> + /* currently, async is not supported.
> + * but we may support real async aio from user application,
> + * maybe qemu virtio-net backend.
> + */
> + if (!is_sync_kiocb(iocb))
> + return -EFAULT;
> +
> + len = iov_length(iov, count);
> +
> + if (unlikely(len < ETH_HLEN))
> + return -EINVAL;
> +
> + skb = sock_alloc_send_skb(sk, len + NET_IP_ALIGN,
> + file->f_flags & O_NONBLOCK, &err);
> +
> + if (!skb)
> + return -ENOMEM;
> +
> + skb_reserve(skb, NET_IP_ALIGN);
> + skb_put(skb, len);
> +
> + if (skb_copy_datagram_from_iovec(skb, 0, iov, 0, len)) {
> + kfree_skb(skb);
> + return -EAGAIN;
> + }
> +
> + skb->protocol = eth_type_trans(skb, mp->dev);
> + skb->dev = mp->dev;
> +
> + dev_queue_xmit(skb);
> +
> + mp_put(file->private_data);
> + return result;
> +}
> +
> +static int mp_chr_close(struct inode *inode, struct file *file)
> +{
> + struct mp_file *mfile = file->private_data;
> +
> + /*
> + * Ignore return value since an error only means there was nothing to
> + * do
> + */
> + do_unbind(mfile);
> +
> + put_net(mfile->net);
> + kfree(mfile);
> +
> + return 0;
> +}
> +
> +#ifdef CONFIG_COMPAT
> +static long mp_chr_compat_ioctl(struct file *f, unsigned int ioctl,
> + unsigned long arg)
> +{
> + return mp_chr_ioctl(f, ioctl, (unsigned long)compat_ptr(arg));
> +}
> +#endif
> +
> +static const struct file_operations mp_fops = {
> + .owner = THIS_MODULE,
> + .llseek = no_llseek,
> + .write = do_sync_write,
> + .aio_write = mp_chr_aio_write,
> + .poll = mp_chr_poll,
> + .unlocked_ioctl = mp_chr_ioctl,
> +#ifdef CONFIG_COMPAT
> + .compat_ioctl = mp_chr_compat_ioctl,
> +#endif
> + .open = mp_chr_open,
> + .release = mp_chr_close,
> +};
> +
> +static struct miscdevice mp_miscdev = {
> + .minor = MISC_DYNAMIC_MINOR,
> + .name = "mp",
> + .nodename = "net/mp",
> + .fops = &mp_fops,
> +};
> +
> +static int mp_device_event(struct notifier_block *unused,
> + unsigned long event, void *ptr)
> +{
> + struct net_device *dev = ptr;
> + struct mp_port *port;
> + struct mp_struct *mp = NULL;
> + struct socket *sock = NULL;
> + struct sock *sk;
> +
> + port = dev->mp_port;
> + if (port == NULL)
> + return NOTIFY_DONE;
> +
> + switch (event) {
> + case NETDEV_UNREGISTER:
> + sock = dev->mp_port->sock;
> + mp = container_of(sock->sk, struct mp_sock, sk)->mp;
> + do_unbind(mp->mfile);
> + break;
> + case NETDEV_CHANGE:
> + sk = dev->mp_port->sock->sk;
> + sk->sk_state_change(sk);
> + break;
> + }
> + return NOTIFY_DONE;
> +}
> +
> +static struct notifier_block mp_notifier_block __read_mostly = {
> + .notifier_call = mp_device_event,
> +};
> +
> +static int mp_init(void)
> +{
> + int err = 0;
> +
> + ext_page_info_cache = kmem_cache_create("skb_page_info",
> + sizeof(struct page_info),
> + 0, SLAB_HWCACHE_ALIGN, NULL);
> + if (!ext_page_info_cache)
> + return -ENOMEM;
> +
> + err = misc_register(&mp_miscdev);
> + if (err) {
> + printk(KERN_ERR "mp: Can't register misc device\n");
> + kmem_cache_destroy(ext_page_info_cache);
> + } else {
> + printk(KERN_INFO "Registering mp misc device - minor = %d\n",
> + mp_miscdev.minor);
> + register_netdevice_notifier(&mp_notifier_block);
> + }
> + return err;
> +}
> +
> +void mp_exit(void)
> +{
> + unregister_netdevice_notifier(&mp_notifier_block);
> + misc_deregister(&mp_miscdev);
> + kmem_cache_destroy(ext_page_info_cache);
> +}
> +
> +/* Get an underlying socket object from mp file. Returns error unless file is
> + * attached to a device. The returned object works like a packet socket, it
> + * can be used for sock_sendmsg/sock_recvmsg. The caller is responsible for
> + * holding a reference to the file for as long as the socket is in use. */
> +struct socket *mp_get_socket(struct file *file)
> +{
> + struct mp_file *mfile = file->private_data;
> + struct mp_struct *mp;
> +
> + if (file->f_op != &mp_fops)
> + return ERR_PTR(-EINVAL);
> + mp = mp_get(mfile);
> + if (!mp)
> + return ERR_PTR(-EBADFD);
> + mp_put(mfile);
> + return &mp->socket;
> +}
> +EXPORT_SYMBOL_GPL(mp_get_socket);
> +
> +module_init(mp_init);
> +module_exit(mp_exit);
> +MODULE_AUTHOR(DRV_COPYRIGHT);
> +MODULE_DESCRIPTION(DRV_DESCRIPTION);
> +MODULE_LICENSE("GPL v2");
> --
> 1.7.3
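The page-count expression that appears in alloc_page_info() and mp_sendmsg() in the patch quoted above can be checked in isolation. What follows is only a minimal sketch, assuming 4 KiB pages; the EX_ macros and pages_spanned() are illustrative names and are not part of the patch.

```c
#include <stddef.h>

/* Assumption for illustration: 4 KiB pages (PAGE_SHIFT == 12). */
#define EX_PAGE_SHIFT 12UL
#define EX_PAGE_SIZE  (1UL << EX_PAGE_SHIFT)
#define EX_PAGE_MASK  (~(EX_PAGE_SIZE - 1))

/*
 * Number of pages touched by the byte range [base, base + len).
 * Same arithmetic as the patch: take the offset of base within its
 * page, add the length, and round up to a whole number of pages.
 * The patch skips iovecs with len == 0 before evaluating this, since
 * an unaligned base with len == 0 would still count one page.
 */
static unsigned long pages_spanned(unsigned long base, unsigned long len)
{
	return ((base & ~EX_PAGE_MASK) + len + ~EX_PAGE_MASK) >> EX_PAGE_SHIFT;
}
```

For instance, a 0x200-byte buffer starting at 0x1f00 crosses a page boundary and counts as two pages, while a page-aligned 0x1000-byte buffer counts as one.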
* Re: [PATCHv3 net-next-2.6 3/5] XFRM,IPv6: Add IRO src/dst address remapping XFRM types and i/o handlers
From: Arnaud Ebalard @ 2010-10-03 13:41 UTC (permalink / raw)
To: Herbert Xu; +Cc: David Miller, eric.dumazet, yoshfuji, netdev
In-Reply-To: <20101002103205.GA3879@gondor.apana.org.au>
Hi Herbert,
Herbert Xu <herbert@gondor.apana.org.au> writes:
> On Sat, Oct 02, 2010 at 12:17:35PM +0200, Arnaud Ebalard wrote:
>>
>> and I see no reason not to keep the lock we have on the state until the
>> end of the function when the state is valid (when we break), instead of
>> releasing it to get it again later. Something like the following would
>> allow removing the spin_lock()/spin_unlock() calls from all mip6 input
>> handlers (mip6_{destopt,rthdr,iro_src,iro_dst}_input()):
>
> No I moved the state lock down precisely because it should not
> be taken at a higher level as that breaks asynchronous IPsec
> processing and the fact that it isn't needed in most places.
>
> If your code needs it then you should take it rather than impose
> it on real IPsec users.
Understood. Note that I am on your side with this: my primary concern
while pushing the feature is *to not break or slow down standard IPsec*.
I do not expect my code to be accepted or even read otherwise.
As for the current point raised by David on the position of the locks in
my input handlers, they are based on the position of the locks in the
*existing* RH2 (mip6_rthdr_input()) and HAO (mip6_destopt_input())
handlers. As they serve the same purpose (src/dst address check against
state's address) and the code is basically the same, I have no reason to
do things differently from what is currently upstream.
After your reply, I took a (too long) look at the history of
xfrm6_input_addr() to understand why it is as it is. If it can spare you
some time, here is what I think happened:
- Initially (commit fbd9a5b4, Aug 23 2006), the checks on the status of
the state, the call to x->type->input() and the updates to the state's
processing stats (x->curlft changes) were *globally* protected by a
call to spin_lock(). The same day, a related commit (3d126890) added
support for the RH2/HAO input handlers, with no lock inside the
handlers. The content of xfrm6_input_addr() was:
    spin_lock(&x->lock);
    <...snip...>
    nh = x->type->input(x, skb);
    if (nh <= 0) {
            spin_unlock(&x->lock);
            xfrm_state_put(x);
            x = NULL;
            continue;
    }
    x->curlft.bytes += skb->len;
    x->curlft.packets++;
    spin_unlock(&x->lock);
- Then, as you wrote, the state lock was moved into all input handlers
(commit 0ebea8ef, Nov 13 2007), including the RH2/HAO ones:
    @@ -128,12 +128,15 @@ static int mip6_destopt_input(struct xfrm_state *x, struct sk_buff *skb)
     {
             struct ipv6hdr *iph = ipv6_hdr(skb);
             struct ipv6_destopt_hdr *destopt = (struct ipv6_destopt_hdr *)skb->data;
    +        int err = destopt->nexthdr;
    +        spin_lock(&x->lock);
             if (!ipv6_addr_equal(&iph->saddr, (struct in6_addr *)x->coaddr) &&
                 !ipv6_addr_any((struct in6_addr *)x->coaddr))
    -                return -ENOENT;
    +                err = -ENOENT;
    +        spin_unlock(&x->lock);
    -        return destopt->nexthdr;
    +        return err;
     }
With that commit, I think a deadlock was introduced in the MIPv6 code
because xfrm6_input_addr() was left unchanged, i.e. x->type->input()
was still called with the lock held. Am I correct?
- The code of xfrm6_input_addr() was then optimized by commit a002c6fd
in such a way that x->type->input() was then put outside the
protection of the lock, which (if I am not mistaken) removed the
deadlock:
    spin_lock(&x->lock);
    if ((!i || (x->props.flags & XFRM_STATE_WILDRECV)) &&
        likely(x->km.state == XFRM_STATE_VALID) &&
        !xfrm_state_check_expire(x)) {
            spin_unlock(&x->lock);
            if (x->type->input(x, skb) > 0) {
                    /* found a valid state */
                    break;
            }
    } else
            spin_unlock(&x->lock);
I don't know if this was intentional.
But the main question remains the position of the lock. Here,
checks are done on the status of the state, the lock is released,
reacquired in the input handler to do an additional check and then
released again, to be reacquired later in the function to update
the statistics. Is my reading of the code correct?
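The locking sequence described in the three steps above can be sketched in userspace. This is only an illustration under stated assumptions: struct state, handler_input() and the two caller functions are hypothetical stand-ins for struct xfrm_state, a mip6 input handler and xfrm6_input_addr(), and a pthread error-checking mutex replaces the kernel spinlock so that a relock from the same thread reports EDEADLK instead of hanging.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <pthread.h>

/* Hypothetical stand-in for struct xfrm_state (illustration only). */
struct state {
	pthread_mutex_t lock;	/* plays the role of x->lock */
	int valid;		/* plays the role of the XFRM_STATE_VALID check */
	int addr;		/* plays the role of x->coaddr */
	unsigned long packets;	/* plays the role of x->curlft.packets */
};

/* Initialise the state with an error-checking mutex. Returns 0 on success. */
static int state_init(struct state *x, int addr)
{
	pthread_mutexattr_t attr;
	int err = pthread_mutexattr_init(&attr);

	if (err)
		return err;
	err = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
	if (!err)
		err = pthread_mutex_init(&x->lock, &attr);
	pthread_mutexattr_destroy(&attr);
	x->valid = 1;
	x->addr = addr;
	x->packets = 0;
	return err;
}

/* Handler that takes the lock itself, as the handlers do after 0ebea8ef.
 * Returns 1 on address match, -ENOENT on mismatch, and a negative
 * pthread error (e.g. -EDEADLK) if the lock cannot be taken. */
static int handler_input(struct state *x, int pkt_addr)
{
	int ret, err = pthread_mutex_lock(&x->lock);

	if (err)
		return -err;
	ret = (x->addr == pkt_addr) ? 1 : -ENOENT;
	pthread_mutex_unlock(&x->lock);
	return ret;
}

/* Caller shape before a002c6fd: handler invoked with the lock held. */
static int input_locked_caller(struct state *x, int pkt_addr)
{
	int ret;

	pthread_mutex_lock(&x->lock);
	ret = x->valid ? handler_input(x, pkt_addr) : -EINVAL;
	pthread_mutex_unlock(&x->lock);
	return ret;
}

/* Caller shape after a002c6fd: check, release, call the handler,
 * then re-acquire the lock to update the statistics. */
static int input_unlocked_caller(struct state *x, int pkt_addr)
{
	int valid, ret;

	pthread_mutex_lock(&x->lock);
	valid = x->valid;
	pthread_mutex_unlock(&x->lock);
	if (!valid)
		return -EINVAL;

	ret = handler_input(x, pkt_addr);
	if (ret <= 0)
		return ret;

	pthread_mutex_lock(&x->lock);
	x->packets++;
	pthread_mutex_unlock(&x->lock);
	return ret;
}
```

With a state initialised through state_init(), input_locked_caller() fails with -EDEADLK because the handler tries to take a lock its caller already holds, while input_unlocked_caller() succeeds at the price of three separate lock/unlock pairs per packet. A real spinlock would spin forever in the first case rather than return an error.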
Herbert, you certainly have a better understanding of XFRM code than I
have and can probably tell if the locking behavior above is valid or
buggy. Yoshifuji-san, David or Eric may also have good ideas on that.
As a side note (I think I was not explicit enough in my previous email),
I think the possible changes to xfrm6_input_addr() and the MIPv6 handlers
we are discussing are not expected to impact standard IPsec code because
there are 2 different cases in which state input handlers are called
(i.e. x->type->input()):
- xfrm_input(): for the standard IPsec case (incl. async resumption).
This is only for esp, ah, ipcomp and tunneling.
- xfrm6_input_addr(): for MIPv6 extension headers, i.e. RH2 and HAO in
destopt.
and we are discussing the second.
David, as for my patches, if this is ok for you, I will keep the code of
my input handlers aligned with the code of the RH2/HAO handlers and will
modify it later based on any corrections made to those upstream.
Don't hesitate to slap me if I made some mistakes in my analysis ;-)
Cheers,
a+
* Re: [PATCH 8/8] net: Implement socketat.
From: jamal @ 2010-10-03 13:44 UTC (permalink / raw)
To: Daniel Lezcano
Cc: Pavel Emelyanov, Eric W. Biederman, linux-kernel,
Linux Containers, netdev, netfilter-devel, linux-fsdevel,
Linus Torvalds, Michael Kerrisk, Ulrich Drepper, Al Viro,
David Miller, Serge E. Hallyn, Pavel Emelyanov, Ben Greear,
Matt Helsley, Jonathan Corbet, Sukadev Bhattiprolu,
Jan Engelhardt, Patrick McHardy
In-Reply-To: <4CA7A07C.5030504@free.fr>
Hi Daniel,
Thanks for clarifying this ..
On Sat, 2010-10-02 at 23:13 +0200, Daniel Lezcano wrote:
> Just to clarify this point. You enter the namespace, create the socket
> and go back to the initial namespace (or create a new one). Further
> operations can be made against this fd because it is the network
> namespace stored in the sock struct which is used, not the current
> process network namespace which is used at the socket creation only.
>
> We can actually already do that by unsharing and then create a socket.
> This socket will pin the namespace and can be used as a control socket
> for the namespace (assuming the socket domain will be ok for all the
> operations).
>
> Jamal, I don't know what kind of application you want to use but if I
> assume you want to create a process controlling 1024 netns,
At the moment i am looking at 8K on a Nehalem with lots of RAM. They
will mostly be created at startup but some could be created afterwards.
Each will have its own netdevs etc. also created at startup (and some
other config that may happen later).
Because startup time may accumulate, it is clearly important to me
to pick whichever scheme reduces the number of calls...
> let's try to identify what happens with setns and with socketat:
>
> With setns:
>
> * open /proc/self/ns/net (1)
> * unshare the netns
> * open /proc/self/ns/net (2)
> * setns (1)
> * create a virtual network device
> * move the virtual device to (2) (using the set netns by fd)
> * unshare the netns
> ...
>
> With socketat:
>
> * open a socket (1)
> * unshare the netns
> * open a netlink with socketat(1) => (2)
> * create a virtual device using (2) (at this point it is
> init_net_ns)
> * move the virtual device to the current netns (using the set
> netns
> by pid)
> * open a socket (3)
> * unshare the netns
> ...
>
> We have the same number of file descriptors kept opened. Except, with
> setns we can bind mount the directory somewhere, that will pin the
> namespace and then we can close the /proc/self/ns/net file descriptors
> and reopen them later.
>
Ok, so a wrapper such as: create_socket_on(namespaceid)
will generally need fewer system calls with socketat().
> If your application has to do a lot of specific network processing,
> during its life cycle, in different namespaces, the socketat syscall
> will be better because it will reduce the number of syscalls but at
> the cost of keeping the file descriptors opened (potentially a big
> number). Otherwise, setns should fit your needs.
Makes sense.
One thing still confuses me...
The app control point is in namespace0. I still want to be able to
"boot" namespaces first and maybe a few seconds later do a socketat()...
and create devices, tcp sockets etc. I suspect create_ns(namespace-name)
would involve:
* open /proc/self/ns/net (namespace-name)
* unshare the netns
Is this correct?
cheers,
jamal
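
[Editor's sketch, not part of the original message] The create_socket_on() wrapper mentioned above could look roughly like this when built on setns(). It assumes the setns(fd, nstype) interface that glibc exposes today, which may differ from the patch under discussion in this thread; the parameter list of create_socket_on() is hypothetical. Each call costs an open, two setns calls, a socket call and a close, which is the overhead socketat() was meant to avoid.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical wrapper: create a socket inside the network namespace
 * referred to by target_ns_fd, then switch the caller back to the
 * namespace it started in.
 * Returns the socket fd, or -1 with errno set.
 */
int create_socket_on(int target_ns_fd, int domain, int type, int protocol)
{
	int saved_errno;
	int sock = -1;
	/* Remember where we are so that we can come back afterwards. */
	int home_fd = open("/proc/self/ns/net", O_RDONLY | O_CLOEXEC);

	if (home_fd < 0)
		return -1;

	if (setns(target_ns_fd, CLONE_NEWNET) == 0) {
		/* The socket is bound to the namespace current at creation. */
		sock = socket(domain, type, protocol);
		saved_errno = errno;
		/* Return home; if that fails, do not hand back the socket. */
		if (setns(home_fd, CLONE_NEWNET) != 0 && sock >= 0) {
			saved_errno = errno;
			close(sock);
			sock = -1;
		}
	} else {
		saved_errno = errno;
	}

	close(home_fd);
	errno = saved_errno;
	return sock;
}
```

The caller needs enough privilege to switch namespaces (CAP_SYS_ADMIN on current kernels); without it the function returns -1 with errno set by setns().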
^ permalink raw reply
* Re: [PATCHv3 net-next-2.6 3/5] XFRM,IPv6: Add IRO src/dst address remapping XFRM types and i/o handlers
From: Herbert Xu @ 2010-10-03 15:12 UTC (permalink / raw)
To: Arnaud Ebalard; +Cc: David Miller, eric.dumazet, yoshfuji, netdev
In-Reply-To: <87iq1j5r7z.fsf@small.ssi.corp>
On Sun, Oct 03, 2010 at 03:41:04PM +0200, Arnaud Ebalard wrote:
>
> After your reply, I took a (too long) look at the history of
> xfrm6_input_addr() to understand why it is as it is. If it can spare you
> some time, here is what I think happened:
...
> - Then, as you wrote, the state lock was moved in all input handlers
> (commit 0ebea8ef, Nov 13 2007), including RH2/HAO ones:
...
> With that commit, I think a deadlock was introduced in MIPv6 code
> because xfrm6_input_addr() was left unchanged, i.e. x->type->input()
> was called with the lock held. Am I correct?
>
> - The code of xfrm6_input_addr() was then optimized by commit a002c6fd
> in such a way that x->type->input() was then put outside the
> protection of the lock, which (if I am not mistaken) removed the
> deadlock:
...
> I don't know if this was intentional.
Indeed MIPv6 was completely out of action for three months and
nobody noticed :)
> But the main question remains on the position of the lock. Here,
> checks are done on the status of the state, lock is released,
> reacquired in the input handler to do additional check and then
> released again, to be reacquired later in the function to act on
> statistics. Is my reading of the code correct?
When I moved the lock down I chose the safest option and added
it to every single input function. So it may well be the case
that the lock isn't needed at all on the MIPv6 path.
Cheers,
--
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply
* Re: sysctl_{tcp,udp,sctp}_mem overflow on 16TB system.
From: Willy Tarreau @ 2010-10-03 16:43 UTC (permalink / raw)
To: Maciej Żenczykowski
Cc: Robin Holt, David S. Miller, Alexey Kuznetsov,
Pekka Savola (ipv6), James Morris, Hideaki YOSHIFUJI,
Patrick McHardy, Vlad Yasevich, Sridhar Samudrala, linux-kernel,
netdev, linux-decnet-user, linux-sctp
In-Reply-To: <AANLkTin5wPvFQDFrupqGs_Jbh1rgrTjMupbPUFnvrBrv@mail.gmail.com>
On Sun, Oct 03, 2010 at 01:20:32AM -0700, Maciej Żenczykowski wrote:
> Isn't INT_MAX/2 just 1GB, which is only ~0.9 seconds at 10 Gbps?
no, the size is in pages (and is often wrong when people change it
from the default value).
Regards,
Willy
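
[Editor's sketch, not part of the original message] The unit mix-up above is easy to check with a little arithmetic. The helpers below are illustrative only; the 4096-byte page size is an assumption (it varies by architecture and configuration), and the function names are made up for this example.

```c
#include <limits.h>
#include <stdint.h>

/* Convert a limit expressed in pages into bytes. */
uint64_t pages_to_bytes(uint64_t pages, uint64_t page_size)
{
	return pages * page_size;
}

/* Seconds needed to move `bytes` over a link running at `bits_per_sec`. */
double seconds_at_rate(uint64_t bytes, double bits_per_sec)
{
	return (double)bytes * 8.0 / bits_per_sec;
}
```

With 4096-byte pages, pages_to_bytes(INT_MAX / 2, 4096) is just under 4 TiB rather than 1 GiB, and seconds_at_rate() on that value at 10 Gbps gives close to an hour rather than under a second.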
^ permalink raw reply
* [GIT] Networking
From: David Miller @ 2010-10-03 18:41 UTC (permalink / raw)
To: torvalds; +Cc: akpm, netdev, linux-kernel
1) The UM duplicate field regression fix, from Boaz Harrosh.
2) Max SYN retry sysctl doesn't end up generating the right number of
SYN retransmits due to bad calculations in retransmits_timed_out().
Fix from Damian Lukowski.
3) We erroneously drop VLAN packets when device is in promiscuous mode,
fix from Eric Dumazet.
4) Fix ip_gre Kconfig dependencies wrt. ipv6.
5) Phonet uses stale protocol header pointer after pskb_may_pull() call,
fix from Kumar Sanghvi.
6) Use after free fix in mac80211 from Johannes Berg.
7) Fix work queueing in Intel wireless, from Florian Mickler.
Please pull, thanks a lot!
The following changes since commit c6ea21e35bf3691cad59647c771e6606067f627d:
Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6 (2010-10-01 15:03:37 -0700)
are available in the git repository at:
master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6.git master
Boaz Harrosh (1):
um: Proper Fix for f25c80a4: remove duplicate structure field initialization
Damian Lukowski (1):
net-2.6: SYN retransmits: Add new parameter to retransmits_timed_out()
David S. Miller (2):
ip_gre: Fix dependencies wrt. ipv6.
Merge branch 'master' of git://git.kernel.org/.../linville/wireless-2.6
Eric Dumazet (1):
vlan: dont drop packets from unknown vlans in promiscuous mode
Florian Mickler (1):
iwl3945: queue the right work if the scan needs to be aborted
Johannes Berg (1):
mac80211: fix use-after-free
Kumar Sanghvi (1):
Phonet: Correct header retrieval after pskb_may_pull
arch/um/drivers/net_kern.c | 17 +++--------------
drivers/net/wireless/iwlwifi/iwl-agn-lib.c | 2 +-
drivers/net/wireless/iwlwifi/iwl3945-base.c | 2 +-
net/8021q/vlan_core.c | 14 ++++++++++----
net/ipv4/Kconfig | 1 +
net/ipv4/tcp_timer.c | 24 ++++++++++++++----------
net/mac80211/rx.c | 4 ----
net/phonet/pep.c | 3 ++-
8 files changed, 32 insertions(+), 35 deletions(-)
^ permalink raw reply
* [PATCH] tms380tr: fix long delays in initialization
From: Meelis Roos @ 2010-10-03 19:34 UTC (permalink / raw)
To: netdev
The tms380tr driver uses udelay() (a busy loop) for several half-second
delays during hardware initialization. Such overly long busy-wait
requests end up producing no delay at all, so driver initialization
fails without waiting. Fix it by using msleep() for the long delays and
keeping udelay() for the short ones.
Signed-off-by: Meelis Roos <mroos@linux.ee>
---
drivers/net/tokenring/tms380tr.c | 32 ++++++++++++++++----------------
1 files changed, 16 insertions(+), 16 deletions(-)
diff --git a/drivers/net/tokenring/tms380tr.c b/drivers/net/tokenring/tms380tr.c
index aaec637..e6dc2df 100644
--- a/drivers/net/tokenring/tms380tr.c
+++ b/drivers/net/tokenring/tms380tr.c
@@ -177,6 +177,7 @@ static void tms380tr_update_rcv_stats(struct net_local *tp,
unsigned char DataPtr[], unsigned int Length);
/* "W" */
void tms380tr_wait(unsigned long time);
+static void tms380tr_wait_long(unsigned long time);
static void tms380tr_write_rpl_status(RPL *rpl, unsigned int Status);
static void tms380tr_write_tpl_status(TPL *tpl, unsigned int Status);
@@ -1216,20 +1217,19 @@ static void tms380tr_set_multicast_list(struct net_device *dev)
}
/*
- * Wait for some time (microseconds)
+ * Wait for some time (microseconds) - busy wait
*/
void tms380tr_wait(unsigned long time)
{
-#if 0
- long tmp;
-
- tmp = jiffies + time/(1000000/HZ);
- do {
- tmp = schedule_timeout_interruptible(tmp);
- } while(time_after(tmp, jiffies));
-#else
udelay(time);
-#endif
+}
+
+/*
+ * Wait for long time (microseconds) - schedule
+ */
+static void tms380tr_wait_long(unsigned long time)
+{
+ msleep(time / 1000);
}
/*
@@ -1352,9 +1352,9 @@ static int tms380tr_bringup_diags(struct net_device *dev)
int loop_cnt, retry_cnt;
unsigned short Status;
- tms380tr_wait(HALF_SECOND);
+ tms380tr_wait_long(HALF_SECOND);
tms380tr_exec_sifcmd(dev, EXEC_SOFT_RESET);
- tms380tr_wait(HALF_SECOND);
+ tms380tr_wait_long(HALF_SECOND);
retry_cnt = BUD_MAX_RETRIES; /* maximal number of retrys */
@@ -1365,7 +1365,7 @@ static int tms380tr_bringup_diags(struct net_device *dev)
loop_cnt = BUD_MAX_LOOPCNT; /* maximum: three seconds*/
do { /* Inspect BUD results */
loop_cnt--;
- tms380tr_wait(HALF_SECOND);
+ tms380tr_wait_long(HALF_SECOND);
Status = SIFREADW(SIFSTS);
Status &= STS_MASK;
@@ -1384,7 +1384,7 @@ static int tms380tr_bringup_diags(struct net_device *dev)
printk(KERN_INFO "%s: Adapter Software Reset.\n",
dev->name);
tms380tr_exec_sifcmd(dev, EXEC_SOFT_RESET);
- tms380tr_wait(HALF_SECOND);
+ tms380tr_wait_long(HALF_SECOND);
}
} while(retry_cnt > 0);
@@ -1457,7 +1457,7 @@ static int tms380tr_init_adapter(struct net_device *dev)
do {
Status = 0;
loop_cnt--;
- tms380tr_wait(HALF_SECOND);
+ tms380tr_wait_long(HALF_SECOND);
/* Mask interesting status bits */
Status = SIFREADW(SIFSTS);
@@ -1506,7 +1506,7 @@ static int tms380tr_init_adapter(struct net_device *dev)
{
/* Reset adapter and try init again */
tms380tr_exec_sifcmd(dev, EXEC_SOFT_RESET);
- tms380tr_wait(HALF_SECOND);
+ tms380tr_wait_long(HALF_SECOND);
}
}
}
--
1.7.1
--
Meelis Roos (mroos@linux.ee)
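
[Editor's sketch, not part of the original message] The split between a busy wait for short delays and a sleeping wait for long ones can be illustrated in plain C. The 1000-microsecond cut-off and all names below are hypothetical and are not taken from the driver. Unlike the patch, which computes time / 1000, this sketch rounds up so that a sub-millisecond request never becomes a zero-length sleep.

```c
/* Hypothetical cut-off between busy waiting and sleeping, in microseconds. */
#define BUSY_WAIT_LIMIT_US 1000UL

enum wait_kind { WAIT_BUSY, WAIT_SLEEP };

/* Pick the waiting strategy for a delay given in microseconds. */
enum wait_kind choose_wait(unsigned long usec)
{
	return usec <= BUSY_WAIT_LIMIT_US ? WAIT_BUSY : WAIT_SLEEP;
}

/* Convert microseconds to whole milliseconds, rounding up. */
unsigned long usec_to_msec_round_up(unsigned long usec)
{
	return usec / 1000 + (usec % 1000 != 0);
}
```

A half-second delay of 500000 microseconds lands on the sleeping side and converts to 500 milliseconds.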
^ permalink raw reply related
* [PATCH] atmtcp: add sysfs attr for changing atm carrier signal state.
From: Karl Hiramoto @ 2010-10-03 19:51 UTC (permalink / raw)
To: netdev, linux-atm-general; +Cc: chas, davem, Karl Hiramoto
This will be used for device testing carrier signal changes in other parts of
the atm stack.
Carrier lost set by:
echo -n 0 > /sys/class/atm/atmtcp0/carrier
Carrier detected:
echo -n 1 > /sys/class/atm/atmtcp0/carrier
Signed-off-by: Karl Hiramoto <karl@hiramoto.org>
---
drivers/atm/atmtcp.c | 59 ++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 59 insertions(+), 0 deletions(-)
diff --git a/drivers/atm/atmtcp.c b/drivers/atm/atmtcp.c
index b910181..7540e33 100644
--- a/drivers/atm/atmtcp.c
+++ b/drivers/atm/atmtcp.c
@@ -210,6 +210,18 @@ static int atmtcp_v_send(struct atm_vcc *vcc,struct sk_buff *skb)
atomic_inc(&vcc->stats->tx_err);
return -ENOLINK;
}
+
+ if (vcc->dev->signal == ATM_PHY_SIG_LOST) {
+ pr_warning(DEV_LABEL ": Dropping TX pkt, upper layer not handling carrier signal lost\n");
+ if (vcc->pop)
+ vcc->pop(vcc, skb);
+ else
+ dev_kfree_skb(skb);
+
+ atomic_inc(&vcc->stats->tx_err);
+ return -ENOLINK;
+ }
+
size = skb->len+sizeof(struct atmtcp_hdr);
new_skb = atm_alloc_charge(out_vcc,size,GFP_ATOMIC);
if (!new_skb) {
@@ -304,6 +316,12 @@ static int atmtcp_c_send(struct atm_vcc *vcc,struct sk_buff *skb)
atomic_inc(&vcc->stats->tx_err);
goto done;
}
+
+ if (out_vcc->dev->signal == ATM_PHY_SIG_LOST) {
+ pr_debug(DEV_LABEL ": Dropping RX pkt while no carrier signal\n");
+ result = -ENOLINK;
+ goto done;
+ }
skb_pull(skb,sizeof(struct atmtcp_hdr));
new_skb = atm_alloc_charge(out_vcc,skb->len,GFP_KERNEL);
if (!new_skb) {
@@ -356,6 +374,43 @@ static struct atm_dev atmtcp_control_dev = {
.lock = __SPIN_LOCK_UNLOCKED(atmtcp_control_dev.lock)
};
+static ssize_t __set_signal(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t len)
+{
+ struct atm_dev *atm_dev = container_of(dev, struct atm_dev, class_dev);
+ int signal;
+
+ if (sscanf(buf, "%d", &signal) == 1) {
+
+ if (signal < ATM_PHY_SIG_LOST || signal > ATM_PHY_SIG_FOUND)
+ signal = ATM_PHY_SIG_UNKNOWN;
+
+ atm_dev_signal_change(atm_dev, signal);
+ return 1;
+ }
+ return -EINVAL;
+}
+
+static ssize_t __show_signal(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct atm_dev *atm_dev = container_of(dev, struct atm_dev, class_dev);
+ return sprintf(buf, "%d\n", atm_dev->signal);
+}
+
+static DEVICE_ATTR(signal, 0644, __show_signal, __set_signal);
+
+static struct attribute *atmtcp_attrs[] = {
+ &dev_attr_signal.attr,
+ NULL
+};
+
+static struct attribute_group atmtcp_group_attrs = {
+ .name = NULL, /* We want them in dev's root folder */
+ .attrs = atmtcp_attrs
+};
+
static int atmtcp_create(int itf,int persist,struct atm_dev **result)
{
@@ -376,6 +431,10 @@ static int atmtcp_create(int itf,int persist,struct atm_dev **result)
dev->dev_data = dev_data;
PRIV(dev)->vcc = NULL;
PRIV(dev)->persist = persist;
+
+ if (sysfs_create_group(&dev->class_dev.kobj, &atmtcp_group_attrs))
+ dev_err(&dev->class_dev, "Could not register sysfs attrs for atmtcp\n");
+
if (result) *result = dev;
return 0;
}
--
1.7.2.2
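
[Editor's sketch, not part of the original message] The parsing and clamping done by __set_signal() can be tried out in userspace. The enum values below are assumed to mirror ATM_PHY_SIG_LOST, ATM_PHY_SIG_UNKNOWN and ATM_PHY_SIG_FOUND and should be checked against the kernel headers; parse_signal() is a made-up name. One deliberate difference: sysfs store handlers conventionally return the number of bytes consumed, so this sketch returns len where the patch returns 1. The two agree for "echo -n 0" but differ once a trailing newline is written.

```c
#include <stddef.h>
#include <stdio.h>

/* Assumed to mirror the kernel's ATM_PHY_SIG_* values. */
enum { SIG_LOST = 0, SIG_UNKNOWN = 1, SIG_FOUND = 2 };

/*
 * Parse a decimal signal state from buf, mapping out-of-range values to
 * SIG_UNKNOWN as the patch does.
 * Returns the number of bytes consumed, or -1 on a parse error (the
 * patch returns -EINVAL in that case).
 */
long parse_signal(const char *buf, size_t len, int *out)
{
	int signal;

	if (sscanf(buf, "%d", &signal) != 1)
		return -1;

	if (signal < SIG_LOST || signal > SIG_FOUND)
		signal = SIG_UNKNOWN;

	*out = signal;
	return (long)len;
}
```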
^ permalink raw reply related
* Re: [PATCH] atmtcp: add sysfs attr for changing atm carrier signal state.
From: David Miller @ 2010-10-03 20:00 UTC (permalink / raw)
To: karl; +Cc: netdev, linux-atm-general, chas
In-Reply-To: <1286135485-27355-1-git-send-email-karl@hiramoto.org>
From: Karl Hiramoto <karl@hiramoto.org>
Date: Sun, 3 Oct 2010 21:51:25 +0200
> This will be used for device testing carrier signal changes in other parts of
> the atm stack.
>
> Carrier lost set by:
> echo -n 0 > /sys/class/atm/atmtcp0/carrier
>
> Carrier detected:
> echo -n 1 > /sys/class/atm/atmtcp0/carrier
>
> Signed-off-by: Karl Hiramoto <karl@hiramoto.org>
We already have enough sysfs hacky knobs.
And for something as fundamental as setting the carrier status, it's
not acceptable to add a device-type specific control.
Use something we have already, either via core device state management
(via netlink or ifconfig's ioctls) or ethtool.
Thanks.
^ permalink raw reply
* 2.6.36-rc6-git2: Reported regressions from 2.6.35
From: Rafael J. Wysocki @ 2010-10-03 21:15 UTC (permalink / raw)
To: Linux Kernel Mailing List
Cc: Linux SCSI List, Linux ACPI, Network Development,
Linux Wireless List, DRI, Florian Mickler, Andrew Morton,
Kernel Testers List, Linus Torvalds, Linux PM List,
Maciej Rutecki
This message contains a list of some regressions from 2.6.35,
for which there are no fixes in the mainline known to the tracking team.
If any of them have been fixed already, please let us know.
If you know of any other unresolved regressions from 2.6.35, please let us
know as well and we'll add them to the list. Also, please let us know
if any of the entries below are invalid.
Each entry from the list will be sent additionally in an automatic reply
to this message with CCs to the people involved in reporting and handling
the issue.
Listed regressions statistics:
  Date          Total  Pending  Unresolved
  ----------------------------------------
  2010-10-03       52       16          14
  2010-09-26       46       15          13
  2010-09-20       38       15          15
  2010-09-12       28       14          13
  2010-08-30       21       16          15
Unresolved regressions
----------------------
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=19642
Subject : 2.6.36-rc6 BUG at drivers/scsi/scsi_lib.c:1113
Submitter : George Spelvin <linux@horizon.com>
Date : 2010-09-30 21:10 (4 days old)
Message-ID : <20100930211006.27449.qmail@science.horizon.com>
References : http://marc.info/?l=linux-kernel&m=128588102620299&w=2
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=19632
Subject : 2.6.36-rc6: modprobe Not tainted warning
Submitter : Heinz Diehl <htd@fritha.org>
Date : 2010-09-30 18:25 (4 days old)
Message-ID : <20100930182516.GA15089@fritha.org>
References : http://marc.info/?l=linux-kernel&m=128587114004680&w=2
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=19392
Subject : WARNING: at drivers/net/wireless/ath/ath5k/base.c:3475 ath5k_bss_info_changed+0x44/0x168 [ath5k]()
Submitter : Justin Mattock <justinmattock@gmail.com>
Date : 2010-09-28 22:30 (6 days old)
Message-ID : <AANLkTim5WCGKPvEkOkO_YnMF9pg8mvLfQoFBNUFpfa_k@mail.gmail.com>
References : http://marc.info/?l=linux-kernel&m=128571307018635&w=2
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=19372
Subject : 2.6.36-rc6: WARNING: at drivers/gpu/drm/radeon/radeon_fence.c:235 radeon_fence_wait+0x35a/0x3c0
Submitter : Alexey Dobriyan <adobriyan@gmail.com>
Date : 2010-09-29 21:29 (5 days old)
Message-ID : <20100929212923.GA5578@core2.telecom.by>
References : http://marc.info/?l=linux-kernel&m=128579579400315&w=2
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=19142
Subject : Screen flickers when switching from the console to X
Submitter : Andrey Rahmatullin <wrar@altlinux.org>
Date : 2010-09-27 12:05 (7 days old)
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=19072
Subject : [2.6.36-rc regression] occasional complete system hangs on sparc64 SMP
Submitter : Mikael Pettersson <mikpe@it.uu.se>
Date : 2010-09-23 17:02 (11 days old)
Message-ID : <19611.34846.813757.309183@pilspetsen.it.uu.se>
References : http://marc.info/?l=linux-kernel&m=128526136531048&w=2
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=19062
Subject : Dirtiable inode bdi default != sb bdi btrfs
Submitter : Cesar Eduardo Barros <cesarb@cesarb.net>
Date : 2010-09-23 0:54 (11 days old)
Message-ID : <4C9AA546.6050201@cesarb.net>
References : http://marc.info/?l=linux-kernel&m=128520328929595&w=2
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=19052
Subject : 2.6.36-rc5-git1 -- [drm:i915_report_and_clear_eir] *ERROR* EIR stuck: 0x00000010, masking
Submitter : Miles Lane <miles.lane@gmail.com>
Date : 2010-09-22 23:47 (12 days old)
Message-ID : <AANLkTikWQjUQjFJU9MO1+XbSLAEE-GARz+S+Dz2Fgu4h@mail.gmail.com>
References : http://marc.info/?l=linux-kernel&m=128519926626322&w=2
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=19002
Subject : Radeon rv730 AGP/KMS/DRM kernel lockup
Submitter : Duncan <1i5t5.duncan@cox.net>
Date : 2010-09-23 16:48 (11 days old)
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=17361
Subject : Watchdog detected hard LOCKUP in jbd2_journal_get_write_access
Submitter : Christian Casteyde <casteyde.christian@free.fr>
Date : 2010-08-29 19:59 (36 days old)
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=17121
Subject : Two blank rectangles more than 10 cm long when booting
Submitter : Eric Valette <eric.valette@free.fr>
Date : 2010-08-26 17:24 (39 days old)
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=17061
Subject : 2.6.36-rc1 on zaurus: bluetooth regression
Submitter : Pavel Machek <pavel@ucw.cz>
Date : 2010-08-21 15:24 (44 days old)
Message-ID : <20100821152445.GA1536@ucw.cz>
References : http://marc.info/?l=linux-kernel&m=128240433828087&w=2
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=16971
Subject : qla4xxx compile failure on 32-bit PowerPC: missing readq and writeq
Submitter : Meelis Roos <mroos@linux.ee>
Date : 2010-08-19 21:03 (46 days old)
Message-ID : <alpine.SOC.1.00.1008192359310.19654@math.ut.ee>
References : http://marc.info/?l=linux-kernel&m=128225184900892&w=2
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=16951
Subject : hackbench regression with 2.6.36-rc1
Submitter : Zhang, Yanmin <yanmin_zhang@linux.intel.com>
Date : 2010-08-18 6:18 (47 days old)
Message-ID : <1282112318.21202.8.camel@ymzhang.sh.intel.com>
References : http://marc.info/?l=linux-kernel&m=128211235904910&w=2
Regressions with patches
------------------------
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=18742
Subject : PROBLEM: Kernel panic on 2.6.36-rc4 when loading intel_ips on Core i3 laptop
Submitter : infernix <infernix@infernix.net>
Date : 2010-09-15 14:35 (19 days old)
Message-ID : <4C90D998.6050103@infernix.net>
References : http://marc.info/?l=linux-kernel&m=128456187928496&w=2
Handled-By : Jesse Barnes <jbarnes@virtuousgeek.org>
Patch : https://bugzilla.kernel.org/attachment.cgi?id=31112
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=17722
Subject : 2.6.36-rc3: WARNING: at net/mac80211/scan.c:269 ieee80211_scan_completed
Submitter : Thomas Meyer <thomas@m3y3r.de>
Date : 2010-08-31 20:14 (34 days old)
Message-ID : <201008312214.52473.thomas@m3y3r.de>
References : http://marc.info/?l=linux-kernel&m=128328580504227&w=2
http://www.spinics.net/lists/netdev/msg140769.html
Handled-By : Florian Mickler <florian@mickler.org>
Patch : https://bugzilla.kernel.org/attachment.cgi?id=31671
For details, please visit the bug entries and follow the links given in
references.
As you can see, there is a Bugzilla entry for each of the listed regressions.
There also is a Bugzilla entry used for tracking the regressions from 2.6.35,
unresolved as well as resolved, at:
http://bugzilla.kernel.org/show_bug.cgi?id=16444
Please let the tracking team know if there are any Bugzilla entries that
should be added to the list in there.
Thanks!
^ permalink raw reply
* Re: [PATCHv3 net-next-2.6 3/5] XFRM,IPv6: Add IRO src/dst address remapping XFRM types and i/o handlers
From: Arnaud Ebalard @ 2010-10-03 21:25 UTC (permalink / raw)
To: Herbert Xu; +Cc: David Miller, eric.dumazet, yoshfuji, netdev
In-Reply-To: <20101003151202.GA11963@gondor.apana.org.au>
Hello,
Herbert Xu <herbert@gondor.apana.org.au> writes:
> On Sun, Oct 03, 2010 at 03:41:04PM +0200, Arnaud Ebalard wrote:
>>
>> After your reply, I took a (too long) look at the history of
>> xfrm6_input_addr() to understand why it is as it is. If it can spare you
>> some time, here is what I think happened:
>
> ...
>
>> - Then, as you wrote, the state lock was moved in all input handlers
>> (commit 0ebea8ef, Nov 13 2007), including RH2/HAO ones:
>
> ...
>
>> With that commit, I think a deadlock was introduced in MIPv6 code
>> because xfrm6_input_addr() was left unchanged, i.e. x->type->input()
>> was called with the lock held. Am I correct?
>>
>> - The code of xfrm6_input_addr() was then optimized by commit a002c6fd
>> in such a way that x->type->input() was then put outside the
>> protection of the lock, which (if I am not mistaken) removed the
>> deadlock:
>
> ...
>
>> I don't know if this was intentional.
>
> Indeed MIPv6 was completely out of action for three months and
> nobody noticed :)
hehe ;-) Just to correct a missing waypoint in my history, which is in
fact the real fix for the deadlock:
commit 9473e1f631de339c50bde1e3bd09e1045fe90fd5
Author: Masahide NAKAMURA <nakam@linux-ipv6.org>
Date: Thu Dec 20 20:41:57 2007 -0800
[XFRM] MIPv6: Fix to input RO state correctly.
Disable spin_lock during xfrm_type.input() function.
Follow design as IPsec inbound does.
Signed-off-by: Masahide NAKAMURA <nakam@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
>> But the main question remains on the position of the lock. Here,
>> checks are done on the status of the state, lock is released,
>> reacquired in the input handler to do additional check and then
>> released again, to be reacquired later in the function to act on
>> statistics. Is my reading of the code correct?
>
> When I moved the lock down I chose the safest option and added
> it to every single input function. So it may well be the case
> that the lock isn't needed at all on the MIPv6 path.
I don't have any technical argument to support the removal of the locks,
i.e. I don't see what would prevent changes during the check. I will try
to spend more time on it, but meanwhile I think it's safe to keep
things the way they are.
Thanks for your time, Herbert.
Cheers,
a+
^ permalink raw reply