* Re: [PATCH net-next-2.6] net: speedup udp receive path
From: Eric Dumazet @ 2010-04-29 13:21 UTC (permalink / raw)
To: hadi
Cc: Changli Gao, David Miller, therbert, shemminger, netdev,
Eilon Greenstein, Brian Bloniarz
In-Reply-To: <1272547061.4258.174.camel@bigi>
On Thursday 29 April 2010 at 09:17 -0400, jamal wrote:
> Could we have some stat in there that shows IPIs being produced? I think
> it would help to at least observe any changes over variety of tests.
> I did try to patch my system during the first few tests to record IPIs
> but it seems to make more sense to have it as a perf stat.
>
> > Even with 200.000 IPI per second, 'perf top -C CPU_IPI_sender' shows
> > that sending IPI is very cheap (maybe ~1% of cpu cycles)
> >
> > # Samples: 32033467127
> > #
>
> One thing i observed is our profiles seem different. Could you send me
> your .config for a single nehalem and i will try to go as close as
> possible to it? I have a sky2 instead of bnx - but i suspect everything
> else will be very similar...
> I apologize i dont have much time to look into details - but what i can
> do is test at least.
I am going to redo some tests on my 'old machine', with the tg3 driver.
You could try the following program:
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>

/* Two snapshots of the per-column sums, so consecutive reads can be diffed. */
struct softnet_stat_vals {
	int flip;		/* index of the most recent snapshot */
	unsigned int tab[2][10];
};

/* Sum each of the first 10 columns of /proc/net/softnet_stat over all lines. */
int read_file(struct softnet_stat_vals *v)
{
	char buffer[1024];
	FILE *F = fopen("/proc/net/softnet_stat", "r");

	v->flip ^= 1;
	if (!F)
		return -1;

	memset(v->tab[v->flip], 0, 10 * sizeof(unsigned int));
	while (fgets(buffer, sizeof(buffer), F)) {
		int i, pos = 0;
		unsigned int val;

		/* Each field is 8 hex digits followed by one separator. */
		for (i = 0; ;) {
			if (sscanf(buffer + pos, "%08x", &val) != 1)
				break;
			v->tab[v->flip][i] += val;
			pos += 9;
			if (++i == 10)
				break;
		}
	}
	fclose(F);
	return 0;
}

int main(int argc, char *argv[])
{
	struct softnet_stat_vals *v = calloc(1, sizeof(struct softnet_stat_vals));

	if (!v)
		return 1;
	read_file(v);
	for (;;) {
		sleep(1);
		read_file(v);
		/* Print how much column 9 grew during the last second. */
		printf("%u rps\n", v->tab[v->flip][9] - v->tab[v->flip^1][9]);
	}
}
* Re: [PATCH] tcp: SO_TIMESTAMP implementation for TCP
From: David Miller @ 2010-04-30 6:39 UTC (permalink / raw)
To: therbert; +Cc: netdev
In-Reply-To: <alpine.DEB.1.00.1004292246440.12776@pokey.mtv.corp.google.com>
From: Tom Herbert <therbert@google.com>
Date: Thu, 29 Apr 2010 23:07:54 -0700 (PDT)
> Implement SO_TIMESTAMP{NS} for TCP. When this socket option is enabled
> on a TCP socket, a timestamp for received data can be returned in the
> ancillary data of a recvmsg with control message type SCM_TIMESTAMP{NS}.
> The timestamp chosen is that of the skb most recently received from
> which data was copied. This is useful in debugging and timing
> network operations.
>
> Signed-off-by: Tom Herbert <therbert@google.com>
That's not what you're implementing here.
You're only updating the socket timestamp if the SKB passed into
the update function has a more recent timestamp.
There is nothing that says the timestamps have to be increasing and
with retransmits and such if it were me I would want to see the real
timestamp even if it was earlier than the most recently reported
timestamp.
I don't know, I really don't like this feature at all. SO_TIMESTAMP
is really meant for datagram oriented sockets, where there is a
clearly defined "packet" whose timestamp you get. A TCP receive can
involve hundreds of tiny packets so the timestamp can lack any
meaning.
All these new checks and branches for a feature of questionable value.
If you can modify your apps to grab this information you can also probe
for the information using external probing tools.
Sorry, I don't think I'll be applying this.
* Re: [PATCH 1/3] ptp: Added a brand new class driver for ptp clocks.
From: Wolfgang Grandegger @ 2010-04-30 17:58 UTC (permalink / raw)
To: Richard Cochran; +Cc: netdev
In-Reply-To: <20100429091936.GA6703@riccoc20.at.omicron.at>
Hi Richard,
Richard Cochran wrote:
> This patch adds an infrastructure for hardware clocks that implement
> IEEE 1588, the Precision Time Protocol (PTP). A class driver offers a
> registration method to particular hardware clock drivers. Each clock is
> exposed to user space as a character device with ioctls that allow tuning
> of the PTP clock.
>
> Signed-off-by: Richard Cochran <richard.cochran@omicron.at>
> ---
> Documentation/ptp/ptp.txt | 78 ++++++++++
> Documentation/ptp/testptp.c | 130 ++++++++++++++++
> Documentation/ptp/testptp.mk | 33 ++++
> drivers/Kconfig | 2 +
> drivers/Makefile | 1 +
> drivers/ptp/Kconfig | 26 ++++
> drivers/ptp/Makefile | 5 +
> drivers/ptp/ptp_clock.c | 302 ++++++++++++++++++++++++++++++++++++++
> include/linux/Kbuild | 1 +
> include/linux/ptp_clock.h | 37 +++++
ptp_clock.h should probably be added to "include/linux/Kbuild".
Wolfgang.
* Re: ixgbe and mac-vlans problem
From: Arnd Bergmann @ 2010-04-30 18:00 UTC (permalink / raw)
To: Ben Greear; +Cc: NetDev, Patrick McHardy
In-Reply-To: <4BDA07DB.8020206@candelatech.com>
On Friday 30 April 2010 00:27:39 Ben Greear wrote:
> Basically, we create 50 mac-vlans, with sequential MAC addresses and sequential
> IP addresses, and set up ip rules properly.
>
> The issue is that only 10 or so of the mac-vlans receive other than
> broadcast packets. The ixgbe NIC doesn't show PROMISC mode.
I just took a brief look at the driver and noticed that 82599 should
be able to handle 128 entries before going into promisc mode, while
82598 (the same driver) does 16.
Maybe the logic for >16 entries is wrong, so you could try forcing
hw->mac.num_rar_entries to 16 for 82599 as well.
Arnd
* Re: [PATCH linux-next v2 1/2] irq: Add CPU mask affinity hint
From: Peter P Waskiewicz Jr @ 2010-04-30 18:02 UTC (permalink / raw)
To: Thomas Gleixner
Cc: davem@davemloft.net, arjan@linux.jf.intel.com,
netdev@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <alpine.LFD.2.00.1004301249540.2951@localhost.localdomain>
On Fri, 30 Apr 2010, Thomas Gleixner wrote:
> On Fri, 30 Apr 2010, Peter P Waskiewicz Jr wrote:
>
>> This patch adds a cpumask affinity hint to the irq_desc
>> structure, along with a registration function and a read-only
>> proc entry for each interrupt.
>>
>> This affinity_hint handle for each interrupt can be used by
>> underlying drivers that need a better mechanism to control
>> interrupt affinity. The underlying driver can register a
>> cpumask for the interrupt, which will allow the driver to
>> provide the CPU mask for the interrupt to anything that
>> requests it. The intent is to extend the userspace daemon,
>> irqbalance, to help hint to it a preferred CPU mask to balance
>> the interrupt into.
>>
>> Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
>> ---
>>
>> include/linux/interrupt.h | 13 +++++++++++++
>> include/linux/irq.h | 1 +
>> kernel/irq/manage.c | 28 ++++++++++++++++++++++++++++
>> kernel/irq/proc.c | 33 +++++++++++++++++++++++++++++++++
>> 4 files changed, 75 insertions(+), 0 deletions(-)
>>
>> diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
>> index 75f3f00..9c9ea2a 100644
>> --- a/include/linux/interrupt.h
>> +++ b/include/linux/interrupt.h
>> @@ -209,6 +209,9 @@ extern int irq_set_affinity(unsigned int irq, const struct cpumask *cpumask);
>> extern int irq_can_set_affinity(unsigned int irq);
>> extern int irq_select_affinity(unsigned int irq);
>>
>> +extern int irq_register_affinity_hint(unsigned int irq,
>> + const struct cpumask *m);
>
> I think we can do with a single function irq_set_affinity_hint() and
> let the caller set the pointer to NULL.
Ok, I've been running into some issues. If CONFIG_CPUMASK_OFFSTACK is not
set, then cpumask_var_t structs are single-element arrays that cannot be
NULL'd out. I'm pretty sure I need to keep the unregister part of the
API. Thoughts?
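To make the constraint concrete, here is a userspace sketch, assuming nothing about the final patch. All mock_* names are hypothetical stand-ins, not kernel types. It mirrors the one-element-array form of cpumask_var_t and shows one possible way out: store a plain pointer in the descriptor, so that NULL can mean "no hint" and a single setter covers both register and unregister.

```c
#include <stddef.h>

/* Stand-in for struct cpumask: one bit per CPU (hypothetical, userspace only). */
struct mock_cpumask {
    unsigned long bits;
};

/*
 * Mirrors the !CONFIG_CPUMASK_OFFSTACK case: the "var" type is a
 * one-element array. An array cannot be assigned NULL, so "no hint"
 * cannot be encoded in a member of this type.
 */
typedef struct mock_cpumask mock_cpumask_var_t[1];

/* Storing a plain pointer instead lets NULL mean "no hint registered". */
struct mock_irq_desc {
    const struct mock_cpumask *affinity_hint;
};

/* Single setter: pass a mask to register a hint, NULL to clear it. */
int mock_set_affinity_hint(struct mock_irq_desc *desc,
                           const struct mock_cpumask *m)
{
    if (!desc)
        return -1;
    desc->affinity_hint = m;
    return 0;
}

/* Returns 1 and copies the hint to 'out' if one is set, 0 otherwise. */
int mock_get_affinity_hint(const struct mock_irq_desc *desc,
                           struct mock_cpumask *out)
{
    if (!desc || !desc->affinity_hint)
        return 0;
    *out = *desc->affinity_hint;
    return 1;
}
```

A driver-side mock_cpumask_var_t still works as an argument, because the array decays to a pointer to its first element at the call site. The mask's lifetime stays the driver's responsibility in this sketch.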
>> + raw_spin_lock_irqsave(&desc->lock, flags);
>> + if (desc->affinity_hint) {
>> + seq_cpumask(m, desc->affinity_hint);
>
> Please make a local copy under desc->mask and do the seq_cpumask()
> stuff on the local copy outside of desc->lock
I just looked at the original show_affinity function, and it does not grab
desc->lock before copying mask out of desc. Should I follow that model,
or should I fix that function to honor desc->lock?
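The "copy under the lock, print outside it" pattern requested in the review can be sketched in userspace like this. It is an illustration only: a pthread mutex stands in for desc->lock, an unsigned long stands in for the cpumask, and the mock_desc, snapshot_hint() and show_hint() names are invented for the example.

```c
#include <pthread.h>
#include <stdio.h>

/* Userspace stand-in for the descriptor (hypothetical layout). */
struct mock_desc {
    pthread_mutex_t lock;
    unsigned long mask;     /* stand-in for the hint cpumask */
    int has_hint;
};

/*
 * Take a snapshot of the hint while holding the lock.
 * Returns 1 and fills '*copy' if a hint is set, 0 otherwise.
 */
int snapshot_hint(struct mock_desc *d, unsigned long *copy)
{
    int present;

    pthread_mutex_lock(&d->lock);
    present = d->has_hint;
    if (present)
        *copy = d->mask;
    pthread_mutex_unlock(&d->lock);
    return present;
}

/*
 * Format the snapshot with the lock already released, so the slow
 * formatting work never runs in the critical section. Printing "0"
 * when no hint is set is an arbitrary choice for this sketch.
 */
int show_hint(struct mock_desc *d, char *buf, size_t len)
{
    unsigned long copy = 0;

    snapshot_hint(d, &copy);
    return snprintf(buf, len, "%lx\n", copy);
}
```

The critical section shrinks to a flag read and a word copy. In the kernel the copy would be a cpumask copy rather than an integer assignment.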
-PJ
* Re: [PATCH net-next-2.6] net: speedup udp receive path
From: jamal @ 2010-04-29 20:36 UTC (permalink / raw)
To: Eric Dumazet
Cc: Changli Gao, David Miller, therbert, shemminger, netdev,
Eilon Greenstein, Brian Bloniarz
In-Reply-To: <1272549408.4258.189.camel@bigi>
[-- Attachment #1: Type: text/plain, Size: 738 bytes --]
On Thu, 2010-04-29 at 09:56 -0400, jamal wrote:
>
> I will try your program instead so we can reduce the variables
Results attached.
With your app rps does a hell of a lot better and non-rps worse ;->
With my proggie, non-rps does much better than yours and rps does
a lot worse for the same setup. I see the scheduler kicking in quite a bit in
non-rps for you...
The main difference between us as i see it is:
a) i use epoll - actually linked to libevent (1.0.something)
b) I fork processes and you use pthreads.
I don't have time to chase it today, but 1) I am either going to change
yours to use libevent or make mine get rid of it, then 2) move towards
pthreads or have yours fork,
then observe if that makes any difference.
cheers,
jamal
[-- Attachment #2: apr29-res.txt --]
[-- Type: text/plain, Size: 29074 bytes --]
No RPS; same kernel as yesterday with Eric's changes
-------------------------------------------------------------------------------
PerfTop: 2572 irqs/sec kernel:94.7% [1000Hz cycles], (all, 8 CPUs)
-------------------------------------------------------------------------------
samples pcnt function DSO
_______ _____ ___________________________ ________
2901.00 17.4% sky2_poll [sky2]
781.00 4.7% schedule [kernel]
574.00 3.4% __skb_recv_datagram [kernel]
518.00 3.1% _raw_spin_lock_irqsave [kernel]
460.00 2.8% udp_recvmsg [kernel]
457.00 2.7% copy_user_generic_string [kernel]
397.00 2.4% _raw_spin_lock_bh [kernel]
340.00 2.0% __udp4_lib_lookup [kernel]
320.00 1.9% ip_route_input [kernel]
295.00 1.8% _raw_spin_lock [kernel]
293.00 1.8% dst_release [kernel]
282.00 1.7% ip_rcv [kernel]
275.00 1.6% skb_copy_datagram_iovec [kernel]
263.00 1.6% __switch_to [kernel]
257.00 1.5% __alloc_skb [kernel]
256.00 1.5% system_call [kernel]
243.00 1.5% sock_recv_ts_and_drops [kernel]
227.00 1.4% sock_queue_rcv_skb [kernel]
225.00 1.3% _raw_spin_unlock_irqrestore [kernel]
220.00 1.3% fget_light [kernel]
218.00 1.3% pick_next_task_fair [kernel]
-------------------------------------------------------------------------------
PerfTop: 1000 irqs/sec kernel:100.0% [1000Hz cycles], (all, cpu: 0)
-------------------------------------------------------------------------------
samples pcnt function DSO
_______ _____ ___________________________ ________
1508.00 37.9% sky2_poll [sky2]
198.00 5.0% ip_route_input [kernel]
184.00 4.6% __udp4_lib_lookup [kernel]
172.00 4.3% ip_rcv [kernel]
139.00 3.5% _raw_spin_lock [kernel]
131.00 3.3% __alloc_skb [kernel]
130.00 3.3% sock_queue_rcv_skb [kernel]
111.00 2.8% __udp4_lib_rcv [kernel]
101.00 2.5% __netif_receive_skb [kernel]
78.00 2.0% select_task_rq_fair [kernel]
74.00 1.9% try_to_wake_up [kernel]
73.00 1.8% sock_def_readable [kernel]
72.00 1.8% _raw_spin_lock_irqsave [kernel]
67.00 1.7% task_rq_lock [kernel]
66.00 1.7% _raw_read_lock [kernel]
64.00 1.6% __kmalloc [kernel]
62.00 1.6% resched_task [kernel]
61.00 1.5% sky2_rx_submit [sky2]
52.00 1.3% ip_local_deliver [kernel]
51.00 1.3% kmem_cache_alloc [kernel]
51.00 1.3% swiotlb_sync_single [kernel]
43.00 1.1% sky2_remove [sky2]
41.00 1.0% udp_queue_rcv_skb [kernel]
39.00 1.0% __wake_up_common [kernel]
-------------------------------------------------------------------------------
PerfTop: 368 irqs/sec kernel:95.9% [1000Hz cycles], (all, cpu: 1)
-------------------------------------------------------------------------------
samples pcnt function DSO
_______ _____ ___________________________ ________
279.00 8.2% schedule [kernel]
260.00 7.7% __skb_recv_datagram [kernel]
196.00 5.8% _raw_spin_lock_bh [kernel]
180.00 5.3% copy_user_generic_string [kernel]
176.00 5.2% udp_recvmsg [kernel]
150.00 4.4% _raw_spin_lock_irqsave [kernel]
142.00 4.2% dst_release [kernel]
106.00 3.1% skb_copy_datagram_iovec [kernel]
97.00 2.9% sock_recv_ts_and_drops [kernel]
93.00 2.7% tick_nohz_stop_sched_tick [kernel]
89.00 2.6% sys_recvfrom [kernel]
89.00 2.6% __switch_to [kernel]
86.00 2.5% pick_next_task_fair [kernel]
82.00 2.4% sock_rfree [kernel]
75.00 2.2% system_call [kernel]
73.00 2.2% fget_light [kernel]
70.00 2.1% _raw_spin_lock_irq [kernel]
63.00 1.9% kmem_cache_free [kernel]
61.00 1.8% _raw_spin_unlock_irqrestore [kernel]
60.00 1.8% kfree [kernel]
56.00 1.7% select_nohz_load_balancer [kernel]
55.00 1.6% finish_task_switch [kernel]
48.00 1.4% inet_recvmsg [kernel]
41.00 1.2% security_socket_recvmsg [kernel]
-------------------------------------------------------------------------------
PerfTop: 97 irqs/sec kernel:81.4% [1000Hz cycles], (all, cpu: 7)
-------------------------------------------------------------------------------
samples pcnt function DSO
_______ _____ ____________________________ ________
55.00 10.8% schedule [kernel]
38.00 7.5% __skb_recv_datagram [kernel]
36.00 7.1% udp_recvmsg [kernel]
32.00 6.3% _raw_spin_lock_irqsave [kernel]
31.00 6.1% _raw_spin_lock_bh [kernel]
30.00 5.9% copy_user_generic_string [kernel]
29.00 5.7% sock_recv_ts_and_drops [kernel]
27.00 5.3% skb_copy_datagram_iovec [kernel]
17.00 3.3% system_call [kernel]
17.00 3.3% dst_release [kernel]
14.00 2.7% _raw_spin_unlock_irqrestore [kernel]
12.00 2.4% __switch_to [kernel]
12.00 2.4% pick_next_task_fair [kernel]
11.00 2.2% inet_recvmsg [kernel]
11.00 2.2% sys_recvfrom [kernel]
10.00 2.0% finish_task_switch [kernel]
10.00 2.0% sock_rfree [kernel]
10.00 2.0% select_nohz_load_balancer [kernel]
7.00 1.4% rcu_enter_nohz [kernel]
7.00 1.4% tick_nohz_stop_sched_tick [kernel]
7.00 1.4% tick_nohz_restart_sched_tick [kernel]
5.00 1.0% ktime_get [kernel]
Run1
----
557257 pps (557257 0:69750 1:69417 2:69063 3:68818 4:70139 5:69824 6:70135 7:70113)
737468 pps (1294725 0:162765 1:162430 2:162075 3:155770 4:163150 5:162838 6:163150 7:162549)
744238 pps (2038963 0:255795 1:255460 2:255105 3:248800 4:256180 5:255867 6:256180 7:255579)
719343 pps (2758306 0:348825 1:348202 2:348135 3:338166 4:349210 5:333030 6:349210 7:343528)
741830 pps (3500136 0:440870 1:440933 2:441165 3:430162 4:442240 5:425970 6:442240 7:436558)
686289 pps (4186425 0:533900 1:533749 2:515637 3:511486 4:531997 5:504717 6:525536 7:529406)
681708 pps (4868133 0:613701 1:617409 2:608667 3:599774 4:607480 5:589487 6:609802 7:621817)
697577 pps (5565710 0:704183 1:710439 2:688904 3:681696 4:689120 5:673932 6:702448 7:714988)
729284 pps (6294994 0:797213 1:803469 2:775863 3:770959 4:781160 5:766105 6:792207 7:808018)
734160 pps (7029154 0:886389 1:896504 2:868898 3:863506 4:868426 5:859138 6:885242 7:901053)
728541 pps (7757695 0:978789 1:989534 2:961928 3:946834 4:961458 5:952170 6:978272 7:988714)
709578 pps (8467273 0:1071819 1:1079000 2:1041101 3:1038974 4:1047215 5:1037254 6:1070168 7:1081744)
684154 pps (9151427 0:1160855 1:1158471 2:1122874 3:1129012 4:1136563 5:1120258 6:1153624 7:1169773)
498291 pps (9649718 0:1224303 1:1214178 2:1185737 3:1191467 4:1200058 5:1183753 6:1217121 7:1233101)
Essentially sinks about 96.5% of the 10M packets
run2
---
402553 pps (402553 0:51530 1:53289 2:53625 3:45748 4:53625 5:49484 6:42292 7:52960)
711539 pps (1114092 0:144028 1:146426 2:144237 3:124551 4:146760 5:142619 6:119376 7:146095)
692319 pps (1806411 0:208285 1:239557 2:220103 3:211096 4:239890 5:235749 6:212506 7:239225)
731896 pps (2538307 0:301450 1:332723 2:308718 3:304264 4:333055 5:320036 6:305671 7:332390)
712869 pps (3251176 0:393270 1:418806 2:397578 3:396844 4:426245 5:406943 6:398861 7:412629)
681513 pps (3932689 0:486300 1:501926 2:490613 3:489874 4:466455 5:499973 6:491891 7:505659)
697308 pps (4629997 0:567969 1:585032 2:583643 3:576712 4:548243 5:589399 6:581080 7:597922)
712903 pps (5342900 0:657579 1:660221 2:676673 3:669744 4:641273 5:682222 6:674110 7:681082)
687765 pps (6030665 0:744421 1:752470 2:764631 3:751445 4:722250 5:771799 6:761224 7:762426)
695799 pps (6726464 0:832438 1:842797 2:853337 3:844470 4:804427 5:857412 6:846918 7:844668)
720011 pps (7446475 0:925210 1:934696 2:934883 3:937280 4:894644 5:949883 6:932740 7:937142)
712021 pps (8158496 0:1017246 1:1027726 2:1016841 3:1024712 4:978513 5:1042913 6:1023516 7:1027031)
709810 pps (8868306 0:1098522 1:1111823 2:1109871 3:1117444 4:1070124 5:1131774 6:1109841 7:1118909)
591817 pps (9460123 0:1178005 1:1185698 2:1189381 3:1196367 4:1143880 5:1198406 6:1176121 7:1192265)
94.6%
run3
---
682714 pps (682714 0:83336 1:86683 2:86895 3:86243 4:84616 5:81152 6:86895 7:86895)
691212 pps (1373926 0:164602 1:179240 2:171897 3:174162 4:176509 5:158115 6:174083 7:175321)
661913 pps (2035839 0:243004 1:263829 2:259312 3:267160 4:268875 5:231009 6:253411 7:249239)
715612 pps (2751451 0:336034 1:350220 2:346461 3:360190 4:359219 5:317625 6:346441 7:335265)
655354 pps (3406805 0:419339 1:434934 2:432010 3:442138 4:437837 5:394805 6:427064 7:418679)
592126 pps (3998931 0:494253 1:511454 2:508829 3:511992 4:508978 5:474866 6:496884 7:491679)
697177 pps (4696108 0:584474 1:601703 2:589111 3:602252 4:598767 5:565114 6:582153 7:572539)
681004 pps (5377112 0:662864 1:684427 2:678825 3:688402 4:685441 5:651962 6:673697 7:651495)
669622 pps (6046734 0:740275 1:765126 2:762764 3:773772 4:772144 5:731330 6:762339 7:738987)
645906 pps (6692640 0:825606 1:850550 2:846793 3:858243 4:850408 5:812402 6:838248 7:810391)
705873 pps (7398513 0:916877 1:937693 2:929956 3:950433 4:938179 5:894913 6:928125 7:902337)
735460 pps (8133973 0:1009907 1:1030722 2:1022986 3:1037959 4:1031209 5:987943 6:1021155 7:992092)
707605 pps (8841578 0:1102933 1:1122367 2:1101160 3:1129212 4:1124239 5:1063617 6:1112929 7:1085122)
347807 pps (9189385 0:1149677 1:1168026 2:1147905 3:1170556 4:1158858 5:1110362 6:1152134 7:1131867)
91.9%
run4
----
552606 pps (552606 0:72743 1:75411 2:67732 3:70204 4:63741 5:64934 6:66096 7:71746)
684450 pps (1237056 0:162839 1:165064 2:148974 3:160417 4:153919 5:135895 6:156238 7:153710)
696799 pps (1933855 0:254440 1:252304 2:240107 3:249399 4:246028 5:228009 6:247409 7:216161)
676546 pps (2610401 0:341132 1:336959 2:325332 3:330438 4:336250 5:305238 6:336208 7:298848)
712251 pps (3322652 0:432976 1:428990 2:413228 3:419977 4:425918 5:386917 6:426275 7:388371)
615680 pps (3938332 0:515679 1:497421 2:491618 3:505449 4:489452 5:462820 6:505336 7:470561)
635467 pps (4573799 0:597340 1:582917 2:555389 3:582751 4:573273 5:545378 6:584378 7:552373)
725581 pps (5299380 0:690038 1:675870 2:636347 3:676029 4:666231 5:632208 6:677337 7:645324)
699015 pps (5998395 0:783068 1:763654 2:725184 3:762784 4:752559 5:709123 6:764439 7:737586)
674472 pps (6672867 0:872645 1:847669 2:808333 3:827766 4:842267 5:798997 6:853779 7:821412)
680913 pps (7353780 0:961487 1:926760 2:887273 3:919158 4:925165 5:891082 6:929793 7:913064)
666279 pps (8020059 0:1050823 1:1012028 2:972691 3:988738 4:1009904 5:974127 6:1017940 7:993808)
680615 pps (8700674 0:1124223 1:1087779 2:1057541 3:1080546 4:1094373 5:1066880 6:1102496 7:1086838)
420306 pps (9120980 0:1177541 1:1130287 2:1111621 3:1134624 4:1148453 5:1120960 6:1156576 7:1140918)
91.2%
run5
------
294229 pps (294229 0:38805 1:30946 2:32655 3:36613 4:38805 5:38805 6:38800 7:38801)
694748 pps (988977 0:124394 1:123976 2:114107 3:128079 4:111317 5:131835 6:131835 7:123434)
690185 pps (1679162 0:217405 1:216988 2:194192 3:204091 4:195948 5:224678 6:220924 7:204937)
726561 pps (2405723 0:307828 1:309671 2:278163 3:296811 4:286642 5:317346 6:311296 7:297967)
695974 pps (3101697 0:391228 1:395256 2:371056 3:388790 4:379533 5:410242 6:393051 7:372541)
665395 pps (3767092 0:473134 1:484367 2:447394 3:462837 4:471026 5:491170 6:473947 7:463219)
671483 pps (4438575 0:562883 1:574014 2:534258 3:544512 4:534064 5:581420 6:560073 7:547353)
679400 pps (5117975 0:641135 1:663809 2:618019 3:633448 4:605085 5:674433 6:649865 7:632183)
696263 pps (5814238 0:734516 1:743715 2:711049 3:717481 4:693193 5:758493 6:740374 7:715417)
681791 pps (6496029 0:823596 1:836004 2:795579 3:809104 4:783457 5:820061 6:820219 7:808010)
670672 pps (7166701 0:911202 1:927618 2:888127 3:875504 4:874363 5:889342 6:911838 7:888707)
743444 pps (7910145 0:1004233 1:1020652 2:981157 3:968534 4:967393 5:982078 6:1004362 7:981737)
725623 pps (8635768 0:1096546 1:1113682 2:1059978 3:1061564 4:1060423 5:1072761 6:1097392 7:1073423)
662504 pps (9298272 0:1171688 1:1197579 2:1137559 3:1154595 4:1146405 5:1161670 6:1176001 7:1152776)
12979 pps (9311251 0:1173488 1:1199379 2:1137914 3:1156399 4:1148209 5:1163475 6:1177806 7:1154581)
93.1%
Average for no-rps 93.5% of 10M incoming at ~ 750Kpps.
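The per-run percentages and the average can be rechecked from the last cumulative total of each no-RPS run. A small sketch follows, assuming exactly 10M packets were sent per run; the function and array names are made up for the example.

```c
#include <stddef.h>

/* Percentage of 'sent' packets that the receivers actually sank. */
double sink_percent(unsigned long received, unsigned long sent)
{
    return 100.0 * (double)received / (double)sent;
}

/* Mean of the per-run sink percentages; returns 0 for an empty set. */
double mean_sink_percent(const unsigned long *received, size_t runs,
                         unsigned long sent)
{
    double sum = 0.0;
    size_t i;

    for (i = 0; i < runs; i++)
        sum += sink_percent(received[i], sent);
    return runs ? sum / (double)runs : 0.0;
}

/* Final cumulative totals of the five no-RPS runs listed above. */
static const unsigned long no_rps_totals[5] = {
    9649718UL, 9460123UL, 9189385UL, 9120980UL, 9311251UL
};
```

mean_sink_percent(no_rps_totals, 5, 10000000UL) comes out at roughly 93.46, which matches the 93.5% quoted.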
# echo 1 > /proc/irq/55/smp_affinity
# echo ee > /sys/class/net/eth0/queues/rx-0/rps_cpus
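Both files take a hexadecimal CPU bitmask, with bit N standing for CPU N. The sketch below decodes such a mask into a CPU list. mask_to_cpus() is a hypothetical helper, and it handles only a single hex word; as far as I know the real files also accept comma-separated groups on larger machines, which the sketch ignores.

```c
#include <stdlib.h>

/*
 * Decode a hex CPU mask string (the format written to smp_affinity and
 * rps_cpus above) into a list of CPU numbers. Stores at most 'max'
 * entries in 'cpus' and returns how many were stored.
 */
int mask_to_cpus(const char *hex, int *cpus, int max)
{
    unsigned long mask = strtoul(hex, NULL, 16);
    int n = 0;
    int cpu;

    /* Bit N set means CPU N is part of the mask. */
    for (cpu = 0; mask != 0 && n < max; cpu++, mask >>= 1) {
        if (mask & 1UL)
            cpus[n++] = cpu;
    }
    return n;
}
```

With these values, "1" decodes to CPU 0 only (the IRQ), and "ee" (binary 11101110) decodes to CPUs 1, 2, 3, 5, 6 and 7 (the RPS targets).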
-------------------------------------------------------------------------------
PerfTop: 2273 irqs/sec kernel:93.7% [1000Hz cycles], (all, 8 CPUs)
-------------------------------------------------------------------------------
samples pcnt function DSO
_______ _____ ______________________________ ________
922.00 10.3% sky2_poll [sky2]
402.00 4.5% __netif_receive_skb [kernel]
400.00 4.4% ip_rcv [kernel]
356.00 4.0% call_function_single_interrupt [kernel]
339.00 3.8% ip_route_input [kernel]
282.00 3.1% schedule [kernel]
194.00 2.2% _raw_spin_lock_irqsave [kernel]
180.00 2.0% sock_recv_ts_and_drops [kernel]
178.00 2.0% _raw_spin_lock [kernel]
173.00 1.9% __udp4_lib_lookup [kernel]
171.00 1.9% __udp4_lib_rcv [kernel]
162.00 1.8% system_call [kernel]
154.00 1.7% kfree [kernel]
147.00 1.6% __skb_recv_datagram [kernel]
146.00 1.6% copy_user_generic_string [kernel]
136.00 1.5% dst_release [kernel]
136.00 1.5% _raw_spin_unlock_irqrestore [kernel]
126.00 1.4% fget_light [kernel]
126.00 1.4% sky2_intr [sky2]
122.00 1.4% udp_recvmsg [kernel]
111.00 1.2% sock_queue_rcv_skb [kernel]
-------------------------------------------------------------------------------
PerfTop: 325 irqs/sec kernel:93.2% [1000Hz cycles], (all, cpu: 0)
-------------------------------------------------------------------------------
samples pcnt function DSO
_______ _____ ___________________________________ ________
1033.00 62.9% sky2_poll [sky2]
159.00 9.7% sky2_intr [sky2]
119.00 7.3% irq_entries_start [kernel]
51.00 3.1% __alloc_skb [kernel]
48.00 2.9% get_rps_cpu [kernel]
24.00 1.5% __kmalloc [kernel]
23.00 1.4% swiotlb_sync_single [kernel]
20.00 1.2% _raw_spin_lock [kernel]
17.00 1.0% sky2_rx_submit [sky2]
15.00 0.9% enqueue_to_backlog [kernel]
14.00 0.9% kmem_cache_alloc [kernel]
11.00 0.7% default_send_IPI_mask_sequence_phys [kernel]
10.00 0.6% sky2_remove [sky2]
10.00 0.6% cache_alloc_refill [kernel]
8.00 0.5% _raw_spin_lock_irqsave [kernel]
7.00 0.4% dev_gro_receive [kernel]
6.00 0.4% net_rx_action [kernel]
6.00 0.4% __netdev_alloc_skb [kernel]
6.00 0.4% load_balance [kernel]
5.00 0.3% __smp_call_function_single [kernel]
-------------------------------------------------------------------------------
PerfTop: 347 irqs/sec kernel:96.3% [1000Hz cycles], (all, cpu: 1)
-------------------------------------------------------------------------------
samples pcnt function DSO
_______ _____ ______________________________ ________
104.00 6.7% call_function_single_interrupt [kernel]
104.00 6.7% __netif_receive_skb [kernel]
95.00 6.1% ip_rcv [kernel]
93.00 6.0% ip_route_input [kernel]
62.00 4.0% schedule [kernel]
49.00 3.2% sock_recv_ts_and_drops [kernel]
46.00 3.0% system_call [kernel]
46.00 3.0% dst_release [kernel]
45.00 2.9% _raw_spin_lock [kernel]
41.00 2.7% _raw_spin_lock_irqsave [kernel]
40.00 2.6% _raw_spin_unlock_irqrestore [kernel]
36.00 2.3% copy_user_generic_string [kernel]
34.00 2.2% __udp4_lib_rcv [kernel]
30.00 1.9% fget_light [kernel]
30.00 1.9% sock_queue_rcv_skb [kernel]
28.00 1.8% udp_recvmsg [kernel]
28.00 1.8% __udp4_lib_lookup [kernel]
26.00 1.7% select_task_rq_fair [kernel]
25.00 1.6% tick_nohz_stop_sched_tick [kernel]
23.00 1.5% __napi_complete [kernel]
20.00 1.3% __switch_to [kernel]
20.00 1.3% finish_task_switch [kernel]
20.00 1.3% kmem_cache_free [kernel]
20.00 1.3% sys_recvfrom [kernel]
19.00 1.2% kfree [kernel]
19.00 1.2% __skb_recv_datagram [kernel]
-------------------------------------------------------------------------------
PerfTop: 243 irqs/sec kernel:95.5% [1000Hz cycles], (all, cpu: 7)
-------------------------------------------------------------------------------
samples pcnt function DSO
_______ _____ ______________________________ ________
92.00 7.3% ip_rcv [kernel]
74.00 5.9% __netif_receive_skb [kernel]
57.00 4.6% ip_route_input [kernel]
49.00 3.9% sock_recv_ts_and_drops [kernel]
49.00 3.9% system_call [kernel]
47.00 3.8% schedule [kernel]
39.00 3.1% _raw_spin_lock_irqsave [kernel]
36.00 2.9% call_function_single_interrupt [kernel]
34.00 2.7% udp_recvmsg [kernel]
32.00 2.6% __udp4_lib_rcv [kernel]
31.00 2.5% copy_user_generic_string [kernel]
31.00 2.5% fget_light [kernel]
30.00 2.4% __udp4_lib_lookup [kernel]
26.00 2.1% kfree [kernel]
25.00 2.0% __skb_recv_datagram [kernel]
25.00 2.0% sock_queue_rcv_skb [kernel]
23.00 1.8% __switch_to [kernel]
22.00 1.8% sock_recvmsg [kernel]
22.00 1.8% _raw_spin_unlock_irqrestore [kernel]
21.00 1.7% select_task_rq_fair [kernel]
18.00 1.4% _raw_spin_lock [kernel]
17.00 1.4% process_backlog [kernel]
17.00 1.4% sys_recvfrom [kernel]
17.00 1.4% _raw_spin_lock_bh [kernel]
run1
----
590479 pps (590479 0:73820 1:73817 2:73820 3:73819 4:73815 5:73815 6:73815 7:73815)
744641 pps (1335120 0:166895 1:166895 2:166895 3:166895 4:166895 5:166895 6:166895 7:166895)
744374 pps (2079494 0:259940 1:259940 2:259940 3:259940 4:259940 5:259940 6:259940 7:259940)
744340 pps (2823834 0:352985 1:352985 2:352985 3:352985 4:352985 5:352985 6:352980 7:352985)
744390 pps (3568224 0:446035 1:446035 2:446035 3:446035 4:446035 5:446035 6:446032 7:446030)
744404 pps (4312628 0:539085 1:539085 2:539085 3:539081 4:539085 5:539085 6:539085 7:539085)
744369 pps (5056997 0:632130 1:632130 2:632130 3:632130 4:632130 5:632130 6:632130 7:632130)
744394 pps (5801391 0:725180 1:725180 2:725180 3:725180 4:725180 5:725180 6:725180 7:725180)
744399 pps (6545790 0:818230 1:818230 2:818229 3:818230 4:818230 5:818226 6:818225 7:818225)
744354 pps (7290144 0:911275 1:911275 2:911275 3:911275 4:911270 5:911270 6:911270 7:911270)
744363 pps (8034507 0:1004320 1:1004320 2:1004320 3:1004320 4:1004320 5:1004306 6:1004320 7:1004317)
744379 pps (8778886 0:1097370 1:1097368 2:1097370 3:1097370 4:1097370 5:1097356 6:1097367 7:1097365)
744449 pps (9523335 0:1190425 1:1190425 2:1190425 3:1190421 4:1190425 5:1190411 6:1190425 7:1190425)
476651 pps (9999986 0:1250000 1:1250000 2:1250000 3:1250000 4:1250000 5:1249986 6:1250000 7:1250000)
99.9% !
rps counter..
865721 rps
1067721 rps
run2
----
573759 pps (573759 0:71720 1:71720 2:71720 3:71723 4:71721 5:71720 6:71720 7:71719)
744249 pps (1318008 0:164755 1:164753 2:164750 3:164750 4:164750 5:164750 6:164750 7:164750)
744260 pps (2062268 0:257785 1:257785 2:257785 3:257785 4:257785 5:257783 6:257780 7:257780)
744238 pps (2806506 0:350815 1:350815 2:350815 3:350815 4:350815 5:350811 6:350810 7:350810)
744233 pps (3550739 0:443845 1:443845 2:443845 3:443845 4:443844 5:443841 6:443841 7:443840)
744236 pps (4294975 0:536875 1:536875 2:536875 3:536870 4:536870 5:536870 6:536870 7:536870)
744244 pps (5039219 0:629905 1:629905 2:629905 3:629905 4:629905 5:629901 6:629901 7:629900)
744240 pps (5783459 0:722935 1:722935 2:722935 3:722934 4:722930 5:722930 6:722930 7:722930)
744214 pps (6527673 0:815962 1:815960 2:815965 3:815963 4:815962 5:815960 6:815955 7:815955)
744268 pps (7271941 0:908995 1:908995 2:908995 3:908995 4:908991 5:908990 6:908990 7:908990)
744239 pps (8016180 0:1002025 1:1002025 2:1002025 3:1002025 4:1002020 5:1002020 6:1002020 7:1002020)
744241 pps (8760421 0:1095055 1:1095055 2:1095052 3:1095055 4:1095055 5:1095050 6:1095050 7:1095050)
744234 pps (9504655 0:1188085 1:1188085 2:1188084 3:1188085 4:1188085 5:1188081 6:1188080 7:1188080)
495345 pps (10000000 0:1250000 1:1250000 2:1250000 3:1250000 4:1250000 5:1250000 6:1250000 7:1250000)
100.0% !!!
rps count ..
3651 rps
1455997 rps
498777 rps
run3
----
72947 pps (72947 0:9120 1:9120 2:9120 3:9120 4:9120 5:9117 6:9115 7:9115)
744616 pps (817563 0:102198 1:102195 2:102195 3:102195 4:102195 5:102195 6:102195 7:102195)
744710 pps (1562273 0:195285 1:195285 2:195285 3:195285 4:195285 5:195285 6:195285 7:195283)
744478 pps (2306751 0:288345 1:288345 2:288345 3:288345 4:288345 5:288345 6:288341 7:288340)
744603 pps (3051354 0:381422 1:381420 2:381420 3:381414 4:381420 5:381420 6:381420 7:381420)
744475 pps (3795829 0:474480 1:474480 2:474480 3:474472 4:474480 5:474480 6:474480 7:474477)
744740 pps (4540569 0:567575 1:567575 2:567575 3:567564 4:567570 5:567570 6:567570 7:567570)
744641 pps (5285210 0:660655 1:660655 2:660655 3:660646 4:660650 5:660650 6:660650 7:660650)
744300 pps (6029510 0:753695 1:753690 2:753690 3:753682 4:753690 5:753690 6:753690 7:753690)
744249 pps (6773759 0:846725 1:846725 2:846725 3:846712 4:846720 5:846720 6:846720 7:846720)
744709 pps (7518468 0:939814 1:939810 2:939810 3:939802 4:939810 5:939810 6:939810 7:939810)
744647 pps (8263115 0:1032893 1:1032890 2:1032890 3:1032882 4:1032890 5:1032890 6:1032890 7:1032890)
744672 pps (9007787 0:1125976 1:1125975 2:1125975 3:1125967 4:1125975 5:1125975 6:1125975 7:1125970)
744692 pps (9752479 0:1219065 1:1219065 2:1219062 3:1219056 4:1219060 5:1219060 6:1219060 7:1219060)
247513 pps (9999992 0:1250000 1:1250000 2:1250000 3:1249992 4:1250000 5:1250000 6:1250000 7:1250000)
99.9%!
rps count ...
1118484 rps
842940 rps
run4
----
288558 pps (288558 0:36070 1:36070 2:36070 3:36070 4:36070 5:36070 6:36070 7:36068)
744237 pps (1032795 0:129103 1:129100 2:129105 3:129100 4:129100 5:129100 6:129095 7:129095)
742988 pps (1775783 0:222135 1:222135 2:222135 3:222135 4:220853 5:222130 6:222130 7:222130)
744210 pps (2519993 0:315160 1:315160 2:315160 3:315160 4:313883 5:315160 6:315155 7:315155)
744214 pps (3264207 0:408189 1:408185 2:408185 3:408185 4:406908 5:408185 6:408185 7:408185)
744278 pps (4008485 0:501223 1:501220 2:501220 3:501220 4:499943 5:501220 6:501220 7:501220)
743699 pps (4752184 0:594252 1:594250 2:593718 3:594250 4:592973 5:594250 6:594248 7:594245)
744243 pps (5496427 0:687280 1:687280 2:686748 3:687280 4:686003 5:687280 6:687280 7:687276)
744231 pps (6240658 0:780310 1:780310 2:779778 3:780310 4:779033 5:780300 6:780310 7:780307)
743958 pps (6984616 0:873342 1:873340 2:872808 3:873340 4:872063 5:873043 6:873340 7:873340)
744241 pps (7728857 0:966373 1:966370 2:965838 3:966370 4:965093 5:966073 6:966370 7:966370)
744232 pps (8473089 0:1059400 1:1059400 2:1058868 3:1059400 4:1058123 5:1059103 6:1059397 7:1059398)
743660 pps (9216749 0:1152434 1:1152430 2:1151898 3:1152430 4:1151153 5:1151556 6:1152427 7:1152430)
744251 pps (9961000 0:1245463 1:1245460 2:1244928 3:1245460 4:1244183 5:1244586 6:1245460 7:1245460)
36317 pps (9997317 0:1250000 1:1250000 2:1249468 3:1250000 4:1248723 5:1249126 6:1250000 7:1250000)
99.9%!
rps count
818552 rps
1146570 rps
run 5
----
686211 pps (686211 0:85780 1:85780 2:85775 3:85779 4:85780 5:85780 6:85775 7:85775)
744260 pps (1430471 0:178810 1:178810 2:178810 3:178810 4:178810 5:178810 6:178806 7:178805)
744242 pps (2174713 0:271840 1:271840 2:271840 3:271840 4:271840 5:271840 6:271838 7:271835)
744241 pps (2918954 0:364870 1:364870 2:364870 3:364870 4:364870 5:364870 6:364869 7:364865)
744238 pps (3663192 0:457900 1:457900 2:457900 3:457900 4:457900 5:457900 6:457900 7:457899)
744240 pps (4407432 0:550930 1:550930 2:550930 3:550930 4:550930 5:550930 6:550927 7:550925)
744244 pps (5151676 0:643960 1:643960 2:643960 3:643960 4:643960 5:643960 6:643960 7:643956)
744236 pps (5895912 0:736990 1:736990 2:736990 3:736990 4:736990 5:736990 6:736987 7:736985)
744241 pps (6640153 0:830020 1:830020 2:830020 3:830020 4:830020 5:830020 6:830018 7:830015)
744235 pps (7384388 0:923050 1:923050 2:923050 3:923050 4:923050 5:923049 6:923045 7:923047)
744244 pps (8128632 0:1016080 1:1016080 2:1016080 3:1016080 4:1016080 5:1016080 6:1016079 7:1016075)
744231 pps (8872863 0:1109110 1:1109110 2:1109110 3:1109110 4:1109108 5:1109105 6:1109105 7:1109105)
744258 pps (9617121 0:1202141 1:1202140 2:1202140 3:1202140 4:1202140 5:1202140 6:1202140 7:1202140)
382879 pps (10000000 0:1250000 1:1250000 2:1250000 3:1250000 4:1250000 5:1250000 6:1250000 7:1250000)
100%
rpsipi count ..
768383 rps
1178132 rps
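The percentage printed after each run looks like the share of sent packets that were received. A minimal sketch for checking it from the last cumulative line of a run follows; it assumes 10,000,000 packets were sent per run (inferred from the 100% run, not stated in the output), and the function names are made up for illustration.

```python
import re

SENT_PER_RUN = 10_000_000  # assumption: inferred from the run that reports 100%


def parse_pps_line(line):
    """Split '<rate> pps (<total> idx:count ...)' into rate, total, counters."""
    m = re.match(r"\s*(\d+) pps \((\d+)((?: \d+:\d+)*)\)", line)
    if m is None:
        raise ValueError("not a pps line: %r" % line)
    rate, total = int(m.group(1)), int(m.group(2))
    counters = {}
    for item in m.group(3).split():
        idx, count = item.split(":")
        counters[int(idx)] = int(count)
    return rate, total, counters


def delivered_percent(total, sent=SENT_PER_RUN):
    """Received packets as a percentage of the packets assumed sent."""
    return 100.0 * total / sent
```

Feeding it the last line of the previous run gives a total of 9997317, i.e. roughly 99.97%, which the tool appears to print truncated as 99.9%.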
* Re: [PATCH net-next-2.6] net: speedup udp receive path
From: jamal @ 2010-04-29 13:56 UTC (permalink / raw)
To: Eric Dumazet
Cc: Changli Gao, David Miller, therbert, shemminger, netdev,
Eilon Greenstein, Brian Bloniarz
In-Reply-To: <1272548980.2222.87.camel@edumazet-laptop>
On Thu, 2010-04-29 at 15:49 +0200, Eric Dumazet wrote:
> > I fork one instance per detected cpu and bind to different ports each
> > time. Example bind to port 8200 on cpu0, 8201 on cpu1, etc.
> >
>
> I guess this is the problem ;)
>
> With RPS, you should not bind your threads to a cpu.
> It is the rps hash that will decide for you.
>
Sorry - I was not clear; I have the option of binding to a cpu
vs the setsched api; but what I meant in this case is:
- for each cpu detected, fork
-- open socket
--- bind to udp port cpu# + 8200
I could also bind to a cpu in the last step and I did notice it
improved distribution - but all my tests since Apr 23 don't do that ;->
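The setup described above can be sketched as follows. This is only an illustration of the "one process per cpu, port = cpu# + 8200" scheme; the function names and the handler callback are hypothetical and this is not jamal's actual test program. No cpu binding is done, matching the tests described.

```python
import os
import socket

BASE_PORT = 8200  # from the message: cpu0 -> 8200, cpu1 -> 8201, ...


def port_for_cpu(cpu, base=BASE_PORT):
    """UDP port used by the receiver forked for a given cpu number."""
    return base + cpu


def open_udp_receiver(port, host="0.0.0.0"):
    """Open a UDP socket and bind it to the given port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def spawn_receivers(ncpus, handler):
    """Fork one child per detected cpu; each child binds its own port.

    handler(cpu, sock) is a hypothetical callback that does the receiving.
    Returns the list of child pids to the parent.
    """
    pids = []
    for cpu in range(ncpus):
        pid = os.fork()
        if pid == 0:  # child process
            sock = open_udp_receiver(port_for_cpu(cpu))
            try:
                handler(cpu, sock)
            finally:
                os._exit(0)
        pids.append(pid)
    return pids
```

A caller would typically pass os.cpu_count() as ncpus and wait on the returned pids.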
>
> I am using following program :
>
I will try your program instead so we can reduce the variables.
cheers,
jamal
* Re: [PATCH 0/3] [RFC] ptp: IEEE 1588 clock support
From: Richard Cochran @ 2010-04-29 6:54 UTC (permalink / raw)
To: Wolfgang Grandegger; +Cc: netdev
In-Reply-To: <4BD846C7.1050006@grandegger.com>
On Wed, Apr 28, 2010 at 04:31:35PM +0200, Wolfgang Grandegger wrote:
> That's because some 1588_PPS related bits are not yet setup in the
> platform code of mainline kernel.
So did you get it working?
I am reworking this patch set to post again, but perhaps you might
take a look at the patch below. It configures the gianfar PTP clock
parameters via the device tree.
Richard
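For readers checking the device tree values in the patch below: the numbers are consistent with tmr_add being 2^32 * nominal_freq / timer_osc (rounded) and with each fiper value being the desired period minus one tclk_period. That relationship is inferred here from the constants in the patch (TIMER_OSC 166666666, NOMINAL_FREQ 100000000, TCLK_PERIOD 10), not stated in it, so treat this as a sketch and the hardware manual as the authority. The helper names are made up.

```python
def tmr_add(nominal_hz, osc_hz):
    """Rounded 2**32 * nominal_hz / osc_hz, using integer arithmetic only."""
    return ((nominal_hz << 32) + osc_hz // 2) // osc_hz


def fiper(period_ns, tclk_period_ns):
    """Assumed relation: desired pulse period minus one timer clock period."""
    return period_ns - tclk_period_ns


# mpc8313erdb values, using the constants removed from gianfar_ptp.c
print(hex(tmr_add(100_000_000, 166_666_666)))  # matches tmr_add in the dts
print(hex(fiper(1_000_000_000, 10)))           # matches tmr_fiper1 (1 s period)
print(hex(fiper(100_000, 10)))                 # matches tmr_fiper2 (100 us period)
```

The same fiper relation also reproduces the p2020 values with tclk_period = 5.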
[PATCH] ptp: gianfar clock uses device tree parameters
Signed-off-by: Richard Cochran <richard.cochran@omicron.at>
---
arch/powerpc/boot/dts/mpc8313erdb.dts | 14 +++++
arch/powerpc/boot/dts/p2020ds.dts | 13 ++++
arch/powerpc/boot/dts/p2020rdb.dts | 14 +++++
drivers/net/gianfar_ptp.c | 102 ++++++++++++++++++++++----------
4 files changed, 111 insertions(+), 32 deletions(-)
diff --git a/arch/powerpc/boot/dts/mpc8313erdb.dts b/arch/powerpc/boot/dts/mpc8313erdb.dts
index 183f2aa..b760aee 100644
--- a/arch/powerpc/boot/dts/mpc8313erdb.dts
+++ b/arch/powerpc/boot/dts/mpc8313erdb.dts
@@ -208,6 +208,20 @@
sleep = <&pmc 0x00300000>;
};
+ ptp_clock@24E00 {
+ device_type = "ptp_clock";
+ model = "eTSEC";
+ reg = <0x24E00 0xB0>;
+ interrupts = <0x0C 2 0x0D 2>;
+ interrupt-parent = < &ipic >;
+ tclk_period = <10>;
+ tmr_prsc = <100>;
+ tmr_add = <0x999999A4>;
+ cksel = <0x1>;
+ tmr_fiper1 = <0x3B9AC9F6>;
+ tmr_fiper2 = <0x00018696>;
+ };
+
enet0: ethernet@24000 {
#address-cells = <1>;
#size-cells = <1>;
diff --git a/arch/powerpc/boot/dts/p2020ds.dts b/arch/powerpc/boot/dts/p2020ds.dts
index 1101914..1dcf790 100644
--- a/arch/powerpc/boot/dts/p2020ds.dts
+++ b/arch/powerpc/boot/dts/p2020ds.dts
@@ -336,6 +336,19 @@
phy_type = "ulpi";
};
+ ptp_clock@24E00 {
+ device_type = "ptp_clock";
+ model = "eTSEC";
+ reg = <0x24E00 0xB0>;
+ interrupts = <0x0C 2 0x0D 2>;
+ interrupt-parent = < &mpic >;
+ tclk_period = <5>;
+ tmr_prsc = <200>;
+ tmr_add = <0xCCCCCCCD>;
+ tmr_fiper1 = <0x3B9AC9FB>;
+ tmr_fiper2 = <0x0001869B>;
+ };
+
enet0: ethernet@24000 {
#address-cells = <1>;
#size-cells = <1>;
diff --git a/arch/powerpc/boot/dts/p2020rdb.dts b/arch/powerpc/boot/dts/p2020rdb.dts
index da4cb0d..ba61e8e 100644
--- a/arch/powerpc/boot/dts/p2020rdb.dts
+++ b/arch/powerpc/boot/dts/p2020rdb.dts
@@ -396,6 +396,20 @@
phy_type = "ulpi";
};
+ ptp_clock@24E00 {
+ device_type = "ptp_clock";
+ model = "eTSEC";
+ reg = <0x24E00 0xB0>;
+ interrupts = <0x0C 2 0x0D 2>;
+ interrupt-parent = < &mpic >;
+ tclk_period = <5>;
+ tmr_prsc = <200>;
+ tmr_add = <0xCCCCCCCD>;
+ cksel = <1>;
+ tmr_fiper1 = <0x3B9AC9FB>;
+ tmr_fiper2 = <0x0001869B>;
+ };
+
enet0: ethernet@24000 {
#address-cells = <1>;
#size-cells = <1>;
diff --git a/drivers/net/gianfar_ptp.c b/drivers/net/gianfar_ptp.c
index eed3246..ed6234c 100644
--- a/drivers/net/gianfar_ptp.c
+++ b/drivers/net/gianfar_ptp.c
@@ -22,6 +22,8 @@
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/module.h>
+#include <linux/of.h>
+#include <linux/of_platform.h>
#include <linux/timex.h>
#include <asm/io.h>
@@ -29,29 +31,16 @@
#include "gianfar_ptp_reg.h"
-/*
- *
- * TODO - get the following from device tree
- *
- */
-#define TMR_BASE_KERNEL 0xe0024e00 // CONFIG_PPC_85xx 0xffe24e00
-#define TIMER_OSC 166666666
-#define TCLK_PERIOD 10
-#define NOMINAL_FREQ 100000000
-#define DEF_TMR_PRSC 100
-#define DEF_TMR_ADD 0x999999A4
-#define DEFAULT_CKSEL 1
-
#define REG_SIZE (4 + TMR_ETTS2_L)
struct etsects {
void *regs;
- u32 timer_osc; /* Hz */
u32 tclk_period; /* nanoseconds */
- s64 nominal_freq; /* Hz */
u32 tmr_prsc;
u32 tmr_add;
u32 cksel;
+ u32 tmr_fiper1;
+ u32 tmr_fiper2;
};
/* Private globals */
@@ -111,8 +100,8 @@ static void set_fipers(struct etsects *etsects)
reg_write(etsects, TMR_CTRL, tmr_ctrl & (~TE));
reg_write(etsects, TMR_PRSC, etsects->tmr_prsc);
- reg_write(etsects, TMR_FIPER1, 0x3B9AC9F6);
- reg_write(etsects, TMR_FIPER2, 0x00018696);
+ reg_write(etsects, TMR_FIPER1, etsects->tmr_fiper1);
+ reg_write(etsects, TMR_FIPER2, etsects->tmr_fiper2);
set_alarm(etsects);
reg_write(etsects, TMR_CTRL, tmr_ctrl|TE);
}
@@ -213,34 +202,51 @@ struct ptp_clock_info ptp_gianfar_caps = {
.enable = ptp_gianfar_enable,
};
-/* module operations */
+/* OF device tree */
-static void __exit ptp_gianfar_exit(void)
+static int get_of_u32(struct device_node *node, char *str, u32 *val)
{
- ptp_clock_unregister(&ptp_gianfar_caps);
- iounmap(the_clock.regs);
+ int plen;
+ const u32 *prop = of_get_property(node, str, &plen);
+
+ if (!prop || plen != sizeof(*prop))
+ return -1;
+ *val = *prop;
+ return 0;
}
-static int __init ptp_gianfar_init(void)
+static int gianfar_ptp_probe(struct of_device* dev,
+ const struct of_device_id *match)
{
+ u64 addr, size;
+ struct device_node *node = dev->node;
struct etsects *etsects = &the_clock;
struct timespec now;
- phys_addr_t reg_addr = TMR_BASE_KERNEL;
- unsigned long reg_size = REG_SIZE;
+ phys_addr_t reg_addr;
+ unsigned long reg_size;
u32 tmr_ctrl;
int err;
+ if (get_of_u32(node, "tclk_period", &etsects->tclk_period) ||
+ get_of_u32(node, "tmr_prsc", &etsects->tmr_prsc) ||
+ get_of_u32(node, "tmr_add", &etsects->tmr_add) ||
+ get_of_u32(node, "cksel", &etsects->cksel) ||
+ get_of_u32(node, "tmr_fiper1", &etsects->tmr_fiper1) ||
+ get_of_u32(node, "tmr_fiper2", &etsects->tmr_fiper2))
+ return -ENODEV;
+
+ addr = of_translate_address(node, of_get_address(node, 0, &size, NULL));
+ reg_addr = addr;
+ reg_size = size;
+ if (reg_size < REG_SIZE) {
+ pr_warning("device tree reg range %lu too small\n", reg_size);
+ reg_size = REG_SIZE;
+ }
etsects->regs = ioremap(reg_addr, reg_size);
if (!etsects->regs) {
pr_err("ioremap ptp registers failed\n");
return -EINVAL;
}
- etsects->timer_osc = TIMER_OSC;
- etsects->tclk_period = TCLK_PERIOD;
- etsects->nominal_freq = NOMINAL_FREQ;
- etsects->tmr_prsc = DEF_TMR_PRSC;
- etsects->tmr_add = DEF_TMR_ADD;
- etsects->cksel = DEFAULT_CKSEL;
tmr_ctrl =
(etsects->tclk_period & TCLK_PERIOD_MASK) << TCLK_PERIOD_SHIFT |
@@ -252,8 +258,8 @@ static int __init ptp_gianfar_init(void)
reg_write(etsects, TMR_CTRL, tmr_ctrl);
reg_write(etsects, TMR_ADD, etsects->tmr_add);
reg_write(etsects, TMR_PRSC, etsects->tmr_prsc);
- reg_write(etsects, TMR_FIPER1, 0x3B9AC9F6);
- reg_write(etsects, TMR_FIPER2, 0x00018696);
+ reg_write(etsects, TMR_FIPER1, etsects->tmr_fiper1);
+ reg_write(etsects, TMR_FIPER2, etsects->tmr_fiper2);
set_alarm(etsects);
reg_write(etsects, TMR_CTRL, tmr_ctrl|FS|RTPE|TE);
@@ -261,6 +267,38 @@ static int __init ptp_gianfar_init(void)
return err;
}
+static int gianfar_ptp_remove(struct of_device* dev)
+{
+ ptp_clock_unregister(&ptp_gianfar_caps);
+ iounmap(the_clock.regs);
+ return 0;
+}
+
+static struct of_device_id match_table[] = {
+ { .type = "ptp_clock" },
+ {},
+};
+
+static struct of_platform_driver gianfar_ptp_driver = {
+ .name = "gianfar_ptp",
+ .match_table = match_table,
+ .owner = THIS_MODULE,
+ .probe = gianfar_ptp_probe,
+ .remove = gianfar_ptp_remove,
+};
+
+/* module operations */
+
+static void __exit ptp_gianfar_exit(void)
+{
+ of_unregister_platform_driver(&gianfar_ptp_driver);
+}
+
+static int __init ptp_gianfar_init(void)
+{
+ return of_register_platform_driver(&gianfar_ptp_driver);
+}
+
subsys_initcall(ptp_gianfar_init);
module_exit(ptp_gianfar_exit);
--
1.6.0.4
* [PATCH net-next-2.6] net: speedup sock_recv_ts_and_drops()
From: Eric Dumazet @ 2010-04-29 5:14 UTC (permalink / raw)
To: David Miller; +Cc: netdev
sock_recv_ts_and_drops() is fat and slow (~4% of cpu time on some
profiles).
We can test all socket flags at once to make the fast path fast again.
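The idea of testing all flags at once can be modelled in userspace as below. This is only an illustration: the flag names come from the patch, but the bit positions (and the unrelated SOCK_DEAD entry used for contrast) are hypothetical, and the real test is the C macro in the diff that follows.

```python
# Hypothetical bit positions; the kernel's real enum values are not shown here.
SOCK_DEAD = 0
SOCK_RCVTSTAMP = 1
SOCK_RXQ_OVFL = 2
SOCK_TIMESTAMPING_RX_SOFTWARE = 3
SOCK_TIMESTAMPING_SOFTWARE = 4
SOCK_TIMESTAMPING_RAW_HARDWARE = 5
SOCK_TIMESTAMPING_SYS_HARDWARE = 6

# One precomputed mask covering every flag that needs the slow path.
FLAGS_TS_OR_DROPS = 0
for bit in (SOCK_RXQ_OVFL,
            SOCK_RCVTSTAMP,
            SOCK_TIMESTAMPING_RX_SOFTWARE,
            SOCK_TIMESTAMPING_SOFTWARE,
            SOCK_TIMESTAMPING_RAW_HARDWARE,
            SOCK_TIMESTAMPING_SYS_HARDWARE):
    FLAGS_TS_OR_DROPS |= 1 << bit


def needs_slow_path(sk_flags):
    """A single AND replaces six separate flag tests."""
    return bool(sk_flags & FLAGS_TS_OR_DROPS)
```

A socket with none of these flags set takes the cheap branch, which in the patch only copies skb->tstamp into sk->sk_stamp.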
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
include/net/sock.h | 19 ++++++++++++++++++-
net/socket.c | 4 ++--
2 files changed, 20 insertions(+), 3 deletions(-)
diff --git a/include/net/sock.h b/include/net/sock.h
index d361c77..e1777db 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1635,7 +1635,24 @@ sock_recv_timestamp(struct msghdr *msg, struct sock *sk, struct sk_buff *skb)
sk->sk_stamp = kt;
}
-extern void sock_recv_ts_and_drops(struct msghdr *msg, struct sock *sk, struct sk_buff *skb);
+extern void __sock_recv_ts_and_drops(struct msghdr *msg, struct sock *sk,
+ struct sk_buff *skb);
+
+static inline void sock_recv_ts_and_drops(struct msghdr *msg, struct sock *sk,
+ struct sk_buff *skb)
+{
+#define FLAGS_TS_OR_DROPS ((1UL << SOCK_RXQ_OVFL) | \
+ (1UL << SOCK_RCVTSTAMP) | \
+ (1UL << SOCK_TIMESTAMPING_RX_SOFTWARE) | \
+ (1UL << SOCK_TIMESTAMPING_SOFTWARE) | \
+ (1UL << SOCK_TIMESTAMPING_RAW_HARDWARE) | \
+ (1UL << SOCK_TIMESTAMPING_SYS_HARDWARE))
+
+ if (sk->sk_flags & FLAGS_TS_OR_DROPS)
+ __sock_recv_ts_and_drops(msg, sk, skb);
+ else
+ sk->sk_stamp = skb->tstamp;
+}
/**
* sock_tx_timestamp - checks whether the outgoing packet is to be time stamped
diff --git a/net/socket.c b/net/socket.c
index 9822081..cb7c1f6 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -655,13 +655,13 @@ inline void sock_recv_drops(struct msghdr *msg, struct sock *sk, struct sk_buff
sizeof(__u32), &skb->dropcount);
}
-void sock_recv_ts_and_drops(struct msghdr *msg, struct sock *sk,
+void __sock_recv_ts_and_drops(struct msghdr *msg, struct sock *sk,
struct sk_buff *skb)
{
sock_recv_timestamp(msg, sk, skb);
sock_recv_drops(msg, sk, skb);
}
-EXPORT_SYMBOL_GPL(sock_recv_ts_and_drops);
+EXPORT_SYMBOL_GPL(__sock_recv_ts_and_drops);
static inline int __sock_recvmsg_nosec(struct kiocb *iocb, struct socket *sock,
struct msghdr *msg, size_t size, int flags)
* pull request: wireless-2.6 2010-04-30
From: John W. Linville @ 2010-04-30 18:08 UTC (permalink / raw)
To: davem-fT/PcQaiUtIeIZ0/mPfg9Q
Cc: linux-wireless-u79uwXL29TY76Z2rM5mHXA,
netdev-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA
Dave,
One more for 2.6.34...it avoids some DMA mapping-related failures.
Please let me know if there are problems!
Thanks,
John
---
The following changes since commit 03f80cc3f24e1dcdbdba081ed5daf5575aac6180:
Sebastian Siewior (1):
net/sb1250: register mdio bus in probe
are available in the git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-2.6.git master
Hans de Goede (1):
p54pci: fix bugs in p54p_check_tx_ring
drivers/net/wireless/p54/p54pci.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/drivers/net/wireless/p54/p54pci.c b/drivers/net/wireless/p54/p54pci.c
index ed4bdff..21f673d 100644
--- a/drivers/net/wireless/p54/p54pci.c
+++ b/drivers/net/wireless/p54/p54pci.c
@@ -245,7 +245,7 @@ static void p54p_check_tx_ring(struct ieee80211_hw *dev, u32 *index,
u32 idx, i;
i = (*index) % ring_limit;
- (*index) = idx = le32_to_cpu(ring_control->device_idx[1]);
+ (*index) = idx = le32_to_cpu(ring_control->device_idx[ring_index]);
idx %= ring_limit;
while (i != idx) {
--
John W. Linville Someday the world will need a hero, and you
linville-2XuSBdqkA4R54TAoqtyWWQ@public.gmane.org might be all we have. Be ready.
* Re: [PATCH 0/3] [RFC] ptp: IEEE 1588 clock support
From: Wolfgang Grandegger @ 2010-04-29 9:24 UTC (permalink / raw)
To: Richard Cochran; +Cc: netdev
In-Reply-To: <20100429083833.GA4629@riccoc20.at.omicron.at>
Richard Cochran wrote:
> On Wed, Apr 28, 2010 at 04:31:35PM +0200, Wolfgang Grandegger wrote:
>> That's because some 1588_PPS related bits are not yet setup in the
>> platform code of mainline kernel.
>
> Just remembered, I am carrying along the following patch to fix the
> wrong mainline code for the mpc8313. Really annoying.
OK.
> Richard
>
>>From 4306b6f89e5565928b4462fd8cff19a3e484f1c4 Mon Sep 17 00:00:00 2001
> From: Richard Cochran <richard.cochran@omicron.at>
> Date: Tue, 6 Apr 2010 13:36:32 +0200
> Subject: [PATCH] mpc8313: fixed the board support for REV C
>
> ---
> arch/powerpc/boot/dts/mpc8313erdb.dts | 56 ++++++++++++++++++++++------
> arch/powerpc/platforms/83xx/mpc831x_rdb.c | 15 ++++++++
> 2 files changed, 59 insertions(+), 12 deletions(-)
>
> diff --git a/arch/powerpc/boot/dts/mpc8313erdb.dts b/arch/powerpc/boot/dts/mpc8313erdb.dts
> index 761faa7..183f2aa 100644
> --- a/arch/powerpc/boot/dts/mpc8313erdb.dts
> +++ b/arch/powerpc/boot/dts/mpc8313erdb.dts
> @@ -70,6 +70,26 @@
> reg = <0x0 0x0 0x800000>;
> bank-width = <2>;
> device-width = <1>;
> + partition@0 {
> + label = "U-Boot";
> + reg = <0x00000000 0x00100000>;
> + };
> + partition@100000 {
> + label = "kernel";
> + reg = <0x00100000 0x00200000>;
> + };
> + partition@300000 {
> + label = "rootfs";
> + reg = <0x00300000 0x00400000>;
> + };
> + partition@700000 {
> + label = "DTB";
> + reg = <0x00700000 0x00010000>;
> + };
> + partition@710000 {
> + label = "vsc-util";
> + reg = <0x00710000 0x000F0000>;
> + };
> };
>
> nand@1,0 {
> @@ -78,19 +98,31 @@
> compatible = "fsl,mpc8313-fcm-nand",
> "fsl,elbc-fcm-nand";
> reg = <0x1 0x0 0x2000>;
> -
> - u-boot@0 {
> - reg = <0x0 0x100000>;
> - read-only;
> + partition@0 {
> + label = "U-Boot-NAND";
> + reg = <0x00000000 0x00100000>;
> };
> -
> - kernel@100000 {
> - reg = <0x100000 0x300000>;
> + partition@100000 {
> + label = "JFFS2-NAND";
> + reg = <0x00100000 0x00800000>;
> };
> -
> - fs@400000 {
> - reg = <0x400000 0x1c00000>;
> + partition@900000 {
> + label = "Ramdisk-NAND";
> + reg = <0x00900000 0x00400000>;
> + };
> + partition@d00000 {
> + label = "Reserve-NAND";
> + reg = <0x00d00000 0x01000000>;
> };
> + partition@1d00000 {
> + label = "Kernel-NAND";
> + reg = <0x01d00000 0x00200000>;
> + };
> + partition@1f00000 {
> + label = "DTB-NAND";
> + reg = <0x01f00000 0x00100000>;
> + };
> +
> };
> };
>
> @@ -188,7 +220,7 @@
> compatible = "gianfar";
> reg = <0x24000 0x1000>;
> local-mac-address = [ 00 00 00 00 00 00 ];
> - interrupts = <37 0x8 36 0x8 35 0x8>;
> + interrupts = <32 0x8 33 0x8 34 0x8>;
> interrupt-parent = <&ipic>;
> tbi-handle = < &tbi0 >;
> /* Vitesse 7385 isn't on the MDIO bus */
> @@ -223,7 +255,7 @@
> reg = <0x25000 0x1000>;
> ranges = <0x0 0x25000 0x1000>;
> local-mac-address = [ 00 00 00 00 00 00 ];
> - interrupts = <34 0x8 33 0x8 32 0x8>;
> + interrupts = <35 0x8 36 0x8 37 0x8>;
I used these interrupt number fixes as well, but they were not necessary for
the current net-next-2.6 tree. I need to check why; I remember some
version-dependent re-mapping code.
> interrupt-parent = <&ipic>;
> tbi-handle = < &tbi1 >;
> phy-handle = < &phy4 >;
> diff --git a/arch/powerpc/platforms/83xx/mpc831x_rdb.c b/arch/powerpc/platforms/83xx/mpc831x_rdb.c
> index 0b4f883..7f80269 100644
> --- a/arch/powerpc/platforms/83xx/mpc831x_rdb.c
> +++ b/arch/powerpc/platforms/83xx/mpc831x_rdb.c
> @@ -20,6 +20,7 @@
> #include <asm/ipic.h>
> #include <asm/udbg.h>
> #include <sysdev/fsl_pci.h>
> +#include <sysdev/fsl_soc.h>
>
> #include "mpc83xx.h"
>
> @@ -31,6 +32,8 @@ static void __init mpc831x_rdb_setup_arch(void)
> #ifdef CONFIG_PCI
> struct device_node *np;
> #endif
> + void __iomem *immap;
> + unsigned long spcr, sicrh;
>
> if (ppc_md.progress)
> ppc_md.progress("mpc831x_rdb_setup_arch()", 0);
> @@ -42,6 +45,18 @@ static void __init mpc831x_rdb_setup_arch(void)
> mpc83xx_add_bridge(np);
> #endif
> mpc831x_usb_cfg();
> +
> +#define MPC83XX_SPCR_OFFS 0x110
> +#define MPC8313_SPCR_1588_PPS 0x00004000
> +#define MPC8313_SICRH_1588_PPS 0x01000000
> +
> + immap = ioremap(get_immrbase(), 0x1000);
> + spcr = in_be32(immap + MPC83XX_SPCR_OFFS);
> + sicrh = in_be32(immap + MPC83XX_SICRH_OFFS);
> + sicrh |= MPC8313_SICRH_1588_PPS;
> + out_be32(immap + MPC83XX_SICRH_OFFS, sicrh);
> + spcr |= MPC8313_SPCR_1588_PPS;
> + out_be32(immap + MPC83XX_SPCR_OFFS, spcr);
> }
That's what is missing to get the PPS signal output. But it should probably
go into gianfar_ptp.c.
Wolfgang.
* Re: [PATCH 0/3] [RFC] ptp: IEEE 1588 clock support
From: Wolfgang Grandegger @ 2010-04-29 12:02 UTC (permalink / raw)
To: Richard Cochran; +Cc: netdev
In-Reply-To: <20100427091344.GA5086@riccoc20.at.omicron.at>
Richard Cochran wrote:
> Now and again there has been some talk on this list of adding PTP
> support into Linux. One part of the picture is already in place, the
> SO_TIMESTAMPING API for hardware time stamping. It has been pointed
> out that this API is not perfect, however, it is good enough for many
> real world uses of IEEE 1588. The second needed part has not, AFAICT,
> ever been addressed.
>
> Here I offer an early draft of an idea how to bring the missing
> functionality into Linux. I don't yet have all of the features
> implemented, as described below. Still I would like to get your
> feedback concerning this idea before getting too far into it. I do
> have all of the hardware mentioned at hand, so I have a good idea that
> the proposed API covers the features of those clocks.
>
> Thanks in advance for your comments,
>
> Richard
>
> * PTP infrastructure for Linux
>
> This patch set introduces support for IEEE 1588 PTP clocks in
> Linux. Together with the SO_TIMESTAMPING socket options, this
> presents a standardized method for developing PTP user space programs,
> synchronizing Linux with external clocks, and using the ancillary
> features of PTP hardware clocks.
>
> A new class driver exports a kernel interface for specific clock
> drivers and a user space interface. The infrastructure supports a
> complete set of PTP functionality.
>
> + Basic clock operations
> - Set time
> - Get time
> - Shift the clock by a given offset atomically
> - Adjust clock frequency
>
> + Ancillary clock features
> - One-shot or periodic alarms, with signal delivery to user program
> - Time stamp external events
> - Periodic output signals configurable from user space
> - Synchronization of the Linux system time via the PPS subsystem
>
> ** PTP kernel API
>
> A PTP clock driver registers itself with the class driver. The
> class driver handles all of the dealings with user space. The
> author of a clock driver need only implement the details of
> programming the clock hardware. The clock driver notifies the class
> driver of asynchronous events (alarms and external time stamps) via
> a simple message passing interface.
>
> The class driver supports multiple PTP clock drivers. In normal use
> cases, only one PTP clock is needed. However, for testing and
> development, it can be useful to have more than one clock in a
> single system, in order to allow performance comparisons.
>
> ** PTP user space API
>
> The class driver creates a character device for each registered PTP
> clock. User space programs may control the clock via standardized
> ioctls. A program may query, enable, configure, and disable the
> ancillary clock features. User space can receive time stamped
> events via blocking read() and poll(). One shot and periodic
> signals may be configured via an ioctl API with similar semantics
> to the POSIX timer_settime() system call.
>
> ** Supported hardware
>
> + Standard Linux system timer
> - No special PTP features
> - For use with software time stamping
>
> + Freescale eTSEC gianfar
> - 2 Time stamp external triggers, programmable polarity (opt. interrupt)
> - 2 Alarm registers (optional interrupt)
> - 3 Periodic signals (optional interrupt)
>
> + National DP83640
> - 6 GPIOs programmable as inputs or outputs
> - 6 GPIOs with dedicated functions (LED/JTAG/clock) can also be
> used as general inputs or outputs
> - GPIO inputs can time stamp external triggers
> - GPIO outputs can produce periodic signals
> - 1 interrupt pin
>
> + Intel IXP465
> - Auxiliary Slave/Master Mode Snapshot (optional interrupt)
> - Target Time (optional interrupt)
I noticed that two other netdev drivers already support PTP timestamping:
igb and bfin_mac. From the PTP developer's point of view, the interface
looks rather complete to me and it works fine on my MPC8313 setup. The
only thing I stumbled over was that PTP clock registration failed when
PTP support was statically linked into the kernel.
Thanks,
Wolfgang.
* Re: ixgbe and mac-vlans problem
From: Ben Greear @ 2010-04-30 18:09 UTC (permalink / raw)
To: Arnd Bergmann; +Cc: NetDev, Patrick McHardy
In-Reply-To: <201004302000.58763.arnd@arndb.de>
On 04/30/2010 11:00 AM, Arnd Bergmann wrote:
> On Friday 30 April 2010 00:27:39 Ben Greear wrote:
>> Basically, we create 50 mac-vlans, with sequential MAC addresses and sequential
>> IP addresses, and set up ip rules properly.
>>
>> The issue is that only 10 or so of the mac-vlans receive other than
>> broadcast packets. The ixgbe NIC doesn't show PROMISC mode.
>
> I just took a brief look at the driver and noticed that 82599 should
> be able to handle 128 entries before going into promisc mode, while
> 82598 (the same driver) does 16.
>
> Maybe the logic for >16 entries is wrong, so you could try forcing
> hw->mac.num_rar_entries to 16 for 82599 as well.
I think I was actually on an 82598 system when I saw it yesterday,
but I have seen similar issues on 82599, though I didn't take time
to debug it fully, so it could have been something else.
I will double-check the NIC chipset on the system that showed the
problem yesterday.
Thanks,
Ben
--
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc http://www.candelatech.com
* Re: [PATCH net-next-2.6] net: speedup udp receive path
From: jamal @ 2010-04-29 13:17 UTC (permalink / raw)
To: Eric Dumazet
Cc: Changli Gao, David Miller, therbert, shemminger, netdev,
Eilon Greenstein, Brian Bloniarz
In-Reply-To: <1272545108.2222.65.camel@edumazet-laptop>
On Thu, 2010-04-29 at 14:45 +0200, Eric Dumazet wrote:
>
> Changli, I wonder how you can cook "performance" patches without testing
> them at all for real... This cannot be true ?
Eric, I am with you; however, you are in the minority of people who test
and produce numbers ;-> The system rewards people for sending patches,
not much for anything else - so I can't blame Changli ;->
> When the cpu doing the device softirq is flooded, it handles 300 packets
> per net_rx_action() round (netdev_budget), so sends at most 6 ipis per
> 300 packets, with or without my patch, with or without your patch as
> well.
>
> (At most because if remote cpus are flooded as well, they dont
> napi_complete so no IPI needed at all)
>
> (My patch had an effect only on normal load, ie one packet received in a
> while... up to 50.000 pps I would say). And it also has a nice effect on
> non RPS loads (mostly the more typical load for following years).
> If a second packet comes 3us after the first one, and before 2nd CPU
> handled it, we _can_ afford an extra IPI.
>
> 750.000/50 = 15.000 IPI per second.
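The arithmetic in the quoted estimate can be reproduced with a few lines. The figures (300 packets per net_rx_action() round, at most 6 IPIs per round, 750,000 packets per second) are the ones from Eric's message; the function name is made up, and the result is an upper bound under those figures, not a measurement.

```python
def max_ipi_rate(pps, packets_per_round=300, ipis_per_round=6):
    """Upper bound on IPIs per second when the softirq cpu is flooded."""
    packets_per_ipi = packets_per_round // ipis_per_round  # 300 / 6 = 50
    return pps // packets_per_ipi


print(max_ipi_rate(750_000))  # 750000 / 50
```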
Could we have some stat in there that shows IPIs being produced? I think
it would help to at least observe any changes over a variety of tests.
I did try to patch my system during the first few tests to record IPIs
but it seems to make more sense to have it as a perf stat.
> Even with 200.000 IPI per second, 'perf top -C CPU_IPI_sender' shows
> that sending IPI is very cheap (maybe ~1% of cpu cycles)
>
> # Samples: 32033467127
> #
One thing i observed is our profiles seem different. Could you send me
your .config for a single nehalem and i will try to go as close as
possible to it? I have a sky2 instead of bnx - but i suspect everything
else will be very similar...
I apologize, I don't have much time to look into details - but what I can
do is test at least.
cheers,
jamal
* Re: [PATCH net-next-2.6] net: speedup udp receive path
From: Changli Gao @ 2010-04-29 23:07 UTC (permalink / raw)
To: Eric Dumazet
Cc: hadi, David Miller, therbert, shemminger, netdev,
Eilon Greenstein, Brian Bloniarz
In-Reply-To: <1272545108.2222.65.camel@edumazet-laptop>
On Thu, Apr 29, 2010 at 8:45 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> Changli, I wonder how you can cook "performance" patches without testing
> them at all for real... This cannot be true ?
>
I am sorry. I wasn't against your patch; I just wanted to
understand the test results from jamal. It was my fault to submit a
performance patch without testing it. I should not rely on code
inspection alone for a performance patch.
--
Regards,
Changli Gao(xiaosuo@gmail.com)
* Re: OFT - reserving CPU's for networking
From: Brian Bloniarz @ 2010-04-30 18:15 UTC (permalink / raw)
To: Eric Dumazet
Cc: Thomas Gleixner, Stephen Hemminger, netdev, Andi Kleen,
Peter Zijlstra
In-Reply-To: <1272571339.2209.76.camel@edumazet-laptop>
Eric Dumazet wrote:
> On Thursday, 29 April 2010 at 21:19 +0200, Thomas Gleixner wrote:
>
>> Say thanks to Intel/AMD for providing us timers which stop in lower
>> c-states.
>>
>> Not much we can do about the broadcast lock when several cores are
>> going idle and we need to setup a global timer to work around the
>> lapic timer stops in C2/C3 issue.
>>
>> Simply the C-state timer broadcasting does not scale. And it was never
>> meant to scale. It's a workaround for laptops to have functional NOHZ.
>>
>> There are several ways to work around that on larger machines:
>>
>> - Restrict c-states
>> - Disable NOHZ and highres timers
>> - idle=poll is definitely the worst of all possible solutions
>>
>>> I keep getting asked about taking some core's away from clock and scheduler
>>> to be reserved just for network processing. Seeing this kind of stuff
>>> makes me wonder if maybe that isn't a half bad idea.
>> This comes up every few months and we pointed out several times what
>> needs to be done to make this work w/o these weird hacks which put a
>> core offline and then start some magic undebuggable binary blob on it.
>> We have not seen anyone working on this, but the "set cores aside and
>> let them do X" idea seems to stick in people's heads.
>>
>> Seriously, that's not a solution. It's going to be some hacked up
>> nightmare which is completely unmaintainable.
>>
>> Aside of that I seriously doubt that you can do networking w/o time
>> and timers.
>>
>
> Thanks a lot !
>
> booting with processor.max_cstate=1 solves the problem
>
> (I already had a CONFIG_NO_HZ=no conf, but highres timer enabled)
>
> Even with a _carefully_ chosen crazy configuration (receiving a packet on a
> cpu, then transferring it to another cpu, with a full 16x16 matrix
> involved), generating 700.000 IPI per second on the machine seems fine
> now.
FYI you can also restrict C-states at runtime with PM QoS:
Documentation/power/pm_qos_interface.txt
On my machine, /sys/devices/system/cpu/cpu0/cpuidle/state2/latency
is 205usec, so configuring a PM QoS request for <= 205usec latency
should prevent it being entered:
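#!/usr/bin/python
import os
import signal
import struct

# Requested maximum latency in usec, chosen below the 205usec of state2.
latency_req_usec = 100

# The request stays active only while this file descriptor is open.
f = os.open("/dev/cpu_dma_latency", os.O_WRONLY)
os.write(f, struct.pack("=i", latency_req_usec))
signal.pause()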
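To pick a request value for a given machine, the per-state latencies can be listed from the same sysfs directory mentioned above. The sketch below is a hedged illustration: the helper names are made up, and states_allowed() assumes the idle governor skips any state whose exit latency exceeds the request, which is how the behaviour is described here rather than something verified against a particular kernel version.

```python
import glob
import os


def read_cpuidle_latencies(cpuidle_dir="/sys/devices/system/cpu/cpu0/cpuidle"):
    """Return {state directory name: exit latency in usec} for one cpu."""
    latencies = {}
    pattern = os.path.join(cpuidle_dir, "state[0-9]*")
    for state_dir in sorted(glob.glob(pattern)):
        with open(os.path.join(state_dir, "latency")) as f:
            latencies[os.path.basename(state_dir)] = int(f.read().strip())
    return latencies


def states_allowed(latencies, request_usec):
    """States still usable under the assumed rule: latency <= request."""
    return sorted(name for name, lat in latencies.items()
                  if lat <= request_usec)
```

With a state2 latency of 205usec, a request of 100usec leaves only the shallower states in the returned list.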
* avoid compiler warning when !CONFIG_SYSCTL in netfilter dccp
From: Mathieu Lacage @ 2010-04-29 9:25 UTC (permalink / raw)
To: netdev
[-- Attachment #1: Type: text/plain, Size: 333 bytes --]
The attached trivial patch (generated against davem/net-next-2.6 as of
this morning) avoids this compiler warning:
../src/process-manager/linux/net/netfilter/nf_conntrack_proto_dccp.c: In
function ‘dccp_net_exit’:
../src/process-manager/linux/net/netfilter/nf_conntrack_proto_dccp.c:845: error: unused variable ‘dn’
Mathieu
[-- Attachment #2: nf-dccp-sysctl.patch --]
[-- Type: text/x-patch, Size: 541 bytes --]
diff --git a/net/netfilter/nf_conntrack_proto_dccp.c b/net/netfilter/nf_conntrack_proto_dccp.c
index 5292560..cd078df 100644
--- a/net/netfilter/nf_conntrack_proto_dccp.c
+++ b/net/netfilter/nf_conntrack_proto_dccp.c
@@ -842,8 +842,8 @@ static __net_init int dccp_net_init(struct net *net)
static __net_exit void dccp_net_exit(struct net *net)
{
- struct dccp_net *dn = dccp_pernet(net);
#ifdef CONFIG_SYSCTL
+ struct dccp_net *dn = dccp_pernet(net);
unregister_net_sysctl_table(dn->sysctl_header);
kfree(dn->sysctl_table);
#endif
* r8169 INFO: inconsistent lock state
From: Sergey Senozhatsky @ 2010-04-30 18:20 UTC (permalink / raw)
To: Eric Dumazet
Cc: Oleg Nesterov, David Miller, Ingo Molnar, Francois Romieu,
Peter Zijlstra, netdev, linux-kernel
[-- Attachment #1: Type: text/plain, Size: 6536 bytes --]
Hello,
Yet another one (during resume):
kernel: [ 1968.334646]
kernel: [ 1968.334648] =================================
kernel: [ 1968.334651] [ INFO: inconsistent lock state ]
kernel: [ 1968.334654] 2.6.34-rc6-dbg #105
kernel: [ 1968.334656] ---------------------------------
kernel: [ 1968.334659] inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
kernel: [ 1968.334663] events/1/3854 [HC0[0]:SC0[0]:HE1:SE1] takes:
kernel: [ 1968.334666] (&(&table->hash[i].lock)->rlock){+.?...}, at: [<c1292ec4>] __udp4_lib_mcast_deliver+0x3c/0x143
kernel: [ 1968.334678] {IN-SOFTIRQ-W} state was registered at:
kernel: [ 1968.334681] [<c104fc8d>] __lock_acquire+0x2ba/0xc01
kernel: [ 1968.334688] [<c10509df>] lock_acquire+0x5e/0x75
kernel: [ 1968.334693] [<c12c366a>] _raw_spin_lock+0x28/0x58
kernel: [ 1968.334699] [<c1292ec4>] __udp4_lib_mcast_deliver+0x3c/0x143
kernel: [ 1968.334704] [<c12931a7>] __udp4_lib_rcv+0x1dc/0x3ac
kernel: [ 1968.334708] [<c1293389>] udp_rcv+0x12/0x14
kernel: [ 1968.334713] [<c127605f>] ip_local_deliver_finish+0xd2/0x137
kernel: [ 1968.334719] [<c12760fc>] NF_HOOK.clone.1+0x38/0x3f
kernel: [ 1968.334724] [<c1276220>] ip_local_deliver+0x3c/0x42
kernel: [ 1968.334728] [<c1275f2e>] ip_rcv_finish+0x25c/0x27e
kernel: [ 1968.334733] [<c12760fc>] NF_HOOK.clone.1+0x38/0x3f
kernel: [ 1968.334737] [<c12763c9>] ip_rcv+0x1a3/0x1c6
kernel: [ 1968.334741] [<c12593d7>] netif_receive_skb+0x38b/0x3ab
kernel: [ 1968.334747] [<fd20f911>] rtl8169_rx_interrupt+0x2de/0x3eb [r8169]
kernel: [ 1968.334756] [<fd211cde>] rtl8169_poll+0x28/0x15d [r8169]
kernel: [ 1968.334763] [<c12596b3>] net_rx_action+0x93/0x181
kernel: [ 1968.334767] [<c1032a72>] __do_softirq+0x88/0x10c
kernel: [ 1968.334773] [<c1032b25>] do_softirq+0x2f/0x47
kernel: [ 1968.334778] [<c1032de2>] irq_exit+0x38/0x75
kernel: [ 1968.334782] [<c1004489>] do_IRQ+0x79/0x8d
kernel: [ 1968.334787] [<c1002db5>] common_interrupt+0x35/0x3c
kernel: [ 1968.334791] [<c1246f43>] cpuidle_idle_call+0x6a/0xa0
kernel: [ 1968.334799] [<c100171b>] cpu_idle+0x89/0xbe
kernel: [ 1968.334802] [<c12b3d49>] rest_init+0xd1/0xd6
kernel: [ 1968.334807] [<c147e7bd>] start_kernel+0x339/0x33e
kernel: [ 1968.334813] [<c147e0c9>] i386_start_kernel+0xc9/0xd0
kernel: [ 1968.334818] irq event stamp: 63
kernel: [ 1968.334820] hardirqs last enabled at (63): [<c109d7ff>] kmem_cache_free+0x83/0x8f
kernel: [ 1968.334828] hardirqs last disabled at (62): [<c109d7a6>] kmem_cache_free+0x2a/0x8f
kernel: [ 1968.334833] softirqs last enabled at (60): [<c126400a>] rcu_read_unlock_bh+0x1c/0x1e
kernel: [ 1968.334839] softirqs last disabled at (58): [<c1263faf>] rcu_read_lock_bh+0x8/0x26
kernel: [ 1968.334845]
kernel: [ 1968.334846] other info that might help us debug this:
kernel: [ 1968.334849] 5 locks held by events/1/3854:
kernel: [ 1968.334851] #0: (events){+.+.+.}, at: [<c103c8e9>] worker_thread+0x128/0x23c
kernel: [ 1968.334859] #1: ((&(&tp->task)->work)){+.+...}, at: [<c103c8e9>] worker_thread+0x128/0x23c
kernel: [ 1968.334865] #2: (rtnl_mutex){+.+.+.}, at: [<c1262b8f>] rtnl_lock+0xf/0x11
kernel: [ 1968.334871] #3: (rcu_read_lock){.+.+..}, at: [<c125784b>] rcu_read_lock+0x0/0x2b
kernel: [ 1968.334877] #4: (rcu_read_lock){.+.+..}, at: [<c1275c56>] rcu_read_lock+0x0/0x2b
kernel: [ 1968.334884]
kernel: [ 1968.334885] stack backtrace:
kernel: [ 1968.334888] Pid: 3854, comm: events/1 Not tainted 2.6.34-rc6-dbg #105
kernel: [ 1968.334891] Call Trace:
kernel: [ 1968.334895] [<c12c1906>] ? printk+0xf/0x11
kernel: [ 1968.334901] [<c104e7d9>] valid_state+0x133/0x141
kernel: [ 1968.334906] [<c104e8b6>] mark_lock+0xcf/0x1bc
kernel: [ 1968.334911] [<c104e11f>] ? check_usage_backwards+0x0/0x72
kernel: [ 1968.334915] [<c104fcff>] __lock_acquire+0x32c/0xc01
kernel: [ 1968.334922] [<c129ee2d>] ? fib_table_lookup+0x81/0x8e
kernel: [ 1968.334927] [<c100772e>] ? __cycles_2_ns+0xf/0x3e
kernel: [ 1968.334932] [<c12671b6>] ? rcu_read_unlock+0x0/0x38
kernel: [ 1968.334937] [<c1007a30>] ? native_sched_clock+0x49/0x4f
kernel: [ 1968.334943] [<c10443a9>] ? sched_clock_local+0x11/0x11f
kernel: [ 1968.334948] [<c10509df>] lock_acquire+0x5e/0x75
kernel: [ 1968.334953] [<c1292ec4>] ? __udp4_lib_mcast_deliver+0x3c/0x143
kernel: [ 1968.334958] [<c12c366a>] _raw_spin_lock+0x28/0x58
kernel: [ 1968.334963] [<c1292ec4>] ? __udp4_lib_mcast_deliver+0x3c/0x143
kernel: [ 1968.334967] [<c1292ec4>] __udp4_lib_mcast_deliver+0x3c/0x143
kernel: [ 1968.334973] [<c104463c>] ? sched_clock_cpu+0x121/0x131
kernel: [ 1968.334978] [<c12735b5>] ? rcu_read_unlock+0x0/0x38
kernel: [ 1968.334983] [<c104463c>] ? sched_clock_cpu+0x121/0x131
kernel: [ 1968.334988] [<c10505c5>] ? __lock_acquire+0xbf2/0xc01
kernel: [ 1968.334994] [<c12735e2>] ? rcu_read_unlock+0x2d/0x38
kernel: [ 1968.334998] [<c1274034>] ? ip_route_input+0x101/0xaf4
kernel: [ 1968.335003] [<c12931a7>] __udp4_lib_rcv+0x1dc/0x3ac
kernel: [ 1968.335008] [<c1293389>] udp_rcv+0x12/0x14
kernel: [ 1968.335013] [<c127605f>] ip_local_deliver_finish+0xd2/0x137
kernel: [ 1968.335017] [<c1275f8d>] ? ip_local_deliver_finish+0x0/0x137
kernel: [ 1968.335022] [<c12760fc>] NF_HOOK.clone.1+0x38/0x3f
kernel: [ 1968.335026] [<c1276220>] ip_local_deliver+0x3c/0x42
kernel: [ 1968.335031] [<c1275f8d>] ? ip_local_deliver_finish+0x0/0x137
kernel: [ 1968.335035] [<c1275f2e>] ip_rcv_finish+0x25c/0x27e
kernel: [ 1968.335040] [<c1275cd2>] ? ip_rcv_finish+0x0/0x27e
kernel: [ 1968.335044] [<c12760fc>] NF_HOOK.clone.1+0x38/0x3f
kernel: [ 1968.335048] [<c12763c9>] ip_rcv+0x1a3/0x1c6
kernel: [ 1968.335052] [<c1275cd2>] ? ip_rcv_finish+0x0/0x27e
kernel: [ 1968.335057] [<c12593d7>] netif_receive_skb+0x38b/0x3ab
kernel: [ 1968.335066] [<fd20f911>] rtl8169_rx_interrupt+0x2de/0x3eb [r8169]
kernel: [ 1968.335073] [<fd20fc9b>] rtl8169_reset_task+0x33/0xe8 [r8169]
kernel: [ 1968.335077] [<c103c92b>] worker_thread+0x16a/0x23c
kernel: [ 1968.335082] [<c103c8e9>] ? worker_thread+0x128/0x23c
kernel: [ 1968.335088] [<fd20fc68>] ? rtl8169_reset_task+0x0/0xe8 [r8169]
kernel: [ 1968.335095] [<c103fa46>] ? autoremove_wake_function+0x0/0x2f
kernel: [ 1968.335099] [<c103c7c1>] ? worker_thread+0x0/0x23c
kernel: [ 1968.335103] [<c103f76a>] kthread+0x6a/0x6f
kernel: [ 1968.335108] [<c103f700>] ? kthread+0x0/0x6f
kernel: [ 1968.335112] [<c1002dc2>] kernel_thread_helper+0x6/0x10
kernel: [ 1968.335282] r8169 0000:02:00.0: eth0: link down
Sergey
* Re: [PATCH] [RFC] C/R: inet4 and inet6 unicast routes (v2)
From: Serge E. Hallyn @ 2010-04-30 18:19 UTC (permalink / raw)
To: Dan Smith; +Cc: containers, Vlad Yasevich, netdev, David Miller
In-Reply-To: <1272646855-17327-1-git-send-email-danms@us.ibm.com>
Quoting Dan Smith (danms@us.ibm.com):
> +static int temp_netns_enter(struct net *net)
> +{
> + int ret;
> + struct net *tmp_netns;
> +
> + ret = copy_namespaces(CLONE_NEWNET, current);
> + if (ret)
> + return ret;
Actually there is one problem here: copy_namespaces() is
used only by clone(), and it expects tsk not to be live yet,
so it just does
tsk->nsproxy = new_ns
Since you're doing this on current, which is live, it would
have to use rcu_assign_pointer() to be safe.
So I'm afraid you're going to have to do a slightly uglier
thing, where you call unshare_nsproxy_namespaces() and then
switch_task_namespaces() to the new nsproxy.
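To illustrate why a plain store is not enough on a live task, here is a minimal userspace sketch using C11 atomics. It is only a model under stated assumptions: struct task_model, struct nsproxy_model, switch_namespaces_model() and read_net_ns_id() are hypothetical names, not kernel code, and the atomic exchange and acquire load merely stand in for rcu_assign_pointer() and rcu_dereference().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; these are not the kernel structures. */
struct nsproxy_model {
	int net_ns_id;		/* stands in for the net_ns pointer */
};

struct task_model {
	_Atomic(struct nsproxy_model *) nsproxy;
};

/*
 * Publish a fully initialised nsproxy on a task that other threads may be
 * reading. The exchange has release semantics, so the initialisation of
 * *new_ns is ordered before the pointer becomes visible; that is the role
 * rcu_assign_pointer() plays in the kernel. The previous pointer is
 * returned so the caller can switch back to it later.
 */
static struct nsproxy_model *
switch_namespaces_model(struct task_model *tsk, struct nsproxy_model *new_ns)
{
	assert(tsk != NULL && new_ns != NULL);
	return atomic_exchange_explicit(&tsk->nsproxy, new_ns,
					memory_order_acq_rel);
}

/* Reader side: the acquire load stands in for rcu_dereference(). */
static int read_net_ns_id(struct task_model *tsk)
{
	struct nsproxy_model *ns =
		atomic_load_explicit(&tsk->nsproxy, memory_order_acquire);

	return ns ? ns->net_ns_id : -1;
}
```

A caller would keep the pointer returned by the first switch and hand it back to a second call on exit, which is the shape of the enter/exit pair in the patch.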
> +
> + tmp_netns = current->nsproxy->net_ns;
> + get_net(net);
> + current->nsproxy->net_ns = net;
> + put_net(tmp_netns);
> +
> + return 0;
> +}
Otherwise it looks good to me. My only other comment would be to soothe
readers' anxieties by putting a comment right here explaining that
switch_task_namespaces() will drop your ref to current->nsproxy->net_ns,
and that you had never dropped the ref to prev so it will be safe.
> +static void temp_netns_exit(struct nsproxy *prev)
> +{
> + switch_task_namespaces(current, prev);
> +}
thanks,
-serge
* [PATCH net-next-2.6] net: sock_def_readable() and friends RCU conversion
From: Eric Dumazet @ 2010-04-29 21:01 UTC (permalink / raw)
To: hadi
Cc: Changli Gao, David Miller, therbert, shemminger, netdev,
Eilon Greenstein, Brian Bloniarz
In-Reply-To: <1272573383.3969.8.camel@bigi>
On Thursday, 29 April 2010 at 16:36 -0400, jamal wrote:
> Results attached.
> With your app rps does a hell lot better and non-rps worse ;->
> With my proggie, non-rps does much better than yours and rps does
> a lot worse for the same setup. I see the scheduler kicking in quite a bit in
> non-rps for you...
>
> The main difference between us as i see it is:
> a) i use epoll - actually linked to libevent (1.0.something)
> b) I fork processes and you use pthreads.
>
> I dont have time to chase it today, but 1) I am either going to change
> yours to use libevent or make mine get rid of it then 2) move towards
> pthreads or have yours fork..
> then observe if that makes any difference..
>
Thanks !
Here is the last 'patch of the day' from me ;)
The next one will coalesce wakeup calls (they'll be delayed until
the end of net_rx_action(), like a patch I did last year to help
multicast reception).
vger seems to be down, so I suspect I'll have to resend it later.
[PATCH net-next-2.6] net: sock_def_readable() and friends RCU conversion
The sk_callback_lock rwlock actually protects the sk->sk_sleep pointer, so we
need two atomic operations (read_lock() and read_unlock(), plus the
associated cache line dirtying) per incoming packet.
An RCU conversion is pretty much needed:
1) Add a new structure, called "struct socket_wq", to hold all fields
that will need rcu_read_lock() protection (currently: a
wait_queue_head_t and a struct fasync_struct pointer).
[A future patch will add a list anchor for wakeup coalescing]
2) Attach one such structure to each "struct socket" created in
sock_alloc_inode().
3) Respect the RCU grace period when freeing a "struct socket_wq".
4) Replace the sk_sleep pointer in "struct sock" with sk_wq, a pointer to
"struct socket_wq".
5) Change the sk_sleep() function to use the new sk->sk_wq instead of
sk->sk_sleep.
6) Replace sk_has_sleeper() with wq_has_sleeper(), which must be used inside
an rcu_read_lock() section.
7) Change all sk_has_sleeper() callers to:
- Use rcu_read_lock() instead of read_lock(&sk->sk_callback_lock)
- Use wq_has_sleeper() to wake up tasks, if there are any.
- Use rcu_read_unlock() instead of read_unlock(&sk->sk_callback_lock)
8) sock_wake_async() is modified to use RCU protection as well.
9) Exceptions:
macvtap, drivers/net/tun.c and af_unix use an embedded "struct socket_wq"
instead of dynamically allocated ones. They don't need RCU freeing.
Some cleanups or follow-ups are probably needed (a possible
sk_callback_lock conversion to a spinlock, for example).
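The waker-side pattern from steps 6) and 7) can be sketched in userspace as follows. This is a simplified model, not the kernel code: struct socket_wq_model, struct sock_model and the *_model() functions are hypothetical names, a waiter counter stands in for the wait queue, and an acquire load plus a sequentially consistent fence stand in for rcu_dereference() and smp_mb().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct socket_wq. */
struct socket_wq_model {
	atomic_int nr_waiters;	/* stands in for the wait queue contents */
	int wakeups;		/* counts modelled wake_up calls */
};

/* Hypothetical stand-in for struct sock; sk_wq may be NULL (orphaned). */
struct sock_model {
	_Atomic(struct socket_wq_model *) sk_wq;
};

/* Models wq_has_sleeper(): full barrier, then NULL check and waiter check. */
static bool wq_has_sleeper_model(struct socket_wq_model *wq)
{
	/* Mirrors the smp_mb() that pairs with the one in sock_poll_wait(). */
	atomic_thread_fence(memory_order_seq_cst);
	return wq && atomic_load_explicit(&wq->nr_waiters,
					  memory_order_relaxed) > 0;
}

/*
 * Models the shape of sock_def_readable() after the conversion.
 * Returns true when a wakeup was issued.
 */
static bool sock_def_readable_model(struct sock_model *sk)
{
	struct socket_wq_model *wq;
	bool woke = false;

	assert(sk != NULL);
	/* In the kernel, rcu_read_lock() would open the section here. */
	wq = atomic_load_explicit(&sk->sk_wq, memory_order_acquire);
	if (wq_has_sleeper_model(wq)) {
		wq->wakeups++;	/* stands in for wake_up_interruptible_*() */
		woke = true;
	}
	/* ... and rcu_read_unlock() would close it here. */
	return woke;
}
```

The point of the model is that the reader takes no lock at all: it loads the pointer once, tolerates NULL, and only touches the wait queue when a sleeper is present.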
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
drivers/net/macvtap.c | 13 +++++++---
drivers/net/tun.c | 21 +++++++++-------
include/linux/net.h | 14 +++++++----
include/net/af_unix.h | 20 ++++++++--------
include/net/sock.h | 40 ++++++++++++++++----------------
net/atm/common.c | 22 +++++++++++------
net/core/sock.c | 50 ++++++++++++++++++++++++----------------
net/core/stream.c | 10 +++++---
net/dccp/output.c | 10 ++++----
net/iucv/af_iucv.c | 11 +++++---
net/phonet/pep.c | 8 +++---
net/phonet/socket.c | 2 -
net/rxrpc/af_rxrpc.c | 10 ++++----
net/sctp/socket.c | 2 -
net/socket.c | 47 ++++++++++++++++++++++++++++---------
net/unix/af_unix.c | 17 ++++++-------
16 files changed, 182 insertions(+), 115 deletions(-)
diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
index d97e1fd..1c4110d 100644
--- a/drivers/net/macvtap.c
+++ b/drivers/net/macvtap.c
@@ -37,6 +37,7 @@
struct macvtap_queue {
struct sock sk;
struct socket sock;
+ struct socket_wq wq;
struct macvlan_dev *vlan;
struct file *file;
unsigned int flags;
@@ -242,12 +243,15 @@ static struct rtnl_link_ops macvtap_link_ops __read_mostly = {
static void macvtap_sock_write_space(struct sock *sk)
{
+ wait_queue_head_t *wqueue;
+
if (!sock_writeable(sk) ||
!test_and_clear_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags))
return;
- if (sk_sleep(sk) && waitqueue_active(sk_sleep(sk)))
- wake_up_interruptible_poll(sk_sleep(sk), POLLOUT | POLLWRNORM | POLLWRBAND);
+ wqueue = sk_sleep(sk);
+ if (wqueue && waitqueue_active(wqueue))
+ wake_up_interruptible_poll(wqueue, POLLOUT | POLLWRNORM | POLLWRBAND);
}
static int macvtap_open(struct inode *inode, struct file *file)
@@ -272,7 +276,8 @@ static int macvtap_open(struct inode *inode, struct file *file)
if (!q)
goto out;
- init_waitqueue_head(&q->sock.wait);
+ q->sock.wq = &q->wq;
+ init_waitqueue_head(&q->wq.wait);
q->sock.type = SOCK_RAW;
q->sock.state = SS_CONNECTED;
q->sock.file = file;
@@ -308,7 +313,7 @@ static unsigned int macvtap_poll(struct file *file, poll_table * wait)
goto out;
mask = 0;
- poll_wait(file, &q->sock.wait, wait);
+ poll_wait(file, &q->wq.wait, wait);
if (!skb_queue_empty(&q->sk.sk_receive_queue))
mask |= POLLIN | POLLRDNORM;
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 20a1793..e525a6c 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -109,7 +109,7 @@ struct tun_struct {
struct tap_filter txflt;
struct socket socket;
-
+ struct socket_wq wq;
#ifdef TUN_DEBUG
int debug;
#endif
@@ -323,7 +323,7 @@ static void tun_net_uninit(struct net_device *dev)
/* Inform the methods they need to stop using the dev.
*/
if (tfile) {
- wake_up_all(&tun->socket.wait);
+ wake_up_all(&tun->wq.wait);
if (atomic_dec_and_test(&tfile->count))
__tun_detach(tun);
}
@@ -398,7 +398,7 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
/* Notify and wake up reader process */
if (tun->flags & TUN_FASYNC)
kill_fasync(&tun->fasync, SIGIO, POLL_IN);
- wake_up_interruptible_poll(&tun->socket.wait, POLLIN |
+ wake_up_interruptible_poll(&tun->wq.wait, POLLIN |
POLLRDNORM | POLLRDBAND);
return NETDEV_TX_OK;
@@ -498,7 +498,7 @@ static unsigned int tun_chr_poll(struct file *file, poll_table * wait)
DBG(KERN_INFO "%s: tun_chr_poll\n", tun->dev->name);
- poll_wait(file, &tun->socket.wait, wait);
+ poll_wait(file, &tun->wq.wait, wait);
if (!skb_queue_empty(&sk->sk_receive_queue))
mask |= POLLIN | POLLRDNORM;
@@ -773,7 +773,7 @@ static ssize_t tun_do_read(struct tun_struct *tun,
DBG(KERN_INFO "%s: tun_chr_read\n", tun->dev->name);
- add_wait_queue(&tun->socket.wait, &wait);
+ add_wait_queue(&tun->wq.wait, &wait);
while (len) {
current->state = TASK_INTERRUPTIBLE;
@@ -804,7 +804,7 @@ static ssize_t tun_do_read(struct tun_struct *tun,
}
current->state = TASK_RUNNING;
- remove_wait_queue(&tun->socket.wait, &wait);
+ remove_wait_queue(&tun->wq.wait, &wait);
return ret;
}
@@ -861,6 +861,7 @@ static struct rtnl_link_ops tun_link_ops __read_mostly = {
static void tun_sock_write_space(struct sock *sk)
{
struct tun_struct *tun;
+ wait_queue_head_t *wqueue;
if (!sock_writeable(sk))
return;
@@ -868,8 +869,9 @@ static void tun_sock_write_space(struct sock *sk)
if (!test_and_clear_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags))
return;
- if (sk_sleep(sk) && waitqueue_active(sk_sleep(sk)))
- wake_up_interruptible_sync_poll(sk_sleep(sk), POLLOUT |
+ wqueue = sk_sleep(sk);
+ if (wqueue && waitqueue_active(wqueue))
+ wake_up_interruptible_sync_poll(wqueue, POLLOUT |
POLLWRNORM | POLLWRBAND);
tun = tun_sk(sk)->tun;
@@ -1039,7 +1041,8 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
if (!sk)
goto err_free_dev;
- init_waitqueue_head(&tun->socket.wait);
+ tun->socket.wq = &tun->wq;
+ init_waitqueue_head(&tun->wq.wait);
tun->socket.ops = &tun_socket_ops;
sock_init_data(&tun->socket, sk);
sk->sk_write_space = tun_sock_write_space;
diff --git a/include/linux/net.h b/include/linux/net.h
index 4157b5d..2b4deee 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -59,6 +59,7 @@ typedef enum {
#include <linux/wait.h>
#include <linux/fcntl.h> /* For O_CLOEXEC and O_NONBLOCK */
#include <linux/kmemcheck.h>
+#include <linux/rcupdate.h>
struct poll_table_struct;
struct pipe_inode_info;
@@ -116,6 +117,12 @@ enum sock_shutdown_cmd {
SHUT_RDWR = 2,
};
+struct socket_wq {
+ wait_queue_head_t wait;
+ struct fasync_struct *fasync_list;
+ struct rcu_head rcu;
+} ____cacheline_aligned_in_smp;
+
/**
* struct socket - general BSD socket
* @state: socket state (%SS_CONNECTED, etc)
@@ -135,11 +142,8 @@ struct socket {
kmemcheck_bitfield_end(type);
unsigned long flags;
- /*
- * Please keep fasync_list & wait fields in the same cache line
- */
- struct fasync_struct *fasync_list;
- wait_queue_head_t wait;
+
+ struct socket_wq *wq;
struct file *file;
struct sock *sk;
diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index 1614d78..20725e2 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -30,7 +30,7 @@ struct unix_skb_parms {
#endif
};
-#define UNIXCB(skb) (*(struct unix_skb_parms*)&((skb)->cb))
+#define UNIXCB(skb) (*(struct unix_skb_parms *)&((skb)->cb))
#define UNIXCREDS(skb) (&UNIXCB((skb)).creds)
#define UNIXSID(skb) (&UNIXCB((skb)).secid)
@@ -45,21 +45,23 @@ struct unix_skb_parms {
struct unix_sock {
/* WARNING: sk has to be the first member */
struct sock sk;
- struct unix_address *addr;
- struct dentry *dentry;
- struct vfsmount *mnt;
+ struct unix_address *addr;
+ struct dentry *dentry;
+ struct vfsmount *mnt;
struct mutex readlock;
- struct sock *peer;
- struct sock *other;
+ struct sock *peer;
+ struct sock *other;
struct list_head link;
- atomic_long_t inflight;
- spinlock_t lock;
+ atomic_long_t inflight;
+ spinlock_t lock;
unsigned int gc_candidate : 1;
unsigned int gc_maybe_cycle : 1;
- wait_queue_head_t peer_wait;
+ struct socket_wq peer_wq;
};
#define unix_sk(__sk) ((struct unix_sock *)__sk)
+#define peer_wait peer_wq.wait
+
#ifdef CONFIG_SYSCTL
extern int unix_sysctl_register(struct net *net);
extern void unix_sysctl_unregister(struct net *net);
diff --git a/include/net/sock.h b/include/net/sock.h
index d361c77..03d0046 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -159,7 +159,7 @@ struct sock_common {
* @sk_userlocks: %SO_SNDBUF and %SO_RCVBUF settings
* @sk_lock: synchronizer
* @sk_rcvbuf: size of receive buffer in bytes
- * @sk_sleep: sock wait queue
+ * @sk_wq: sock wait queue and async head
* @sk_dst_cache: destination cache
* @sk_dst_lock: destination cache lock
* @sk_policy: flow policy
@@ -257,7 +257,7 @@ struct sock {
struct sk_buff *tail;
int len;
} sk_backlog;
- wait_queue_head_t *sk_sleep;
+ struct socket_wq *sk_wq;
struct dst_entry *sk_dst_cache;
#ifdef CONFIG_XFRM
struct xfrm_policy *sk_policy[2];
@@ -1219,7 +1219,7 @@ static inline void sk_set_socket(struct sock *sk, struct socket *sock)
static inline wait_queue_head_t *sk_sleep(struct sock *sk)
{
- return sk->sk_sleep;
+ return &sk->sk_wq->wait;
}
/* Detach socket from process context.
* Announce socket dead, detach it from wait queue and inode.
@@ -1233,14 +1233,14 @@ static inline void sock_orphan(struct sock *sk)
write_lock_bh(&sk->sk_callback_lock);
sock_set_flag(sk, SOCK_DEAD);
sk_set_socket(sk, NULL);
- sk->sk_sleep = NULL;
+ sk->sk_wq = NULL;
write_unlock_bh(&sk->sk_callback_lock);
}
static inline void sock_graft(struct sock *sk, struct socket *parent)
{
write_lock_bh(&sk->sk_callback_lock);
- sk->sk_sleep = &parent->wait;
+ rcu_assign_pointer(sk->sk_wq, parent->wq);
parent->sk = sk;
sk_set_socket(sk, parent);
security_sock_graft(sk, parent);
@@ -1392,12 +1392,12 @@ static inline int sk_has_allocations(const struct sock *sk)
}
/**
- * sk_has_sleeper - check if there are any waiting processes
- * @sk: socket
+ * wq_has_sleeper - check if there are any waiting processes
+ * @wq: struct socket_wq
*
- * Returns true if socket has waiting processes
+ * Returns true if socket_wq has waiting processes
*
- * The purpose of the sk_has_sleeper and sock_poll_wait is to wrap the memory
+ * The purpose of the wq_has_sleeper and sock_poll_wait is to wrap the memory
* barrier call. They were added due to the race found within the tcp code.
*
* Consider following tcp code paths:
@@ -1410,9 +1410,10 @@ static inline int sk_has_allocations(const struct sock *sk)
* ... ...
* tp->rcv_nxt check sock_def_readable
* ... {
- * schedule ...
- * if (sk_sleep(sk) && waitqueue_active(sk_sleep(sk)))
- * wake_up_interruptible(sk_sleep(sk))
+ * schedule rcu_read_lock();
+ * wq = rcu_dereference(sk->sk_wq);
+ * if (wq && waitqueue_active(&wq->wait))
+ * wake_up_interruptible(&wq->wait)
* ...
* }
*
@@ -1421,28 +1422,27 @@ static inline int sk_has_allocations(const struct sock *sk)
* could then endup calling schedule and sleep forever if there are no more
* data on the socket.
*
- * The sk_has_sleeper is always called right after a call to read_lock, so we
- * can use smp_mb__after_lock barrier.
*/
-static inline int sk_has_sleeper(struct sock *sk)
+static inline bool wq_has_sleeper(struct socket_wq *wq)
{
+
/*
* We need to be sure we are in sync with the
* add_wait_queue modifications to the wait queue.
*
* This memory barrier is paired in the sock_poll_wait.
*/
- smp_mb__after_lock();
- return sk_sleep(sk) && waitqueue_active(sk_sleep(sk));
+ smp_mb();
+ return wq && waitqueue_active(&wq->wait);
}
-
+
/**
* sock_poll_wait - place memory barrier behind the poll_wait call.
* @filp: file
* @wait_address: socket wait queue
* @p: poll_table
*
- * See the comments in the sk_has_sleeper function.
+ * See the comments in the wq_has_sleeper function.
*/
static inline void sock_poll_wait(struct file *filp,
wait_queue_head_t *wait_address, poll_table *p)
@@ -1453,7 +1453,7 @@ static inline void sock_poll_wait(struct file *filp,
* We need to be sure we are in sync with the
* socket flags modification.
*
- * This memory barrier is paired in the sk_has_sleeper.
+ * This memory barrier is paired in the wq_has_sleeper.
*/
smp_mb();
}
diff --git a/net/atm/common.c b/net/atm/common.c
index e3e10e6..b43feb1 100644
--- a/net/atm/common.c
+++ b/net/atm/common.c
@@ -90,10 +90,13 @@ static void vcc_sock_destruct(struct sock *sk)
static void vcc_def_wakeup(struct sock *sk)
{
- read_lock(&sk->sk_callback_lock);
- if (sk_has_sleeper(sk))
- wake_up(sk_sleep(sk));
- read_unlock(&sk->sk_callback_lock);
+ struct socket_wq *wq;
+
+ rcu_read_lock();
+ wq = rcu_dereference(sk->sk_wq);
+ if (wq_has_sleeper(wq))
+ wake_up(&wq->wait);
+ rcu_read_unlock();
}
static inline int vcc_writable(struct sock *sk)
@@ -106,16 +109,19 @@ static inline int vcc_writable(struct sock *sk)
static void vcc_write_space(struct sock *sk)
{
- read_lock(&sk->sk_callback_lock);
+ struct socket_wq *wq;
+
+ rcu_read_lock();
if (vcc_writable(sk)) {
- if (sk_has_sleeper(sk))
- wake_up_interruptible(sk_sleep(sk));
+ wq = rcu_dereference(sk->sk_wq);
+ if (wq_has_sleeper(wq))
+ wake_up_interruptible(&wq->wait);
sk_wake_async(sk, SOCK_WAKE_SPACE, POLL_OUT);
}
- read_unlock(&sk->sk_callback_lock);
+ rcu_read_unlock();
}
static struct proto vcc_proto = {
diff --git a/net/core/sock.c b/net/core/sock.c
index 5104175..94c4aff 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1211,7 +1211,7 @@ struct sock *sk_clone(const struct sock *sk, const gfp_t priority)
*/
sk_refcnt_debug_inc(newsk);
sk_set_socket(newsk, NULL);
- newsk->sk_sleep = NULL;
+ newsk->sk_wq = NULL;
if (newsk->sk_prot->sockets_allocated)
percpu_counter_inc(newsk->sk_prot->sockets_allocated);
@@ -1800,41 +1800,53 @@ EXPORT_SYMBOL(sock_no_sendpage);
static void sock_def_wakeup(struct sock *sk)
{
- read_lock(&sk->sk_callback_lock);
- if (sk_has_sleeper(sk))
- wake_up_interruptible_all(sk_sleep(sk));
- read_unlock(&sk->sk_callback_lock);
+ struct socket_wq *wq;
+
+ rcu_read_lock();
+ wq = rcu_dereference(sk->sk_wq);
+ if (wq_has_sleeper(wq))
+ wake_up_interruptible_all(&wq->wait);
+ rcu_read_unlock();
}
static void sock_def_error_report(struct sock *sk)
{
- read_lock(&sk->sk_callback_lock);
- if (sk_has_sleeper(sk))
- wake_up_interruptible_poll(sk_sleep(sk), POLLERR);
+ struct socket_wq *wq;
+
+ rcu_read_lock();
+ wq = rcu_dereference(sk->sk_wq);
+ if (wq_has_sleeper(wq))
+ wake_up_interruptible_poll(&wq->wait, POLLERR);
sk_wake_async(sk, SOCK_WAKE_IO, POLL_ERR);
- read_unlock(&sk->sk_callback_lock);
+ rcu_read_unlock();
}
static void sock_def_readable(struct sock *sk, int len)
{
- read_lock(&sk->sk_callback_lock);
- if (sk_has_sleeper(sk))
- wake_up_interruptible_sync_poll(sk_sleep(sk), POLLIN |
+ struct socket_wq *wq;
+
+ rcu_read_lock();
+ wq = rcu_dereference(sk->sk_wq);
+ if (wq_has_sleeper(wq))
+ wake_up_interruptible_sync_poll(&wq->wait, POLLIN |
POLLRDNORM | POLLRDBAND);
sk_wake_async(sk, SOCK_WAKE_WAITD, POLL_IN);
- read_unlock(&sk->sk_callback_lock);
+ rcu_read_unlock();
}
static void sock_def_write_space(struct sock *sk)
{
- read_lock(&sk->sk_callback_lock);
+ struct socket_wq *wq;
+
+ rcu_read_lock();
/* Do not wake up a writer until he can make "significant"
* progress. --DaveM
*/
if ((atomic_read(&sk->sk_wmem_alloc) << 1) <= sk->sk_sndbuf) {
- if (sk_has_sleeper(sk))
- wake_up_interruptible_sync_poll(sk_sleep(sk), POLLOUT |
+ wq = rcu_dereference(sk->sk_wq);
+ if (wq_has_sleeper(wq))
+ wake_up_interruptible_sync_poll(&wq->wait, POLLOUT |
POLLWRNORM | POLLWRBAND);
/* Should agree with poll, otherwise some programs break */
@@ -1842,7 +1854,7 @@ static void sock_def_write_space(struct sock *sk)
sk_wake_async(sk, SOCK_WAKE_SPACE, POLL_OUT);
}
- read_unlock(&sk->sk_callback_lock);
+ rcu_read_unlock();
}
static void sock_def_destruct(struct sock *sk)
@@ -1896,10 +1908,10 @@ void sock_init_data(struct socket *sock, struct sock *sk)
if (sock) {
sk->sk_type = sock->type;
- sk->sk_sleep = &sock->wait;
+ sk->sk_wq = sock->wq;
sock->sk = sk;
} else
- sk->sk_sleep = NULL;
+ sk->sk_wq = NULL;
spin_lock_init(&sk->sk_dst_lock);
rwlock_init(&sk->sk_callback_lock);
diff --git a/net/core/stream.c b/net/core/stream.c
index 7b3c3f3..cc196f4 100644
--- a/net/core/stream.c
+++ b/net/core/stream.c
@@ -28,15 +28,19 @@
void sk_stream_write_space(struct sock *sk)
{
struct socket *sock = sk->sk_socket;
+ struct socket_wq *wq;
if (sk_stream_wspace(sk) >= sk_stream_min_wspace(sk) && sock) {
clear_bit(SOCK_NOSPACE, &sock->flags);
- if (sk_sleep(sk) && waitqueue_active(sk_sleep(sk)))
- wake_up_interruptible_poll(sk_sleep(sk), POLLOUT |
+ rcu_read_lock();
+ wq = rcu_dereference(sk->sk_wq);
+ if (wq_has_sleeper(wq))
+ wake_up_interruptible_poll(&wq->wait, POLLOUT |
POLLWRNORM | POLLWRBAND);
- if (sock->fasync_list && !(sk->sk_shutdown & SEND_SHUTDOWN))
+ if (wq && wq->fasync_list && !(sk->sk_shutdown & SEND_SHUTDOWN))
sock_wake_async(sock, SOCK_WAKE_SPACE, POLL_OUT);
+ rcu_read_unlock();
}
}
diff --git a/net/dccp/output.c b/net/dccp/output.c
index 2d3dcb3..aadbdb5 100644
--- a/net/dccp/output.c
+++ b/net/dccp/output.c
@@ -195,15 +195,17 @@ EXPORT_SYMBOL_GPL(dccp_sync_mss);
void dccp_write_space(struct sock *sk)
{
- read_lock(&sk->sk_callback_lock);
+ struct socket_wq *wq;
- if (sk_has_sleeper(sk))
- wake_up_interruptible(sk_sleep(sk));
+ rcu_read_lock();
+ wq = rcu_dereference(sk->sk_wq);
+ if (wq_has_sleeper(wq))
+ wake_up_interruptible(&wq->wait);
/* Should agree with poll, otherwise some programs break */
if (sock_writeable(sk))
sk_wake_async(sk, SOCK_WAKE_SPACE, POLL_OUT);
- read_unlock(&sk->sk_callback_lock);
+ rcu_read_unlock();
}
/**
diff --git a/net/iucv/af_iucv.c b/net/iucv/af_iucv.c
index 9636b7d..8be324f 100644
--- a/net/iucv/af_iucv.c
+++ b/net/iucv/af_iucv.c
@@ -305,11 +305,14 @@ static inline int iucv_below_msglim(struct sock *sk)
*/
static void iucv_sock_wake_msglim(struct sock *sk)
{
- read_lock(&sk->sk_callback_lock);
- if (sk_has_sleeper(sk))
- wake_up_interruptible_all(sk_sleep(sk));
+ struct socket_wq *wq;
+
+ rcu_read_lock();
+ wq = rcu_dereference(sk->sk_wq);
+ if (wq_has_sleeper(wq))
+ wake_up_interruptible_all(&wq->wait);
sk_wake_async(sk, SOCK_WAKE_SPACE, POLL_OUT);
- read_unlock(&sk->sk_callback_lock);
+ rcu_read_unlock();
}
/* Timers */
diff --git a/net/phonet/pep.c b/net/phonet/pep.c
index e2a9576..af4d38b 100644
--- a/net/phonet/pep.c
+++ b/net/phonet/pep.c
@@ -664,12 +664,12 @@ static int pep_wait_connreq(struct sock *sk, int noblock)
if (signal_pending(tsk))
return sock_intr_errno(timeo);
- prepare_to_wait_exclusive(&sk->sk_socket->wait, &wait,
+ prepare_to_wait_exclusive(sk_sleep(sk), &wait,
TASK_INTERRUPTIBLE);
release_sock(sk);
timeo = schedule_timeout(timeo);
lock_sock(sk);
- finish_wait(&sk->sk_socket->wait, &wait);
+ finish_wait(sk_sleep(sk), &wait);
}
return 0;
@@ -910,10 +910,10 @@ disabled:
goto out;
}
- prepare_to_wait(&sk->sk_socket->wait, &wait,
+ prepare_to_wait(sk_sleep(sk), &wait,
TASK_INTERRUPTIBLE);
done = sk_wait_event(sk, &timeo, atomic_read(&pn->tx_credits));
- finish_wait(&sk->sk_socket->wait, &wait);
+ finish_wait(sk_sleep(sk), &wait);
if (sk->sk_state != TCP_ESTABLISHED)
goto disabled;
diff --git a/net/phonet/socket.c b/net/phonet/socket.c
index c785bfd..6e9848b 100644
--- a/net/phonet/socket.c
+++ b/net/phonet/socket.c
@@ -265,7 +265,7 @@ static unsigned int pn_socket_poll(struct file *file, struct socket *sock,
struct pep_sock *pn = pep_sk(sk);
unsigned int mask = 0;
- poll_wait(file, &sock->wait, wait);
+ poll_wait(file, sk_sleep(sk), wait);
switch (sk->sk_state) {
case TCP_LISTEN:
diff --git a/net/rxrpc/af_rxrpc.c b/net/rxrpc/af_rxrpc.c
index c432d76..0b9bb20 100644
--- a/net/rxrpc/af_rxrpc.c
+++ b/net/rxrpc/af_rxrpc.c
@@ -62,13 +62,15 @@ static inline int rxrpc_writable(struct sock *sk)
static void rxrpc_write_space(struct sock *sk)
{
_enter("%p", sk);
- read_lock(&sk->sk_callback_lock);
+ rcu_read_lock();
if (rxrpc_writable(sk)) {
- if (sk_has_sleeper(sk))
- wake_up_interruptible(sk_sleep(sk));
+ struct socket_wq *wq = rcu_dereference(sk->sk_wq);
+
+ if (wq_has_sleeper(wq))
+ wake_up_interruptible(&wq->wait);
sk_wake_async(sk, SOCK_WAKE_SPACE, POLL_OUT);
}
- read_unlock(&sk->sk_callback_lock);
+ rcu_read_unlock();
}
/*
diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index 13d8229..d54700a 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -6065,7 +6065,7 @@ static void __sctp_write_space(struct sctp_association *asoc)
* here by modeling from the current TCP/UDP code.
* We have not tested with it yet.
*/
- if (sock->fasync_list &&
+ if (sock->wq->fasync_list &&
!(sk->sk_shutdown & SEND_SHUTDOWN))
sock_wake_async(sock,
SOCK_WAKE_SPACE, POLL_OUT);
diff --git a/net/socket.c b/net/socket.c
index 9822081..a0a59cb 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -252,9 +252,14 @@ static struct inode *sock_alloc_inode(struct super_block *sb)
ei = kmem_cache_alloc(sock_inode_cachep, GFP_KERNEL);
if (!ei)
return NULL;
- init_waitqueue_head(&ei->socket.wait);
+ ei->socket.wq = kmalloc(sizeof(struct socket_wq), GFP_KERNEL);
+ if (!ei->socket.wq) {
+ kmem_cache_free(sock_inode_cachep, ei);
+ return NULL;
+ }
+ init_waitqueue_head(&ei->socket.wq->wait);
+ ei->socket.wq->fasync_list = NULL;
- ei->socket.fasync_list = NULL;
ei->socket.state = SS_UNCONNECTED;
ei->socket.flags = 0;
ei->socket.ops = NULL;
@@ -264,10 +269,21 @@ static struct inode *sock_alloc_inode(struct super_block *sb)
return &ei->vfs_inode;
}
+
+static void wq_free_rcu(struct rcu_head *head)
+{
+ struct socket_wq *wq = container_of(head, struct socket_wq, rcu);
+
+ kfree(wq);
+}
+
static void sock_destroy_inode(struct inode *inode)
{
- kmem_cache_free(sock_inode_cachep,
- container_of(inode, struct socket_alloc, vfs_inode));
+ struct socket_alloc *ei;
+
+ ei = container_of(inode, struct socket_alloc, vfs_inode);
+ call_rcu(&ei->socket.wq->rcu, wq_free_rcu);
+ kmem_cache_free(sock_inode_cachep, ei);
}
static void init_once(void *foo)
@@ -513,7 +529,7 @@ void sock_release(struct socket *sock)
module_put(owner);
}
- if (sock->fasync_list)
+ if (sock->wq->fasync_list)
printk(KERN_ERR "sock_release: fasync list not empty!\n");
percpu_sub(sockets_in_use, 1);
@@ -1080,9 +1096,9 @@ static int sock_fasync(int fd, struct file *filp, int on)
lock_sock(sk);
- fasync_helper(fd, filp, on, &sock->fasync_list);
+ fasync_helper(fd, filp, on, &sock->wq->fasync_list);
- if (!sock->fasync_list)
+ if (!sock->wq->fasync_list)
sock_reset_flag(sk, SOCK_FASYNC);
else
sock_set_flag(sk, SOCK_FASYNC);
@@ -1091,12 +1107,20 @@ static int sock_fasync(int fd, struct file *filp, int on)
return 0;
}
-/* This function may be called only under socket lock or callback_lock */
+/* This function may be called only under socket lock or callback_lock or rcu_lock */
int sock_wake_async(struct socket *sock, int how, int band)
{
- if (!sock || !sock->fasync_list)
+ struct socket_wq *wq;
+
+ if (!sock)
return -1;
+ rcu_read_lock();
+ wq = rcu_dereference(sock->wq);
+ if (!wq || !wq->fasync_list) {
+ rcu_read_unlock();
+ return -1;
+ }
switch (how) {
case SOCK_WAKE_WAITD:
if (test_bit(SOCK_ASYNC_WAITDATA, &sock->flags))
@@ -1108,11 +1132,12 @@ int sock_wake_async(struct socket *sock, int how, int band)
/* fall through */
case SOCK_WAKE_IO:
call_kill:
- kill_fasync(&sock->fasync_list, SIGIO, band);
+ kill_fasync(&wq->fasync_list, SIGIO, band);
break;
case SOCK_WAKE_URG:
- kill_fasync(&sock->fasync_list, SIGURG, band);
+ kill_fasync(&wq->fasync_list, SIGURG, band);
}
+ rcu_read_unlock();
return 0;
}
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 87c0360..fef2cc5 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -313,13 +313,16 @@ static inline int unix_writable(struct sock *sk)
static void unix_write_space(struct sock *sk)
{
- read_lock(&sk->sk_callback_lock);
+ struct socket_wq *wq;
+
+ rcu_read_lock();
if (unix_writable(sk)) {
- if (sk_has_sleeper(sk))
- wake_up_interruptible_sync(sk_sleep(sk));
+ wq = rcu_dereference(sk->sk_wq);
+ if (wq_has_sleeper(wq))
+ wake_up_interruptible_sync(&wq->wait);
sk_wake_async(sk, SOCK_WAKE_SPACE, POLL_OUT);
}
- read_unlock(&sk->sk_callback_lock);
+ rcu_read_unlock();
}
/* When dgram socket disconnects (or changes its peer), we clear its receive
@@ -406,9 +409,7 @@ static int unix_release_sock(struct sock *sk, int embrion)
skpair->sk_err = ECONNRESET;
unix_state_unlock(skpair);
skpair->sk_state_change(skpair);
- read_lock(&skpair->sk_callback_lock);
sk_wake_async(skpair, SOCK_WAKE_WAITD, POLL_HUP);
- read_unlock(&skpair->sk_callback_lock);
}
sock_put(skpair); /* It may now die */
unix_peer(sk) = NULL;
@@ -1142,7 +1143,7 @@ restart:
newsk->sk_peercred.pid = task_tgid_vnr(current);
current_euid_egid(&newsk->sk_peercred.uid, &newsk->sk_peercred.gid);
newu = unix_sk(newsk);
- newsk->sk_sleep = &newu->peer_wait;
+ newsk->sk_wq = &newu->peer_wq;
otheru = unix_sk(other);
/* copy address information from listening to new sock*/
@@ -1931,12 +1932,10 @@ static int unix_shutdown(struct socket *sock, int mode)
other->sk_shutdown |= peer_mode;
unix_state_unlock(other);
other->sk_state_change(other);
- read_lock(&other->sk_callback_lock);
if (peer_mode == SHUTDOWN_MASK)
sk_wake_async(other, SOCK_WAKE_WAITD, POLL_HUP);
else if (peer_mode & RCV_SHUTDOWN)
sk_wake_async(other, SOCK_WAKE_WAITD, POLL_IN);
- read_unlock(&other->sk_callback_lock);
}
if (other)
sock_put(other);
* Re: [PATCH net-next-2.6] net: speedup udp receive path
From: Eric Dumazet @ 2010-04-29 13:49 UTC (permalink / raw)
To: hadi
Cc: Changli Gao, David Miller, therbert, shemminger, netdev,
Eilon Greenstein, Brian Bloniarz
In-Reply-To: <1272548258.4258.185.camel@bigi>
On Thursday, 29 April 2010 at 09:37 -0400, jamal wrote:
> On Thu, 2010-04-29 at 15:21 +0200, Eric Dumazet wrote:
>
>
> >
> > You could try following program :
> >
>
> Will do later today (test machine is not on the network and is about 20
> minutes from here; so worst case i will get you results by end of day)
> I guess this program is good enough since it tells me the system wide
> ipi count - what my patch did was also to break it down by which cpu got
> how many IPIs (served to check if there was uneven distribution)
>
> >
> > Is your application mono threaded and receiving data to 8 sockets ?
> >
>
> I fork one instance per detected cpu and bind to different ports each
> time. Example bind to port 8200 on cpu0, 8201 on cpu1, etc.
>
I guess this is the problem ;)
With RPS, you should not bind your threads to a cpu.
It is the rps hash that decides for you.
I am using the following program:
/*
* Usage: udpsink [-c] [-v] [-p baseport] nbports
*
*/
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <string.h>
#include <stdio.h>
#include <errno.h>
#include <unistd.h>
#include <stdlib.h>
#include <fcntl.h>
#include <pthread.h> /* pthread_create(); link with -lpthread */
struct worker_data {
int fd;
unsigned long pack_count;
unsigned long bytes_count;
unsigned long _padd[16 - 3]; /* alignment */
};
void usage(int code)
{
fprintf(stderr, "Usage: udpsink [-c] [-v] [-p baseport] nbports\n");
exit(code);
}
void *worker_func(void *arg)
{
struct worker_data *wdata = (struct worker_data *)arg;
char buffer[4096];
struct sockaddr_in addr;
int lu;
while (1) {
socklen_t len = sizeof(addr);
lu = recvfrom(wdata->fd, buffer, sizeof(buffer), 0, (struct sockaddr *)&addr, &len);
if (lu > 0) {
wdata->pack_count++;
wdata->bytes_count += lu;
}
}
}
int main(int argc, char *argv[])
{
int c;
int baseport = 4000;
int nbthreads;
struct worker_data *wdata;
unsigned long ototal = 0;
int concurrent = 0;
int verbose = 0;
int i;
while ((c = getopt(argc, argv, "cvp:")) != -1) {
if (c == 'p')
baseport = atoi(optarg);
else if (c == 'c')
concurrent = 1;
else if (c == 'v')
verbose++;
else usage(1);
}
if (optind == argc)
usage(1);
nbthreads = atoi(argv[optind]);
wdata = calloc(nbthreads, sizeof(struct worker_data));
if (!wdata) {
perror("calloc");
return 1;
}
for (i = 0; i < nbthreads; i++) {
struct sockaddr_in addr;
pthread_t tid;
if (i && concurrent) {
wdata[i].fd = wdata[0].fd ;
} else {
wdata[i].fd = socket(PF_INET, SOCK_DGRAM, 0);
if (wdata[i].fd == -1) {
perror("socket");
return 1;
}
memset(&addr, 0, sizeof(addr));
addr.sin_family = AF_INET;
// addr.sin_addr.s_addr = inet_addr(argv[optind]);
addr.sin_port = htons(baseport + i);
if (bind(wdata[i].fd, (struct sockaddr *) &addr, sizeof(addr)) < 0) {
perror("bind");
return 1;
}
// fcntl(wdata[i].fd, F_SETFL, O_NDELAY);
}
pthread_create(&tid, NULL, worker_func, wdata + i);
}
for (;;) {
unsigned long total;
long delta;
sleep(1);
total = 0;
for (i = 0; i < nbthreads;i++) {
total += wdata[i].pack_count;
}
delta = total - ototal;
if (delta) {
printf("%ld pps (%lu", delta, total);
if (verbose) {
for (i = 0; i < nbthreads;i++) {
if (wdata[i].pack_count)
printf(" %d:%lu", i, wdata[i].pack_count);
}
}
printf(")\n");
}
ototal = total;
}
}
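To illustrate why pinning is unnecessary with RPS, here is a rough user-space sketch of how a per-flow hash can select a CPU from a configured list. It is only an illustration: toy_flow_hash() and pick_cpu() are hypothetical names, the mixing function is not the kernel's hash, and the multiply-shift scaling is assumed to resemble what the kernel does rather than copied from it.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the flow hash; the kernel's real hash differs. */
uint32_t toy_flow_hash(uint32_t saddr, uint32_t daddr,
                       uint16_t sport, uint16_t dport)
{
    uint32_t h = 2166136261u;   /* FNV-1a style mixing, illustration only */
    uint32_t words[3] = { saddr, daddr, ((uint32_t)sport << 16) | dport };
    int i;

    for (i = 0; i < 3; i++) {
        h ^= words[i];
        h *= 16777619u;
    }
    return h;
}

/* Map a 32-bit hash onto one of ncpus table entries without a modulo. */
unsigned int pick_cpu(uint32_t hash, const unsigned int *cpus,
                      unsigned int ncpus)
{
    assert(ncpus > 0);
    return cpus[((uint64_t)hash * ncpus) >> 32];
}
```

All packets of one flow (same addresses and ports) produce the same hash and so land on the same CPU, while different ports, as used by udpsink with baseport + i, tend to spread across the list.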
* Re: [PATCH] [RFC] C/R: inet4 and inet6 unicast routes (v2)
From: Dan Smith @ 2010-04-30 18:25 UTC (permalink / raw)
To: Serge E. Hallyn; +Cc: containers, Vlad Yasevich, netdev, David Miller
In-Reply-To: <20100430181946.GA26761@us.ibm.com>
SH> So I'm afraid you're going to have to do a slightly uglier thing
SH> where you unshare_nsproxy_namespaces() and then
SH> switch_task_namespaces() to the new nsproxy.
Well, I think that would be hidden in the nicer helper function I
think I'll need, which I alluded to in the patch header. This is just
an RFC proof that it can be done in this manner, but I think a
separate helper in nsproxy.c is in order to make it nice (and avoid
the extra alloc/free of the netns that copy_namespaces() will create).
Agreed?
Thanks!
--
Dan Smith
IBM Linux Technology Center
email: danms@us.ibm.com
* Re: [PATCH] [RFC] C/R: inet4 and inet6 unicast routes (v2)
From: Serge E. Hallyn @ 2010-04-30 18:37 UTC (permalink / raw)
To: Dan Smith; +Cc: containers, Vlad Yasevich, netdev, David Miller
In-Reply-To: <87mxwkztjf.fsf@caffeine.danplanet.com>
Quoting Dan Smith (danms@us.ibm.com):
> SH> So I'm afraid you're going to have to do a slightly uglier thing
> SH> where you unshare_nsproxy_namespaces() and then
> SH> switch_task_namespaces() to the new nsproxy.
>
> Well, I think that would be hidden in the nicer helper function I
> think I'll need, which I alluded to in the patch header. This is just
> an RFC proof that it can be done in this manner, but I think a
> separate helper in nsproxy.c is in order to make it nice (and avoid
> the extra alloc/free of the netns that copy_namespaces() will create).
> Agreed?
Yup - thanks!
-serge
* Re: [PATCH 5/5] sctp: Fix oops when sending queued ASCONF chunks
From: Shuaijun Zhang @ 2010-04-29 14:09 UTC (permalink / raw)
To: Vlad Yasevich; +Cc: netdev, davem, linux-sctp, Yuansong Qiao
In-Reply-To: <1272480442-32673-6-git-send-email-vladislav.yasevich@hp.com>
Vlad Yasevich wrote:
> When we finish processing ASCONF_ACK chunk, we try to send
> the next queued ASCONF. This action runs the sctp state
> machine recursively and it's not prepared to do so.
>
> kernel BUG at kernel/timer.c:790!
> invalid opcode: 0000 [#1] SMP
> last sysfs file: /sys/module/ipv6/initstate
> Modules linked in: sha256_generic sctp libcrc32c ipv6 dm_multipath
> uinput 8139too i2c_piix4 8139cp mii i2c_core pcspkr virtio_net joydev
> floppy virtio_blk virtio_pci [last unloaded: scsi_wait_scan]
>
> Pid: 0, comm: swapper Not tainted 2.6.34-rc4 #15 /Bochs
> EIP: 0060:[<c044a2ef>] EFLAGS: 00010286 CPU: 0
> EIP is at add_timer+0xd/0x1b
> EAX: cecbab14 EBX: 000000f0 ECX: c0957b1c EDX: 03595cf4
> ESI: cecba800 EDI: cf276f00 EBP: c0957aa0 ESP: c0957aa0
> DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
> Process swapper (pid: 0, ti=c0956000 task=c0988ba0 task.ti=c0956000)
> Stack:
> c0957ae0 d1851214 c0ab62e4 c0ab5f26 0500ffff 00000004 00000005 00000004
> <0> 00000000 d18694fd 00000004 1666b892 cecba800 cecba800 c0957b14
> 00000004
> <0> c0957b94 d1851b11 ceda8b00 cecba800 cf276f00 00000001 c0957b14
> 000000d0
>
According to the call trace below, it seems that our modification did
not take effect.
sctp_primitive_ASCONF should be invoked after sctp_side_effects().
Our code fixed the same problem in kernel 2.6.27.28.
I am not sure what differs between the 2.6.34-rc4 and 2.6.27.28
kernels.
> Call Trace:
> [<d1851214>] ? sctp_side_effects+0x607/0xdfc [sctp]
> [<d1851b11>] ? sctp_do_sm+0x108/0x159 [sctp]
> [<d1863386>] ? sctp_pname+0x0/0x1d [sctp]
> [<d1861a56>] ? sctp_primitive_ASCONF+0x36/0x3b [sctp] <--- sctp_side_effects() should show up here before sending the next ASCONF
> [<d185657c>] ? sctp_process_asconf_ack+0x2a4/0x2d3 [sctp]
> [<d184e35c>] ? sctp_sf_do_asconf_ack+0x1dd/0x2b4 [sctp]
> [<d1851ac1>] ? sctp_do_sm+0xb8/0x159 [sctp]
> [<d1863334>] ? sctp_cname+0x0/0x52 [sctp]
> [<d1854377>] ? sctp_assoc_bh_rcv+0xac/0xe1 [sctp]
> [<d1858f0f>] ? sctp_inq_push+0x2d/0x30 [sctp]
> [<d186329d>] ? sctp_rcv+0x797/0x82e [sctp]
>
> Tested-by: Wei Yongjun <yjwei@cn.fujitsu.com>
> Signed-off-by: Yuansong Qiao <ysqiao@research.ait.ie>
> Signed-off-by: Shuaijun Zhang <szhang@research.ait.ie>
> Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
> ---
> include/net/sctp/command.h | 1 +
> net/sctp/sm_make_chunk.c | 15 ---------------
> net/sctp/sm_sideeffect.c | 26 ++++++++++++++++++++++++++
> net/sctp/sm_statefuns.c | 8 +++++++-
> 4 files changed, 34 insertions(+), 16 deletions(-)
>
> diff --git a/include/net/sctp/command.h b/include/net/sctp/command.h
> index 8be5135..2c55a7e 100644
> --- a/include/net/sctp/command.h
> +++ b/include/net/sctp/command.h
> @@ -107,6 +107,7 @@ typedef enum {
> SCTP_CMD_T1_RETRAN, /* Mark for retransmission after T1 timeout */
> SCTP_CMD_UPDATE_INITTAG, /* Update peer inittag */
> SCTP_CMD_SEND_MSG, /* Send the whole use message */
> + SCTP_CMD_SEND_NEXT_ASCONF, /* Send the next ASCONF after ACK */
> SCTP_CMD_LAST
> } sctp_verb_t;
>
> diff --git a/net/sctp/sm_make_chunk.c b/net/sctp/sm_make_chunk.c
> index f6fc5c1..0fd5b4c 100644
> --- a/net/sctp/sm_make_chunk.c
> +++ b/net/sctp/sm_make_chunk.c
> @@ -3318,21 +3318,6 @@ int sctp_process_asconf_ack(struct sctp_association *asoc,
> sctp_chunk_free(asconf);
> asoc->addip_last_asconf = NULL;
>
> - /* Send the next asconf chunk from the addip chunk queue. */
> - if (!list_empty(&asoc->addip_chunk_list)) {
> - struct list_head *entry = asoc->addip_chunk_list.next;
> - asconf = list_entry(entry, struct sctp_chunk, list);
> -
> - list_del_init(entry);
> -
> - /* Hold the chunk until an ASCONF_ACK is received. */
> - sctp_chunk_hold(asconf);
> - if (sctp_primitive_ASCONF(asoc, asconf))
> - sctp_chunk_free(asconf);
> - else
> - asoc->addip_last_asconf = asconf;
> - }
> -
> return retval;
> }
>
> diff --git a/net/sctp/sm_sideeffect.c b/net/sctp/sm_sideeffect.c
> index 4c5bed9..d5ae450 100644
> --- a/net/sctp/sm_sideeffect.c
> +++ b/net/sctp/sm_sideeffect.c
> @@ -962,6 +962,29 @@ static int sctp_cmd_send_msg(struct sctp_association *asoc,
> }
>
>
> +/* Send the next ASCONF packet currently stored in the association.
> + * This happens after the ASCONF_ACK was successfully processed.
> + */
> +static void sctp_cmd_send_asconf(struct sctp_association *asoc)
> +{
> + /* Send the next asconf chunk from the addip chunk
> + * queue.
> + */
> + if (!list_empty(&asoc->addip_chunk_list)) {
> + struct list_head *entry = asoc->addip_chunk_list.next;
> + struct sctp_chunk *asconf = list_entry(entry,
> + struct sctp_chunk, list);
> + list_del_init(entry);
> +
> + /* Hold the chunk until an ASCONF_ACK is received. */
> + sctp_chunk_hold(asconf);
> + if (sctp_primitive_ASCONF(asoc, asconf))
> + sctp_chunk_free(asconf);
> + else
> + asoc->addip_last_asconf = asconf;
> + }
> +}
> +
>
> /* These three macros allow us to pull the debugging code out of the
> * main flow of sctp_do_sm() to keep attention focused on the real
> @@ -1617,6 +1640,9 @@ static int sctp_cmd_interpreter(sctp_event_t event_type,
> }
> error = sctp_cmd_send_msg(asoc, cmd->obj.msg);
> break;
> + case SCTP_CMD_SEND_NEXT_ASCONF:
> + sctp_cmd_send_asconf(asoc);
> + break;
> default:
> printk(KERN_WARNING "Impossible command: %u, %p\n",
> cmd->verb, cmd->obj.ptr);
> diff --git a/net/sctp/sm_statefuns.c b/net/sctp/sm_statefuns.c
> index abf601a..24b2cd5 100644
> --- a/net/sctp/sm_statefuns.c
> +++ b/net/sctp/sm_statefuns.c
> @@ -3676,8 +3676,14 @@ sctp_disposition_t sctp_sf_do_asconf_ack(const struct sctp_endpoint *ep,
> SCTP_TO(SCTP_EVENT_TIMEOUT_T4_RTO));
>
> if (!sctp_process_asconf_ack((struct sctp_association *)asoc,
> - asconf_ack))
> + asconf_ack)) {
> + /* Successfully processed ASCONF_ACK. We can
> + * release the next asconf if we have one.
> + */
> + sctp_add_cmd_sf(commands, SCTP_CMD_SEND_NEXT_ASCONF,
> + SCTP_NULL());
> return SCTP_DISPOSITION_CONSUME;
> + }
>
> abort = sctp_make_abort(asoc, asconf_ack,
> sizeof(sctp_errhdr_t));
>
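The ordering at issue can be illustrated with a small user-space sketch: the state function only records a "send next" command, and the driver acts on it after the state function has returned. This is a simplified model, not the SCTP code; struct machine, handle_ack(), process_ack() and the single CMD_SEND_NEXT verb are all hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical verb; only the idea of a deferred command is borrowed. */
enum cmd_verb { CMD_SEND_NEXT };

#define MAX_CMDS 8

struct cmd_seq {
    enum cmd_verb verbs[MAX_CMDS];
    size_t count;
};

struct machine {
    int in_handler;          /* 1 while a state function is running */
    int sent_inside_handler; /* sends that happened re-entrantly */
    int pending;             /* queued items waiting to go out */
    int sent;                /* items sent so far */
};

void add_cmd(struct cmd_seq *seq, enum cmd_verb verb)
{
    assert(seq->count < MAX_CMDS);
    seq->verbs[seq->count++] = verb;
}

void send_next(struct machine *m)
{
    if (m->pending == 0)
        return;
    m->pending--;
    m->sent++;
    if (m->in_handler)
        m->sent_inside_handler++;
}

/* State function: handles the ACK and only records the follow-up work. */
void handle_ack(struct machine *m, struct cmd_seq *seq)
{
    (void)m;
    add_cmd(seq, CMD_SEND_NEXT);
}

/* Driver: run the state function first, then interpret its commands. */
void process_ack(struct machine *m)
{
    struct cmd_seq seq = { .count = 0 };
    size_t i;

    m->in_handler = 1;
    handle_ack(m, &seq);
    m->in_handler = 0;

    for (i = 0; i < seq.count; i++) {
        if (seq.verbs[i] == CMD_SEND_NEXT)
            send_next(m);
    }
}
```

With this shape the send never runs while the handler is still active, which is the property the call trace suggests is missing when sctp_primitive_ASCONF appears directly under sctp_process_asconf_ack.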
* [PATCH 1/2] ppp_generic: pull 2 bytes so that PPP_PROTO(skb) is valid
From: Simon Arlott @ 2010-04-30 18:41 UTC (permalink / raw)
To: netdev; +Cc: paulus, linux-ppp
In ppp_input(), PPP_PROTO(skb) may refer to invalid data in the skb.
If this happens and (proto >= 0xc000 || proto == PPP_CCPFRAG) then
the packet is passed directly to pppd.
This occurs frequently when using PPPoE with an interface MTU
greater than 1500 because the skb is more likely to be non-linear.
The next 2 bytes need to be pulled in ppp_input(). The pull of 2
bytes in ppp_receive_frame() has been removed as it is no longer
required.
Signed-off-by: Simon Arlott <simon@fire.lp0.eu>
---
Tested with PPPoE over e1000 at MTU 16110.
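For readers unfamiliar with the check, a minimal user-space sketch of the idea follows: confirm that two bytes are available before reading the protocol field, then apply the same test that decides whether a frame goes to pppd. The helper names ppp_proto_checked() and ppp_goes_to_pppd() are hypothetical, and a plain length check stands in for pskb_may_pull(), which in the kernel also makes the bytes linear.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PPP_CCPFRAG 0x00fb  /* same value as in linux/ppp_defs.h */

/* Hypothetical helper: read the protocol only when 2 bytes are present. */
int ppp_proto_checked(const uint8_t *data, size_t len, uint16_t *proto)
{
    assert(proto != NULL);
    if (len < 2)
        return -1;          /* too short: caller counts a length error */
    *proto = (uint16_t)((data[0] << 8) | data[1]);
    return 0;
}

/* Frames with these protocol numbers are queued for pppd. */
int ppp_goes_to_pppd(uint16_t proto)
{
    return proto >= 0xc000 || proto == PPP_CCPFRAG;
}
```

Reading the field without the length check is what lets arbitrary bytes be interpreted as a protocol number of 0xc000 or above.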
drivers/net/ppp_generic.c | 28 ++++++++++++++++++----------
1 files changed, 18 insertions(+), 10 deletions(-)
diff --git a/drivers/net/ppp_generic.c b/drivers/net/ppp_generic.c
index 6e281bc..fdd8deb 100644
--- a/drivers/net/ppp_generic.c
+++ b/drivers/net/ppp_generic.c
@@ -1572,8 +1572,18 @@ ppp_input(struct ppp_channel *chan, struct sk_buff *skb)
return;
}
- proto = PPP_PROTO(skb);
+
read_lock_bh(&pch->upl);
+ if (!pskb_may_pull(skb, 2)) {
+ kfree_skb(skb);
+ if (pch->ppp) {
+ ++pch->ppp->dev->stats.rx_length_errors;
+ ppp_receive_error(pch->ppp);
+ }
+ goto done;
+ }
+
+ proto = PPP_PROTO(skb);
if (!pch->ppp || proto >= 0xc000 || proto == PPP_CCPFRAG) {
/* put it on the channel queue */
skb_queue_tail(&pch->file.rq, skb);
@@ -1585,6 +1595,8 @@ ppp_input(struct ppp_channel *chan, struct sk_buff *skb)
} else {
ppp_do_recv(pch->ppp, skb, pch);
}
+
+done:
read_unlock_bh(&pch->upl);
}
@@ -1617,7 +1629,8 @@ ppp_input_error(struct ppp_channel *chan, int code)
static void
ppp_receive_frame(struct ppp *ppp, struct sk_buff *skb, struct channel *pch)
{
- if (pskb_may_pull(skb, 2)) {
+ /* note: a 0-length skb is used as an error indication */
+ if (skb->len > 0) {
#ifdef CONFIG_PPP_MULTILINK
/* XXX do channel-level decompression here */
if (PPP_PROTO(skb) == PPP_MP)
@@ -1625,15 +1638,10 @@ ppp_receive_frame(struct ppp *ppp, struct sk_buff *skb, struct channel *pch)
else
#endif /* CONFIG_PPP_MULTILINK */
ppp_receive_nonmp_frame(ppp, skb);
- return;
+ } else {
+ kfree_skb(skb);
+ ppp_receive_error(ppp);
}
-
- if (skb->len > 0)
- /* note: a 0-length skb is used as an error indication */
- ++ppp->dev->stats.rx_length_errors;
-
- kfree_skb(skb);
- ppp_receive_error(ppp);
}
static void
--
1.7.0.4
--
Simon Arlott