Netdev List


* Re: [PATCH v2] xtables: make XT_ALIGN() usable in exported headers by exporting __ALIGN_KERNEL()
From: Alexey Dobriyan @ 2010-04-13 11:50 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: linux-kernel, netdev, shemminger, bhutchings, andreas, hadi,
	hideaki
In-Reply-To: <4BC450A4.1010200@trash.net>

On Tue, Apr 13, 2010 at 01:08:20PM +0200, Patrick McHardy wrote:
> Alexey Dobriyan wrote:
> > XT_ALIGN() was rewritten through ALIGN() by commit 42107f5009da223daa800d6da6904d77297ae829
> > "netfilter: xtables: symmetric COMPAT_XT_ALIGN definition".
> > ALIGN() is not exported in userspace headers, which created compile problem for tc(8)
> > and will create problem for iptables(8).
> > 
> > We can't export generic looking name ALIGN() but we can export less generic
> > __ALIGN_KERNEL() (suggested by Ben Hutchings).
> > Google knows nothing about __ALIGN_KERNEL().
> > 
> > COMPAT_XT_ALIGN() changed for symmetry.
> 
> I've already pushed your change out, could you send me an incremental
> fix please?
> 
> master.kernel.org:/pub/scm/linux/kernel/git/kaber/nf-next-2.6.git

[PATCH] Restore __ALIGN_MASK()

Fix lib/bitmap.c compile failure due to __ALIGN_KERNEL changes.

---
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -40,6 +40,7 @@ extern const char linux_proc_banner[];
 #define STACK_MAGIC	0xdeadbeef
 
 #define ALIGN(x, a)		__ALIGN_KERNEL((x), (a))
+#define __ALIGN_MASK(x, mask)	__ALIGN_KERNEL_MASK((x), (mask))
 #define PTR_ALIGN(p, a)		((typeof(p))ALIGN((unsigned long)(p), (a)))
 #define IS_ALIGNED(x, a)		(((x) & ((typeof(x))(a) - 1)) == 0)
 


* Re: forcedeth driver hangs under heavy load
From: Ben Hutchings @ 2010-04-13 12:04 UTC (permalink / raw)
  To: stephen mulcahy; +Cc: Eric Dumazet, netdev, Ben Hutchings, Ayaz Abdulla, 572201
In-Reply-To: <4BC44EC8.1010104@gmail.com>

On Tue, 2010-04-13 at 12:00 +0100, stephen mulcahy wrote:
> Eric Dumazet wrote:
> > On Tuesday, 13 April 2010 at 11:03 +0100, stephen mulcahy wrote:
> >> Eric Dumazet wrote:
> >>> OK it seems forcedeth has problem with checksums ?
> >>>
> >>> Try to change "ethtool -k eth0" settings ?
> >>>
> >>> ethtool -K eth0 tso off tx off
> >> Yes, that makes an unresponsive system responsive again immediately, nice!
> >>
> >> Should the driver default to disabling this until the problem is corrected?
> >>
> >> -stephen
> > 
> > Both flags need to be disabled, or only one is OK ?
> 
> ethtool -K eth0 tx off
> 
> fixes the problem (without tso)
> 
> but running
> 
> ethtool -k eth0
> Offload parameters for eth0:
> rx-checksumming: on
> tx-checksumming: off
> scatter-gather: off
> tcp-segmentation-offload: off
> udp-fragmentation-offload: off
> generic-segmentation-offload: on
> generic-receive-offload: off
> large-receive-offload: off
> 
> seems to indicate that tso is also disabled by this - does that sound 
> correct?

That's correct - TSO requires TX offload.  What happens if you only turn
off TSO?

Ben. (wearing another hat)

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.



* Re: [PATCH v2] xtables: make XT_ALIGN() usable in exported headers by exporting __ALIGN_KERNEL()
From: Patrick McHardy @ 2010-04-13 12:10 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: linux-kernel, netdev, shemminger, bhutchings, andreas, hadi,
	hideaki
In-Reply-To: <20100414125007.GB25686@x200>

Alexey Dobriyan wrote:
> On Tue, Apr 13, 2010 at 01:08:20PM +0200, Patrick McHardy wrote:
>> Alexey Dobriyan wrote:
>>> XT_ALIGN() was rewritten through ALIGN() by commit 42107f5009da223daa800d6da6904d77297ae829
>>> "netfilter: xtables: symmetric COMPAT_XT_ALIGN definition".
>>> ALIGN() is not exported in userspace headers, which created compile problem for tc(8)
>>> and will create problem for iptables(8).
>>>
>>> We can't export generic looking name ALIGN() but we can export less generic
>>> __ALIGN_KERNEL() (suggested by Ben Hutchings).
>>> Google knows nothing about __ALIGN_KERNEL().
>>>
>>> COMPAT_XT_ALIGN() changed for symmetry.
>> I've already pushed your change out, could you send me an incremental
>> fix please?
>>
>> master.kernel.org:/pub/scm/linux/kernel/git/kaber/nf-next-2.6.git
> 
> [PATCH] Restore __ALIGN_MASK()
> 
> Fix lib/bitmap.c compile failure due to __ALIGN_KERNEL changes.

Applied.
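
For readers unfamiliar with the macros in this thread: ALIGN(x, a) rounds x
up to the next multiple of a (a must be a power of two) by adding a mask and
clearing the low bits, and __ALIGN_MASK() is the same operation with the mask
passed in directly. Below is a minimal userspace sketch of that arithmetic.
The DEMO_ names are made up for illustration and are not the kernel's
identifiers, and __typeof__ assumes GCC or Clang.

```c
#include <stddef.h>

/*
 * Hypothetical userspace stand-ins for the kernel macros discussed above.
 * DEMO_ALIGN_MASK rounds x up so that all bits set in mask are cleared;
 * DEMO_ALIGN derives the mask from a power-of-two alignment a.
 */
#define DEMO_ALIGN_MASK(x, mask)  (((x) + (mask)) & ~(mask))
#define DEMO_ALIGN(x, a)          DEMO_ALIGN_MASK((x), (__typeof__(x))(a) - 1)

/* Function form of the same calculation, for callers that prefer no macros. */
static inline size_t demo_align_up(size_t x, size_t a)
{
	return DEMO_ALIGN(x, a);
}
```

With an alignment of 8, the value 13 rounds up to 16, while 16 stays at 16.
The result is only meaningful when the alignment is a power of two.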


* Re: SO_REUSEADDR with UDP (again)
From: Eric Dumazet @ 2010-04-13 12:21 UTC (permalink / raw)
  To: Michal Svoboda; +Cc: netdev
In-Reply-To: <20100413112726.GB16595@myhost.felk.cvut.cz>

On Tuesday, 13 April 2010 at 13:27 +0200, Michal Svoboda wrote:
> Eric Dumazet wrote:
> > Why do you use REUSEADDR ? This is doing what is documented.
> > 
> >        SO_REUSEADDR
> >               Indicates that the rules used in validating addresses  supplied
> >               in  a  bind(2) call should allow reuse of local addresses.  For
> >               AF_INET sockets this means that a socket may bind, except  when
> >               there is an active listening socket bound to the address.  When
> >               the listening socket is bound to  INADDR_ANY  with  a  specific
> >               port then it is not possible to bind to this port for any local
> >               address.  Argument is an integer boolean flag.
> 
> I read it 10 times but it doesn't say anything about stealing frames, or
> implementation-defined behavior in this case.

If it is not documented, it is implementation defined.

> 
> > A UDP application wanting a port for its exclusive use must not set
> > REUSEADDR; otherwise it basically allows anybody to bind a UDP socket to the
> > same port and potentially steal incoming frames.
> 
> That's fair enough, I will talk to the developers of the "very buggy"
> applications that use this flag and ask them to reconsider.

;)

>  
> > REUSEADDR is usually used when an application has several sockets bound
> > to same port, but different IP addresses (or bound to different devices)
> 
> I just tried that and you can bind to different IPs without REUSEADDR.

Of course it is possible!

REUSEADDR allows the following:

(Note that both sockets MUST have requested REUSEADDR=1)


#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>

int main(void)
{
	int sock1, sock2;
	struct sockaddr_in addr;
	int on = 1;

	memset(&addr, 0, sizeof(addr));
	addr.sin_port = htons(3444);
	addr.sin_family = AF_INET;

	/* First socket: REUSEADDR set, bound to 127.0.0.1:3444 */
	sock1 = socket(AF_INET, SOCK_DGRAM, 0);
	setsockopt(sock1, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on));
	addr.sin_addr.s_addr = htonl(0x7f000001);
	if (bind(sock1, (struct sockaddr *)&addr, sizeof(addr)))
		perror("bind1");

	/* Second socket: REUSEADDR set as well, bound to the same port */
	sock2 = socket(AF_INET, SOCK_DGRAM, 0);
	setsockopt(sock2, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on));
	addr.sin_addr.s_addr = htonl(INADDR_ANY); /* or htonl(0x7f000001); */
	if (bind(sock2, (struct sockaddr *)&addr, sizeof(addr)))
		perror("bind2");

	return 0;
}



If an application did not specify REUSEADDR=1, then its UDP port is
private; it cannot be stolen.

Therefore, applications should not use REUSEADDR on unicast UDP unless
port sharing is not a security issue for them (for example, because the
application reacts to any new IP address the administrator adds to the
machine, and complains loudly if another application managed to bind()
before it did).

REUSEADDR has a meaning for multicast, but for unicast it is hardly
useful.


About the connect() thing: it is also a fact that connected sockets have
a higher priority. They receive the incoming frames because their score
is higher than that of a non-connected socket (provided the source of
the packet matches the connect() destination, of course). The same
applies if you play with BINDTODEVICE.
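
The rule stated above (a UDP port can only be shared when both sockets asked
for REUSEADDR) can be checked with a small helper. This is a sketch that
assumes Linux behaviour; the function names are made up for illustration. It
binds the first socket to an ephemeral port on INADDR_ANY, then tries to bind
a second socket to the same address and port.

```c
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Create a UDP socket, optionally with SO_REUSEADDR enabled. */
static int udp_socket(int reuse)
{
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	if (fd >= 0 && reuse) {
		int on = 1;
		setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on));
	}
	return fd;
}

/*
 * Returns 0 if the second bind() to the same address/port succeeded,
 * the errno of the second bind() if it failed, or -1 on setup failure.
 */
static int second_bind_result(int reuse_first, int reuse_second)
{
	struct sockaddr_in addr;
	socklen_t len = sizeof(addr);
	int s1, s2, result = -1;

	s1 = udp_socket(reuse_first);
	s2 = udp_socket(reuse_second);
	if (s1 < 0 || s2 < 0)
		goto out;

	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_ANY);
	addr.sin_port = 0;	/* let the kernel pick a free port */
	if (bind(s1, (struct sockaddr *)&addr, sizeof(addr)) ||
	    getsockname(s1, (struct sockaddr *)&addr, &len))
		goto out;

	/* addr now holds the port chosen for s1; try to take it again. */
	result = bind(s2, (struct sockaddr *)&addr, sizeof(addr)) ? errno : 0;
out:
	if (s1 >= 0)
		close(s1);
	if (s2 >= 0)
		close(s2);
	return result;
}
```

On Linux, second_bind_result(1, 1) is expected to succeed, while leaving the
flag off on either socket is expected to fail with EADDRINUSE. Other systems
may differ, since the behaviour is implementation defined.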





* Re: Strange packet drops with heavy firewalling
From: Paweł Staszewski @ 2010-04-13 12:33 UTC (permalink / raw)
  To: Changli Gao; +Cc: Benny Amorsen, zhigang gong, netdev
In-Reply-To: <u2y412e6f7f1004121618p6d6eff30q8a45a03faa59a912@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 2348 bytes --]

On 2010-04-13 01:18, Changli Gao wrote:
> On Tue, Apr 13, 2010 at 1:06 AM, Benny Amorsen<benny+usenet@amorsen.dk>  wrote:
>    
>>   99:         24    1306226          3          2   PCI-MSI-edge      eth1-tx-0
>>   100:      15735    1648774          3          7   PCI-MSI-edge      eth1-tx-1
>>   101:          8         11          9    1083022   PCI-MSI-edge      eth1-tx-2
>>   102:          0          0          0          0   PCI-MSI-edge      eth1-tx-3
>>   103:         18         15       6131    1095383   PCI-MSI-edge      eth1-rx-0
>>   104:        217         32      46544    1335325   PCI-MSI-edge      eth1-rx-1
>>   105:        154    1305595        218         16   PCI-MSI-edge      eth1-rx-2
>>   106:         17         16       8229    1467509   PCI-MSI-edge      eth1-rx-3
>>   107:          0          0          1          0   PCI-MSI-edge      eth1
>>   108:          2         14         15    1003053   PCI-MSI-edge      eth0-tx-0
>>   109:       8226    1668924        478        487   PCI-MSI-edge      eth0-tx-1
>>   110:          3    1188874         17         12   PCI-MSI-edge      eth0-tx-2
>>   111:          0          0          0          0   PCI-MSI-edge      eth0-tx-3
>>   112:        203        185       5324    1015263   PCI-MSI-edge      eth0-rx-0
>>   113:       4141    1600793        153        159   PCI-MSI-edge      eth0-rx-1
>>   114:      16242    1210108        436       3124   PCI-MSI-edge      eth0-rx-2
>>   115:        267       4173      19471    1321252   PCI-MSI-edge      eth0-rx-3
>>   116:          0          1          0          0   PCI-MSI-edge      eth0
>>
>>
>> irqbalanced seems to have picked CPU1 and CPU3 for all the interrupts,
>> which to my mind should cause the same problem as before (where CPU1 and
>> CPU3 was handling all packets). Yet the box clearly works much better
>> than before.
>>      
> irqbalanced? I don't think it can work properly. Try RPS in netdev and
> linux-next tree, and if cpu load isn't even, try this patch:
> http://patchwork.ozlabs.org/patch/49915/ .
>
>
>    
Yes, without irqbalance and with IRQ affinity set by hand, the router will
work much better.

But I don't think that RPS will help him. I ran some tests with RPS
and affinity; the results are in the attached file.
The test router does traffic management (HFSC) for almost 9k users.





[-- Attachment #2: RPS_AFFINITY_TEST.txt --]
[-- Type: text/plain, Size: 5028 bytes --]

##############################################################################
eth0 -> CPU0
eth1 -> CPU5
RPS:
echo 00e0 > /sys/class/net/eth1/queues/rx-0/rps_cpus
echo 000e > /sys/class/net/eth0/queues/rx-0/rps_cpus

------------------------------------------------------------------------------
   PerfTop:   85205 irqs/sec  kernel:97.1% [100000 cycles],  (all, 8 CPUs)
------------------------------------------------------------------------------

             samples    pcnt   kernel function
             _______   _____   _______________

           214930.00 - 24.5% : _raw_spin_lock
            63844.00 -  7.3% : u32_classify
            48381.00 -  5.5% : e1000_clean
            47754.00 -  5.5% : rb_next
            37222.00 -  4.2% : e1000_intr_msi
            26295.00 -  3.0% : hfsc_enqueue
            17371.00 -  2.0% : rb_erase
            15290.00 -  1.7% : _raw_spin_lock_irqsave
            14958.00 -  1.7% : rb_insert_color
            14439.00 -  1.6% : update_vf
            14384.00 -  1.6% : e1000_xmit_frame
            14356.00 -  1.6% : hfsc_dequeue
            13804.00 -  1.6% : e1000_clean_tx_irq
            13413.00 -  1.5% : ipt_do_table
             9654.00 -  1.1% : ip_route_input

##############################################################################
eth0 -> CPU0
eth1 -> CPU5
NO RPS

------------------------------------------------------------------------------
   PerfTop:   33800 irqs/sec  kernel:96.9% [100000 cycles],  (all, 8 CPUs)
------------------------------------------------------------------------------

             samples    pcnt   kernel function
             _______   _____   _______________

            19361.00 - 11.2% : e1000_clean
            16424.00 -  9.5% : rb_next
            13060.00 -  7.5% : e1000_intr_msi
             7293.00 -  4.2% : u32_classify
             6875.00 -  4.0% : ipt_do_table
             5811.00 -  3.4% : _raw_spin_lock
             5754.00 -  3.3% : e1000_xmit_frame
             5671.00 -  3.3% : hfsc_dequeue
             4503.00 -  2.6% : __alloc_skb
             4156.00 -  2.4% : hfsc_enqueue
             4090.00 -  2.4% : e1000_clean_tx_irq
             3809.00 -  2.2% : e1000_clean_rx_irq
             3424.00 -  2.0% : update_vf
             3028.00 -  1.7% : rb_erase
             2714.00 -  1.6% : ip_route_input

##############################################################################
eth0 -> CPU0,CPU1,CPU2,CPU4 -> affinity echo 0f > /proc/irq/30/smp_affinity
eth1 -> CPU5,CPU6,CPU7,CPU8 -> affinity echo f0 > /proc/irq/31/smp_affinity
NO RPS
------------------------------------------------------------------------------
   PerfTop:   42362 irqs/sec  kernel:96.0% [100000 cycles],  (all, 8 CPUs)
------------------------------------------------------------------------------

             samples    pcnt   kernel function
             _______   _____   _______________

            33815.00 - 10.6% : rb_next
            21357.00 -  6.7% : u32_classify
            14525.00 -  4.6% : _raw_spin_lock
            14346.00 -  4.5% : e1000_clean
            12798.00 -  4.0% : hfsc_enqueue
            10526.00 -  3.3% : ipt_do_table
             9999.00 -  3.1% : hfsc_dequeue
             9976.00 -  3.1% : e1000_intr_msi
             9787.00 -  3.1% : rb_erase
             8259.00 -  2.6% : e1000_xmit_frame
             8015.00 -  2.5% : rb_insert_color
             7948.00 -  2.5% : update_vf
             6868.00 -  2.2% : e1000_clean_tx_irq
             6822.00 -  2.1% : e1000_clean_rx_irq
             6368.00 -  2.0% : __alloc_skb

##############################################################################
eth0 -> CPU0,CPU1,CPU2,CPU4 -> affinity echo 0f > /proc/irq/30/smp_affinity
eth1 -> CPU5,CPU6,CPU7,CPU8 -> affinity echo f0 > /proc/irq/31/smp_affinity
RPS:
echo 0f > /sys/class/net/eth0/queues/rx-0/rps_cpus
echo f0 > /sys/class/net/eth1/queues/rx-0/rps_cpus
------------------------------------------------------------------------------
   PerfTop:   81051 irqs/sec  kernel:96.9% [100000 cycles],  (all, 8 CPUs)
------------------------------------------------------------------------------

             samples    pcnt   kernel function
             _______   _____   _______________

           167110.00 - 22.3% : _raw_spin_lock
            58221.00 -  7.8% : u32_classify
            46379.00 -  6.2% : rb_next
            35189.00 -  4.7% : e1000_clean
            25614.00 -  3.4% : e1000_intr_msi
            24094.00 -  3.2% : hfsc_enqueue
            16231.00 -  2.2% : rb_erase
            14298.00 -  1.9% : rb_insert_color
            13751.00 -  1.8% : update_vf
            13712.00 -  1.8% : ipt_do_table
            13588.00 -  1.8% : hfsc_dequeue
            13335.00 -  1.8% : e1000_xmit_frame
            12449.00 -  1.7% : e1000_clean_tx_irq
            11510.00 -  1.5% : net_tx_action
            11428.00 -  1.5% : _raw_spin_lock_irqsave
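
A note on the masks used in these tests: the values written to rps_cpus and
smp_affinity are hexadecimal CPU bitmasks in which bit N stands for CPU N,
counting from zero. Under that reading, 0f covers CPU0 to CPU3 and f0 covers
CPU4 to CPU7, so the CPU labels above appear to mix zero-based and one-based
numbering. The helper below is a hypothetical sketch (its name is not from
any kernel or tool API) that builds such a mask from a list of CPU numbers.

```c
#include <stddef.h>

/*
 * Build a CPU bitmask from a list of zero-based CPU numbers.
 * CPU numbers that do not fit in an unsigned long are ignored.
 */
static unsigned long cpu_list_to_mask(const unsigned int *cpus, size_t n)
{
	unsigned long mask = 0;

	for (size_t i = 0; i < n; i++)
		if (cpus[i] < sizeof(mask) * 8)
			mask |= 1UL << cpus[i];
	return mask;
}
```

For example, the list {1, 2, 3} gives 0x0e and {5, 6, 7} gives 0xe0, which
match the 000e and 00e0 values in the first test.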



* Re: Strange packet drops with heavy firewalling
From: Eric Dumazet @ 2010-04-13 12:53 UTC (permalink / raw)
  To: Paweł Staszewski; +Cc: Changli Gao, Benny Amorsen, zhigang gong, netdev
In-Reply-To: <4BC464A6.9000307@itcare.pl>

On Tuesday, 13 April 2010 at 14:33 +0200, Paweł Staszewski wrote:
> On 2010-04-13 01:18, Changli Gao wrote:
> > On Tue, Apr 13, 2010 at 1:06 AM, Benny Amorsen<benny+usenet@amorsen.dk>  wrote:
> >    
> >>   99:         24    1306226          3          2   PCI-MSI-edge      eth1-tx-0
> >>   100:      15735    1648774          3          7   PCI-MSI-edge      eth1-tx-1
> >>   101:          8         11          9    1083022   PCI-MSI-edge      eth1-tx-2
> >>   102:          0          0          0          0   PCI-MSI-edge      eth1-tx-3
> >>   103:         18         15       6131    1095383   PCI-MSI-edge      eth1-rx-0
> >>   104:        217         32      46544    1335325   PCI-MSI-edge      eth1-rx-1
> >>   105:        154    1305595        218         16   PCI-MSI-edge      eth1-rx-2
> >>   106:         17         16       8229    1467509   PCI-MSI-edge      eth1-rx-3
> >>   107:          0          0          1          0   PCI-MSI-edge      eth1
> >>   108:          2         14         15    1003053   PCI-MSI-edge      eth0-tx-0
> >>   109:       8226    1668924        478        487   PCI-MSI-edge      eth0-tx-1
> >>   110:          3    1188874         17         12   PCI-MSI-edge      eth0-tx-2
> >>   111:          0          0          0          0   PCI-MSI-edge      eth0-tx-3
> >>   112:        203        185       5324    1015263   PCI-MSI-edge      eth0-rx-0
> >>   113:       4141    1600793        153        159   PCI-MSI-edge      eth0-rx-1
> >>   114:      16242    1210108        436       3124   PCI-MSI-edge      eth0-rx-2
> >>   115:        267       4173      19471    1321252   PCI-MSI-edge      eth0-rx-3
> >>   116:          0          1          0          0   PCI-MSI-edge      eth0
> >>
> >>
> >> irqbalanced seems to have picked CPU1 and CPU3 for all the interrupts,
> >> which to my mind should cause the same problem as before (where CPU1 and
> >> CPU3 was handling all packets). Yet the box clearly works much better
> >> than before.
> >>      
> > irqbalanced? I don't think it can work properly. Try RPS in netdev and
> > linux-next tree, and if cpu load isn't even, try this patch:
> > http://patchwork.ozlabs.org/patch/49915/ .
> >
> >
> >    
> Yes, without irqbalance and with IRQ affinity set by hand, the router will
> work much better.
> 
> But I don't think that RPS will help him. I ran some tests with RPS
> and affinity; the results are in the attached file.
> The test router does traffic management (HFSC) for almost 9k users.

Thanks for sharing, Pawel.

But obviously you are mixing apples and oranges.

Are you aware that HFSC and other traffic shapers serialize access to
their data structures? If many CPUs try to access these structures in
parallel, you get a lot of cache line misses. HFSC is a real memory hog :(

Benny does firewalling (highly parallelized these days; iptables was
much improved in this area), but no traffic control.

Anyway, Benny now has multiqueue devices, and therefore RPS will not
help him. I suggested RPS before his move to multiqueue, and multiqueue
is the most sensible way to improve things when no central lock is
used. Every CPU can really work in parallel.





* Re: [PATCH v2] net: batch skb dequeueing from softnet input_pkt_queue
From: Changli Gao @ 2010-04-13 12:53 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David S. Miller, netdev
In-Reply-To: <1271153942.16881.233.camel@edumazet-laptop>

On Tue, Apr 13, 2010 at 6:19 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Tuesday, 13 April 2010 at 17:50 +0800, Changli Gao wrote:
>> On Tue, Apr 13, 2010 at 4:08 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> >
>> >        Probably not necessary.
>> >
>> >> +     volatile bool           flush_processing_queue;
>> >
>> > Use of 'volatile' is strongly discouraged, I would say, forbidden.
>> >
>>
>> volatile is used to avoid compiler optimization.
>
> volatile might be used on special macros only, not to guard a variable.
> volatile was pre SMP days. We need something better defined these days.
>

flush_processing_queue is only accessed on the same CPU, so no
volatile is needed. I'll remove it in the next version.

>> >> @@ -2803,6 +2808,7 @@ static void flush_backlog(void *arg)
>> >>                       __skb_unlink(skb, &queue->input_pkt_queue);
>> >>                       kfree_skb(skb);
>> >>               }
>> >> +     queue->flush_processing_queue = true;
>> >
>> >        Probably not necessary
>> >
>>
>> If flush_backlog() is called when there are still packets in
>> processing_queue, there maybe some packets refer to the netdev gone,
>> if we remove this line.
>
> We dont need this "processing_queue". Once you remove it, there is no
> extra work to perform.

OK. Suppose we make processing_queue a stack variable. When the quota or
jiffies limit is reached, we have to splice processing_queue back into
input_pkt_queue. If flush_backlog() is called before processing_queue
is spliced back, there will still be packets that refer to the NIC that
is going away. These packets are then queued to input_pkt_queue, and
when process_backlog() is called again, the dev fields of these skbs are
wild pointers...

Oh, my GOD. When RPS is enabled, if flush_backlog(eth0) is called on
CPU1 when a skb0(eth0) is dequeued from CPU0's softnet and isn't
queued to CPU1's softnet, what will happen?

>
>> >
>> >>
>> >
>> > I advise to keep it simple.
>> >
>> > My suggestion would be to limit this patch only to process_backlog().
>> >
>> > Really if you touch other areas, there is too much risk.
>> >
>> > Perform sort of skb_queue_splice_tail_init() into a local (stack) queue,
>> > but the trick is to not touch input_pkt_queue.qlen, so that we dont slow
>> > down enqueue_to_backlog().
>> >
>> > Process at most 'quota' skbs (or jiffies limit).
>> >
>> > relock queue.
>> > input_pkt_queue.qlen -= number_of_handled_skbs;
>> >
>>
>> Oh no, in order to let latter packets in as soon as possible, we have
>> to update qlen immediately.
>>
>
> Absolutely not. You missed something apparently.
>
> You pay the price at each packet enqueue, because you have to compute
> the sum of two lengthes, and guess what, if you do this you have a cache
> line miss in one of the operand. Your patch as is is suboptimal.
>
> Remember : this batch mode should not change packet queueing at all,
> only speed it because of less cache line misses.
>

Wow, is it really so expensive?

-- 
Regards,
Changli Gao(xiaosuo@gmail.com)


* Re: [Patch 3/3] net: reserve ports for applications using fixed port numbers
From: Tetsuo Handa @ 2010-04-13 13:07 UTC (permalink / raw)
  To: amwang, sean.hefty, rolandd
  Cc: opurdila, eric.dumazet, netdev, nhorman, davem, ebiederm,
	linux-kernel
In-Reply-To: <4BC42FE0.4040601@redhat.com>

Hello.

Adding Sean Hefty and Roland Dreier as the drivers/infiniband/core/cma.c maintainers.

Cong Wang wrote:
> Cong Wang wrote:
> > Tetsuo Handa wrote:
> >> Hello.
> >>
> >>> --- linux-2.6.orig/drivers/infiniband/core/cma.c
> >>> +++ linux-2.6/drivers/infiniband/core/cma.c
> >>> @@ -1980,6 +1980,8 @@ retry:
> >>>  	/* FIXME: add proper port randomization per like inet_csk_get_port */
> >>>  	do {
> >>>  		ret = idr_get_new_above(ps, bind_list, next_port, &port);
> >>> +		if (!ret && inet_is_reserved_local_port(port))
> >>> +			ret = -EAGAIN;
> >>>  	} while ((ret == -EAGAIN) && idr_pre_get(ps, GFP_KERNEL));
> >>>  
> >>>  	if (ret)
> >>>
> >> I think above part is wrong. Below program
> > ...
> >> This result suggests that above loop will continue until idr_pre_get() fails
> >> due to out of memory if all ports were reserved.
> >>
> >> Also, if idr_get_new_above() returned 0, bind_list (which is a kmalloc()ed
> >> pointer) is already installed into a free slot (see comment on
> >> idr_get_new_above_int()). Thus, simply calling idr_get_new_above() again will
> >> install the same pointer into multiple slots. I guess it will malfunction later.
> > 
> > Thanks for testing!
> > 
> > How about:
> > 
> > +		if (!ret && inet_is_reserved_local_port(port))
> > +			ret = -EBUSY;
> > 
> > ? So that it will break the loop and return error.
> > 
> 
> Or use the similar trick:
> 
>  int tries = 10;
> ...
> 
>  if(!ret && inet_is_reserved_local_port(port)) {
>    if (tries--)
>      ret = -EAGAIN;
>    else
>      ret = -EBUSY;
>  }
> 
> Any comments?
> 
I don't like the above change. It turns local port assignment from
"likely to succeed" (succeeds if one port is available among thousands of
ports) into "unlikely to succeed" (fails if the randomly chosen port is
already in use). We should instead iterate over the whole range specified in
/proc/sys/net/ipv4/ip_local_port_range.

cma_alloc_any_port() and cma_alloc_port() are almost identical.
Thus, I think we can call cma_alloc_port() from cma_alloc_any_port().

Sean and Roland, is the patch below correct?
inet_is_reserved_local_port() is the new function proposed in this patchset.

---
 drivers/infiniband/core/cma.c |   68 ++++++++++++++----------------------------
 1 file changed, 23 insertions(+), 45 deletions(-)

--- linux-2.6.34-rc4.orig/drivers/infiniband/core/cma.c
+++ linux-2.6.34-rc4/drivers/infiniband/core/cma.c
@@ -79,7 +79,6 @@ static DEFINE_IDR(sdp_ps);
 static DEFINE_IDR(tcp_ps);
 static DEFINE_IDR(udp_ps);
 static DEFINE_IDR(ipoib_ps);
-static int next_port;
 
 struct cma_device {
 	struct list_head	list;
@@ -1970,47 +1969,31 @@ err1:
 
 static int cma_alloc_any_port(struct idr *ps, struct rdma_id_private *id_priv)
 {
-	struct rdma_bind_list *bind_list;
-	int port, ret, low, high;
-
-	bind_list = kzalloc(sizeof *bind_list, GFP_KERNEL);
-	if (!bind_list)
-		return -ENOMEM;
-
-retry:
-	/* FIXME: add proper port randomization per like inet_csk_get_port */
-	do {
-		ret = idr_get_new_above(ps, bind_list, next_port, &port);
-	} while ((ret == -EAGAIN) && idr_pre_get(ps, GFP_KERNEL));
-
-	if (ret)
-		goto err1;
+	static unsigned int last_used_port;
+	int low, high, remaining;
+	unsigned int rover;
 
 	inet_get_local_port_range(&low, &high);
-	if (port > high) {
-		if (next_port != low) {
-			idr_remove(ps, port);
-			next_port = low;
-			goto retry;
+	remaining = (high - low) + 1;
+	rover = net_random() % remaining + low;
+	do {
+		rover++;
+		if ((rover < low) || (rover > high))
+			rover = low;
+		if (last_used_port != rover &&
+		    !inet_is_reserved_local_port(rover) &&
+		    !idr_find(ps, (unsigned short) rover) &&
+		    !cma_alloc_port(ps, id_priv, rover)) {
+			/*
+			 * Remember previously used port number in order to
+			 * avoid re-using same port immediately after it is
+			 * closed.
+			 */
+			last_used_port = rover;
+			return 0;
 		}
-		ret = -EADDRNOTAVAIL;
-		goto err2;
-	}
-
-	if (port == high)
-		next_port = low;
-	else
-		next_port = port + 1;
-
-	bind_list->ps = ps;
-	bind_list->port = (unsigned short) port;
-	cma_bind_port(bind_list, id_priv);
-	return 0;
-err2:
-	idr_remove(ps, port);
-err1:
-	kfree(bind_list);
-	return ret;
+	} while (--remaining > 0);
+	return -EADDRNOTAVAIL;
 }
 
 static int cma_use_port(struct idr *ps, struct rdma_id_private *id_priv)
@@ -2995,12 +2978,7 @@ static void cma_remove_one(struct ib_dev
 
 static int __init cma_init(void)
 {
-	int ret, low, high, remaining;
-
-	get_random_bytes(&next_port, sizeof next_port);
-	inet_get_local_port_range(&low, &high);
-	remaining = (high - low) + 1;
-	next_port = ((unsigned int) next_port % remaining) + low;
+	int ret;
 
 	cma_wq = create_singlethread_workqueue("rdma_cm");
 	if (!cma_wq)
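
The search strategy in the cma_alloc_any_port() hunk above (start from a
random point, advance a "rover" through the whole local port range with
wrap-around, and take the first port that passes every check) can be tried
outside the kernel. The sketch below is a userspace approximation: the
pick_port() and port_not_in_set() names are invented here, and a simple
caller-supplied predicate stands in for the reserved-port, idr_find() and
cma_alloc_port() checks.

```c
#include <stddef.h>

/* Predicate: returns non-zero if the port may be used. */
typedef int (*port_ok_fn)(unsigned int port, void *ctx);

/*
 * Walk the range [low, high] once, starting just after 'start' and
 * wrapping around, and return the first acceptable port, or -1 if
 * every port in the range was rejected.
 */
static int pick_port(unsigned int low, unsigned int high,
		     unsigned int start, port_ok_fn ok, void *ctx)
{
	unsigned int remaining = high - low + 1;
	unsigned int rover = start;

	do {
		rover++;
		if (rover < low || rover > high)
			rover = low;
		if (ok(rover, ctx))
			return (int)rover;
	} while (--remaining > 0);
	return -1;
}

/* Example predicate: reject ports listed in a small "reserved" set. */
struct port_set {
	const unsigned int *ports;
	size_t n;
};

static int port_not_in_set(unsigned int port, void *ctx)
{
	const struct port_set *set = ctx;

	for (size_t i = 0; i < set->n; i++)
		if (set->ports[i] == port)
			return 0;
	return 1;
}
```

Because the loop visits each port in the range at most once, it terminates
with a failure when every port is rejected, rather than retrying forever.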


* Re: [PATCH v2] net: batch skb dequeueing from softnet input_pkt_queue
From: Eric Dumazet @ 2010-04-13 13:21 UTC (permalink / raw)
  To: Changli Gao; +Cc: David S. Miller, netdev
In-Reply-To: <n2t412e6f7f1004130553x85452fc0u22e512cad412abd3@mail.gmail.com>

On Tuesday, 13 April 2010 at 20:53 +0800, Changli Gao wrote:
> OK. Suppose we make processing_queue a stack variable. When the quota or
> jiffies limit is reached, we have to splice processing_queue back into
> input_pkt_queue. If flush_backlog() is called before processing_queue
> is spliced back, there will still be packets that refer to the NIC that
> is going away. These packets are then queued to input_pkt_queue, and
> when process_backlog() is called again, the dev fields of these skbs are
> wild pointers...
> 

This is a problem of cooperation between flush_backlog() and
process_backlog(). Don't allow flush_backlog() to return while
process_backlog() is running. Exactly as before, but the lock acquisition
done in flush_backlog() should be a bit smarter.

> Oh, my GOD. When RPS is enabled, if flush_backlog(eth0) is called on
> CPU1 when a skb0(eth0) is dequeued from CPU0's softnet and isn't
> queued to CPU1's softnet, what will happen?
> 

I am a bit lost here. flush_backlog() drops skbs; it does not requeue them.

> >
> > Absolutely not. You missed something apparently.
> >
> > You pay the price at each packet enqueue, because you have to compute
> > the sum of two lengthes, and guess what, if you do this you have a cache
> > line miss in one of the operand. Your patch as is is suboptimal.
> >
> > Remember : this batch mode should not change packet queueing at all,
> > only speed it because of less cache line misses.
> >
> 
> WoW, is it really so expensive?
> 

Yes. The whole point of your idea is to remove cache line misses.

They cost much more than a spinlock/unlock pair.
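
The batching scheme suggested earlier in this thread (detach the whole input
queue under the lock without touching qlen, process at most 'quota' packets
with the lock dropped, then relock and subtract the number handled) can be
sketched in userspace. This is an approximation, not kernel code: the q_ and
pkt names are invented, and a C11 atomic_flag spinlock stands in for the
kernel's queue lock.

```c
#include <stdatomic.h>
#include <stddef.h>

struct pkt {
	struct pkt *next;
	int handled;
};

struct pkt_queue {
	atomic_flag lock;	/* stand-in for the kernel spinlock */
	struct pkt *head, *tail;
	unsigned int qlen;	/* includes packets currently being processed */
};

static void q_lock(struct pkt_queue *q)
{
	while (atomic_flag_test_and_set_explicit(&q->lock, memory_order_acquire))
		;
}

static void q_unlock(struct pkt_queue *q)
{
	atomic_flag_clear_explicit(&q->lock, memory_order_release);
}

static void q_init(struct pkt_queue *q)
{
	atomic_flag_clear(&q->lock);
	q->head = q->tail = NULL;
	q->qlen = 0;
}

static void q_enqueue(struct pkt_queue *q, struct pkt *p)
{
	p->next = NULL;
	q_lock(q);
	if (q->tail)
		q->tail->next = p;
	else
		q->head = p;
	q->tail = p;
	q->qlen++;
	q_unlock(q);
}

/* Example handler used for demonstration only. */
static void mark_handled(struct pkt *p)
{
	p->handled = 1;
}

/* Process up to 'quota' packets in one batch; returns the number handled. */
static unsigned int q_process_batch(struct pkt_queue *q, unsigned int quota,
				    void (*handle)(struct pkt *))
{
	struct pkt *batch, *batch_tail;
	unsigned int handled = 0;

	q_lock(q);
	batch = q->head;		/* detach the whole list ... */
	batch_tail = q->tail;
	q->head = q->tail = NULL;	/* ... but leave qlen untouched */
	q_unlock(q);

	while (batch && handled < quota) {
		struct pkt *p = batch;

		batch = p->next;
		p->next = NULL;
		handle(p);
		handled++;
	}

	q_lock(q);
	if (batch) {
		/* Put leftovers back in front of packets that arrived meanwhile. */
		batch_tail->next = q->head;
		if (!q->tail)
			q->tail = batch_tail;
		q->head = batch;
	}
	q->qlen -= handled;
	q_unlock(q);
	return handled;
}
```

The enqueue path only ever reads and writes one length field, which is the
property argued for above. The sketch does not address the flush_backlog()
cooperation problem discussed in this thread.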


* Re: [PATCH v2] net: batch skb dequeueing from softnet input_pkt_queue
From: Changli Gao @ 2010-04-13 13:38 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David S. Miller, netdev
In-Reply-To: <1271164894.16881.342.camel@edumazet-laptop>

On Tue, Apr 13, 2010 at 9:21 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> This is a problem of cooperation between flush_backlog() and
> process_backlog(). Don't allow flush_backlog() to return while
> process_backlog() is running. Exactly as before, but the lock acquisition
> done in flush_backlog() should be a bit smarter.
>

flush_backlog() is called in IRQ context. Unless you disable IRQs in
process_backlog(), you can't block flush_backlog().

>
>> Oh, my GOD. When RPS is enabled, if flush_backlog(eth0) is called on
>> CPU1 when a skb0(eth0) is dequeued from CPU0's softnet and isn't
>> queued to CPU1's softnet, what will happen?
>>
>
> I am a bit lost here. flush_backlog() drops skbs; it does not requeue them.
>

I mean that flush_backlog() doesn't drop all the packets whose dev points
to a particular net_device; some of them may not be processed before the
net_device disappears.

-- 
Regards,
Changli Gao(xiaosuo@gmail.com)


* Re: Strange packet drops with heavy firewalling
From: Paweł Staszewski @ 2010-04-13 13:39 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Changli Gao, Benny Amorsen, zhigang gong, netdev
In-Reply-To: <1271163184.16881.307.camel@edumazet-laptop>

On 2010-04-13 14:53, Eric Dumazet wrote:
> On Tuesday, 13 April 2010 at 14:33 +0200, Paweł Staszewski wrote:
>
>> On 2010-04-13 01:18, Changli Gao wrote:
>>      
>>> On Tue, Apr 13, 2010 at 1:06 AM, Benny Amorsen<benny+usenet@amorsen.dk>   wrote:
>>>
>>>        
>>>>    99:         24    1306226          3          2   PCI-MSI-edge      eth1-tx-0
>>>>    100:      15735    1648774          3          7   PCI-MSI-edge      eth1-tx-1
>>>>    101:          8         11          9    1083022   PCI-MSI-edge      eth1-tx-2
>>>>    102:          0          0          0          0   PCI-MSI-edge      eth1-tx-3
>>>>    103:         18         15       6131    1095383   PCI-MSI-edge      eth1-rx-0
>>>>    104:        217         32      46544    1335325   PCI-MSI-edge      eth1-rx-1
>>>>    105:        154    1305595        218         16   PCI-MSI-edge      eth1-rx-2
>>>>    106:         17         16       8229    1467509   PCI-MSI-edge      eth1-rx-3
>>>>    107:          0          0          1          0   PCI-MSI-edge      eth1
>>>>    108:          2         14         15    1003053   PCI-MSI-edge      eth0-tx-0
>>>>    109:       8226    1668924        478        487   PCI-MSI-edge      eth0-tx-1
>>>>    110:          3    1188874         17         12   PCI-MSI-edge      eth0-tx-2
>>>>    111:          0          0          0          0   PCI-MSI-edge      eth0-tx-3
>>>>    112:        203        185       5324    1015263   PCI-MSI-edge      eth0-rx-0
>>>>    113:       4141    1600793        153        159   PCI-MSI-edge      eth0-rx-1
>>>>    114:      16242    1210108        436       3124   PCI-MSI-edge      eth0-rx-2
>>>>    115:        267       4173      19471    1321252   PCI-MSI-edge      eth0-rx-3
>>>>    116:          0          1          0          0   PCI-MSI-edge      eth0
>>>>
>>>>
>>>> irqbalanced seems to have picked CPU1 and CPU3 for all the interrupts,
>>>> which to my mind should cause the same problem as before (where CPU1 and
>>>> CPU3 was handling all packets). Yet the box clearly works much better
>>>> than before.
>>>>
>>>>          
>>> irqbalanced? I don't think it can work properly. Try RPS in netdev and
>>> linux-next tree, and if cpu load isn't even, try this patch:
>>> http://patchwork.ozlabs.org/patch/49915/ .
>>>
>>>
>>>
>>>        
>> Yes, without irqbalance, and with IRQ affinity set by hand, the router
>> will work much better.
>>
>> But I don't think that RPS will help him. I ran some tests with RPS
>> and AFFINITY; the results are in the attached file.
>> The test router does traffic management (hfsc) for almost 9k users.
>>      
> Thanks for sharing Pawel.
>
> But obviously you are mixing apples and oranges.
>
> Are you aware that HFSC and other traffic shapers serialize access to
> their data structures? If many CPUs try to access these structures in
> parallel, you get a lot of cache line misses. HFSC is a real memory hog :(
>
>    
Thanks, Eric, for explaining why RPS is useless for traffic management 
routers.

> Benny does have firewalling (highly parallelized these days; iptables was
> much improved in this area), but no traffic control.
>
>    
Hmm, so maybe a better choice for traffic management is to use iptables for 
"filter classification" instead of "u32 filters", something like the 
iptables CLASSIFY target.

> Anyway, Benny now has multiqueue devices, and therefore RPS will not
> help him. I suggested RPS before his move to multiqueue, and multiqueue
> is the most sensible way to improve things when no central lock is
> used. Every CPU can really work in parallel.


^ permalink raw reply

* Re: forcedeth driver hangs under heavy load
From: stephen mulcahy @ 2010-04-13 14:27 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: Eric Dumazet, netdev, Ben Hutchings, Ayaz Abdulla, 572201
In-Reply-To: <1271160298.2098.0.camel@achroite.uk.solarflarecom.com>

Ok, I've tried both of the following with my reproducer

1. ethtool -K eth0 tso off

RESULT: reproducer causes multiple hosts to become unresponsive on 
first run.

2. ethtool -K eth0 tx off

RESULT: reproducer runs three times without any hosts becoming unresponsive.

-stephen

^ permalink raw reply

* Re: [PATCH v2] net: batch skb dequeueing from softnet input_pkt_queue
From: Eric Dumazet @ 2010-04-13 14:37 UTC (permalink / raw)
  To: Changli Gao; +Cc: David S. Miller, netdev
In-Reply-To: <x2z412e6f7f1004130638hb72b121cne442d722427de5a4@mail.gmail.com>

On Tuesday 13 April 2010 at 21:38 +0800, Changli Gao wrote:
> On Tue, Apr 13, 2010 at 9:21 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >
> > This is a problem of cooperation between flush_backlog() and
> > process_backlog(). Dont allow flush_backlog() to return if
> > process_backlog() is running. Exactly as before, but lock acquisition
> > done in flush_backlog() should be a bit smarter.
> >
> 
> flush_backlog() is called in IRQ context. Unless you disable irq in
> process_backlog(), you can't block flush_backlog().
> 

There is nothing preventing flush_backlog() from being done differently, you
know. It was done like that because it was the simplest thing to do
given the (basic) constraints. Now if the constraints change, the
implementation might change too. It is a slow path (in most setups), and
some extra work to keep the fast path really fast is OK.

Netdevices are dismantled and we respect an RCU grace period before
freeing them. process_backlog() runs under the RCU read lock, so everything
is possible.

^ permalink raw reply

* Re: forcedeth driver hangs under heavy load
From: Eric Dumazet @ 2010-04-13 14:42 UTC (permalink / raw)
  To: stephen mulcahy
  Cc: Ben Hutchings, netdev, Ben Hutchings, Ayaz Abdulla, 572201
In-Reply-To: <4BC47F38.5040509@gmail.com>

On Tuesday 13 April 2010 at 15:27 +0100, stephen mulcahy wrote:
> Ok, I've tried both of the following with my reproducer
> 
> 1. ethtool -K eth0 tso off
> 
> RESULT: reproducer causes multiple hosts to become unresponsive on 
> first run.
> 
> 2. ethtool -K eth0 tx off
> 
> RESULT: reproducer runs three times without any hosts becoming unresponsive.
> 
> -stephen

Thanks Stephen !

Now, any brave fools to check the 6410 lines of this driver? ;)

Question of the day: why is TSO broken in forcedeth?
Is it generically broken, or is it broken for specific NICs?



^ permalink raw reply

* RE: [PATCH 2/2] [V5] Add non-Virtex5 support for LL TEMAC driver
From: John Linn @ 2010-04-13 14:43 UTC (permalink / raw)
  To: David Miller, grant.likely
  Cc: netdev, linuxppc-dev, jwboyer, eric.dumazet, john.williams,
	michal.simek, jtyner
In-Reply-To: <20100413.013403.184052247.davem@davemloft.net>

> -----Original Message-----
> From: David Miller [mailto:davem@davemloft.net]
> Sent: Tuesday, April 13, 2010 2:34 AM
> To: grant.likely@secretlab.ca
> Cc: John Linn; netdev@vger.kernel.org; linuxppc-dev@ozlabs.org;
jwboyer@linux.vnet.ibm.com;
> eric.dumazet@gmail.com; john.williams@petalogix.com;
michal.simek@petalogix.com; jtyner@cs.ucr.edu
> Subject: Re: [PATCH 2/2] [V5] Add non-Virtex5 support for LL TEMAC
driver
> 
> From: Grant Likely <grant.likely@secretlab.ca>
> Date: Fri, 9 Apr 2010 12:10:21 -0
> 
> > On Thu, Apr 8, 2010 at 11:08 AM, John Linn <john.linn@xilinx.com> wrote:
> >> This patch adds support for using the LL TEMAC Ethernet driver on
> >> non-Virtex 5 platforms by adding support for accessing the Soft DMA
> >> registers as if they were memory mapped instead of solely through the
> >> DCR's (available on the Virtex 5).
> >>
> >> The patch also updates the driver so that it runs on the MicroBlaze.
> >> The changes were tested on the PowerPC 440, PowerPC 405, and the
> >> MicroBlaze platforms.
> >>
> >> Signed-off-by: John Tyner <jtyner@cs.ucr.edu>
> >> Signed-off-by: John Linn <john.linn@xilinx.com>
> >
> > Picked up and build tested both patches on 405, 440, 60x and ppc64.
> > No build problems found either built-in or as a module.
> >
> > for both:
> > Acked-by: Grant Likely <grant.likely@secretlab.ca>
> 
> Ok, both applied to net-next-2.6, thanks everyone for sorting this
> out.

Great! Thanks David, appreciate the help.


^ permalink raw reply

* Re: forcedeth driver hangs under heavy load
From: stephen mulcahy @ 2010-04-13 14:49 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Ben Hutchings, netdev, Ben Hutchings, Ayaz Abdulla, 572201
In-Reply-To: <1271169741.16881.437.camel@edumazet-laptop>

Eric Dumazet wrote:
> On Tuesday 13 April 2010 at 15:27 +0100, stephen mulcahy wrote:
>> Ok, I've tried both of the following with my reproducer
>>
>> 1. ethtool -K eth0 tso off
>>
>> RESULT: reproducer causes multiple hosts to become unresponsive on 
>> first run.
>>
>> 2. ethtool -K eth0 tx off
>>
>> RESULT: reproducer runs three times without any hosts becoming unresponsive.
>>
>> -stephen
> 
> Thanks Stephen !
> 
> Now, any brave fools to check the 6410 lines of this driver? ;)
> 
> Question of the day: why is TSO broken in forcedeth?
> Is it generically broken, or is it broken for specific NICs?
> 

Actually, it is only when tx-checksumming is turned off that the problem 
doesn't occur (so I'm not sure TSO is the problem).

Additionally, a Google search also turns up this existing Debian bug, 
http://bugs.debian.org/506419, which seems to be related.

-stephen


^ permalink raw reply

* [PATCH] tun: orphan an skb on tx
From: Michael S. Tsirkin @ 2010-04-13 14:59 UTC (permalink / raw)
  Cc: David S. Miller, Herbert Xu, Michael S. Tsirkin, Paul Moore,
	David Woodhouse, netdev, linux-kernel, Jan Kiszka, qemu-devel

The following situation was observed in the field:
tap1 sends packets, tap2 does not consume them, and as a result
tap1 cannot be closed. This happens because
tun/tap devices can hang on to skbs indefinitely.

As noted by Herbert, possible solutions include a timeout followed by a
copy/change of ownership of the skb, or always copying/changing
ownership if we're going into a hostile device.

This patch implements the second approach.

Note: one issue still remaining is that since skbs
keep a reference to the tun socket and the tun socket has a
reference to the tun device, we won't flush the backlog,
instead simply waiting for all skbs to get transmitted.
At least this is not user-triggerable, and
this was not reported in practice; my assumption is that
other devices besides tap complete an skb
within finite time after it has been queued.

A possible solution for this remaining issue
would be not to have the socket reference the device and
instead implement dev->destructor for tun and
wait for all skbs to complete there, but this
needs some thought and is probably too risky for 2.6.34.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Tested-by: Yan Vugenfirer <yvugenfi@redhat.com>

---

Please review the below, and consider for 2.6.34,
and stable trees.
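
For readers who want to see the ownership idea in isolation, here is a
minimal userspace sketch. It is not the kernel code: model_sock,
model_skb and every model_* function are hypothetical stand-ins, and the
accounting is reduced to a single counter. It only illustrates why
dropping the sender's ownership before queueing releases the sender even
when the receiving tap never reads.

```c
#include <stddef.h>

/* Hypothetical stand-ins for struct sock and struct sk_buff. */
struct model_sock {
	int wmem_alloc;		/* bytes currently charged to this sender */
};

struct model_skb {
	struct model_sock *sk;			/* current owner, or NULL */
	void (*destructor)(struct model_skb *);	/* releases the charge */
	int truesize;
};

/* Uncharge the owning socket, as a write-side destructor would. */
static void model_wfree(struct model_skb *skb)
{
	skb->sk->wmem_alloc -= skb->truesize;
}

/* Charge the skb to a sender socket. */
static void model_set_owner_w(struct model_skb *skb, struct model_sock *sk)
{
	skb->sk = sk;
	skb->destructor = model_wfree;
	sk->wmem_alloc += skb->truesize;
}

/* Models orphaning: run the destructor, then forget the owner. */
static void model_skb_orphan(struct model_skb *skb)
{
	if (skb->destructor)
		skb->destructor(skb);
	skb->destructor = NULL;
	skb->sk = NULL;
}

/*
 * Queue one 1500-byte skb to a tap that never reads it and return
 * how many bytes are still charged to the sender afterwards.
 */
static int model_queue_to_tap(int orphan_first)
{
	struct model_sock sender = { 0 };
	struct model_skb skb = { NULL, NULL, 1500 };

	model_set_owner_w(&skb, &sender);
	if (orphan_first)
		model_skb_orphan(&skb);
	/* The skb now sits in the receive queue indefinitely. */
	return sender.wmem_alloc;
}
```

In this model, model_queue_to_tap(0) leaves 1500 bytes charged to the
sender for as long as the packet stays queued, while
model_queue_to_tap(1) leaves 0.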

 drivers/net/tun.c |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 96c39bd..4326520 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -387,6 +387,10 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
 		}
 	}

+	/* Orphan the skb - required as we might hang on to it
+	 * for indefinite time. */
+	skb_orphan(skb);
+
 	/* Enqueue packet */
 	skb_queue_tail(&tun->socket.sk->sk_receive_queue, skb);
 	dev->trans_start = jiffies;
-- 
1.7.0.2.280.gc6f05

^ permalink raw reply related

* Re: forcedeth driver hangs under heavy load
From: stephen mulcahy @ 2010-04-13 15:00 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Ben Hutchings, netdev, Ben Hutchings, Ayaz Abdulla, 572201
In-Reply-To: <4BC48460.4040001@gmail.com>

stephen mulcahy wrote:
>> Now, any brave fools to check the 6410 lines of this driver? ;)
>>
>> Question of the day: why is TSO broken in forcedeth?
>> Is it generically broken, or is it broken for specific NICs?
>>
> 
> Actually, it is only when tx-checksumming is turned off that the problem 
> doesn't occur (so I'm not sure TSO is the problem).
> 
> Additionally, a Google search also turns up this existing Debian bug, 
> http://bugs.debian.org/506419, which seems to be related.

As mentioned in the original Debian bug, I can reproduce this by 
running Hadoop[1] TeraSort[2], but I haven't identified a simpler 
reproducer. I tried to recreate this with iperf and ping -f, but neither 
helped. It may be that the problem only occurs when systems are passing 
large amounts of traffic and have very high CPU utilisation (when 
running the Hadoop TeraSort, all 8 cores run at 70-100% utilisation as 
measured with htop; I plan to instrument the nodes with something like 
Zabbix or Ganglia but it hasn't happened yet).

-stephen

[1] http://hadoop.apache.org/
[2] http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/examples/terasort/package-summary.html

^ permalink raw reply

* Re: [PATCH] vhost-net: fix vq_memory_access_ok error checking
From: Michael S. Tsirkin @ 2010-04-13 15:01 UTC (permalink / raw)
  To: Jeff Dike
  Cc: Juan Quintela, Rusty Russell, David S. Miller, Laurent Chavey,
	kvm, virtualization, netdev, linux-kernel

On Wed, Apr 07, 2010 at 09:59:10AM -0400, Jeff Dike wrote:
> vq_memory_access_ok needs to check whether mem == NULL
> 
> Signed-off-by: Jeff Dike <jdike@linux.intel.com>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

This was already queued by me, you do not need to
fill Dave's inbox with vhost patches.

> ---
>  drivers/vhost/vhost.c |    4 ++++
>  1 files changed, 4 insertions(+), 0 deletions(-)
> 
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 7bd7a1e..b8e1127 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -235,6 +235,10 @@ static int vq_memory_access_ok(void __user *log_base, struct vhost_memory *mem,
>  			       int log_all)
>  {
>  	int i;
> +
> +        if (!mem)
> +                return 0;
> +
>  	for (i = 0; i < mem->nregions; ++i) {
>  		struct vhost_memory_region *m = mem->regions + i;
>  		unsigned long a = m->userspace_addr;
> -- 
> 1.7.0.2.280.gc6f05

^ permalink raw reply

* Re: [PATCH] vhost-net: fix vq_memory_access_ok error checking
From: Michael S. Tsirkin @ 2010-04-13 15:02 UTC (permalink / raw)
  To: Jeff Dike
  Cc: Juan Quintela, Rusty Russell, David S. Miller, Laurent Chavey,
	kvm, virtualization, netdev, linux-kernel
In-Reply-To: <20100413150121.GB7716@redhat.com>

On Tue, Apr 13, 2010 at 06:01:21PM +0300, Michael S. Tsirkin wrote:
> On Wed, Apr 07, 2010 at 09:59:10AM -0400, Jeff Dike wrote:
> > vq_memory_access_ok needs to check whether mem == NULL
> > 
> > Signed-off-by: Jeff Dike <jdike@linux.intel.com>
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> 
> This was already queued by me, you do not need to
> fill Dave's inbox with vhost patches.

This was sent in error, please ignore.
Sorry about the noise.

> > ---
> >  drivers/vhost/vhost.c |    4 ++++
> >  1 files changed, 4 insertions(+), 0 deletions(-)
> > 
> > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > index 7bd7a1e..b8e1127 100644
> > --- a/drivers/vhost/vhost.c
> > +++ b/drivers/vhost/vhost.c
> > @@ -235,6 +235,10 @@ static int vq_memory_access_ok(void __user *log_base, struct vhost_memory *mem,
> >  			       int log_all)
> >  {
> >  	int i;
> > +
> > +        if (!mem)
> > +                return 0;
> > +
> >  	for (i = 0; i < mem->nregions; ++i) {
> >  		struct vhost_memory_region *m = mem->regions + i;
> >  		unsigned long a = m->userspace_addr;
> > -- 
> > 1.7.0.2.280.gc6f05

^ permalink raw reply

* [PATCH 1/9] net: fib_rules: consolidate IPv4 and DECnet ->default_pref() functions.
From: Patrick McHardy @ 2010-04-13 15:03 UTC (permalink / raw)
  To: davem; +Cc: netdev
In-Reply-To: <1271171003-11901-1-git-send-email-kaber@trash.net>

Both functions are equivalent, consolidate them since a following patch
needs a third implementation for multicast routing.

Signed-off-by: Patrick McHardy <kaber@trash.net>
---
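Note for readers: the consolidated helper looks at the second rule in
the list and returns its preference minus one, or 0 if there is no
second rule or its preference is 0. The userspace sketch below models
that logic only. The list helpers and the trimmed-down structs are
stand-ins written for this example, and model_pref_with_rules() is a
hypothetical driver that fills the list with preferences 0, 32766 and
32767 as sample values.

```c
#include <stddef.h>

/* Minimal userspace stand-ins for the kernel's list_head helpers. */
struct list_head {
	struct list_head *next, *prev;
};

#define list_entry(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

static void init_list_head(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

static void list_add_tail(struct list_head *item, struct list_head *head)
{
	item->prev = head->prev;
	item->next = head;
	head->prev->next = item;
	head->prev = item;
}

static int list_empty(const struct list_head *head)
{
	return head->next == head;
}

/* Hypothetical, trimmed-down versions of the kernel structures. */
struct fib_rule {
	struct list_head list;
	unsigned int pref;
};

struct fib_rules_ops {
	struct list_head rules_list;
};

/* Same control flow as the helper added by this patch. */
static unsigned int model_default_rule_pref(struct fib_rules_ops *ops)
{
	struct list_head *pos;
	struct fib_rule *rule;

	if (!list_empty(&ops->rules_list)) {
		pos = ops->rules_list.next;
		if (pos->next != &ops->rules_list) {
			rule = list_entry(pos->next, struct fib_rule, list);
			if (rule->pref)
				return rule->pref - 1;
		}
	}
	return 0;
}

/* Build a list from the first n sample rules and query the default. */
static unsigned int model_pref_with_rules(int n)
{
	static const unsigned int prefs[] = { 0, 32766, 32767 };
	struct fib_rules_ops ops;
	struct fib_rule rules[3];
	int i;

	init_list_head(&ops.rules_list);
	for (i = 0; i < n && i < 3; i++) {
		rules[i].pref = prefs[i];
		list_add_tail(&rules[i].list, &ops.rules_list);
	}
	return model_default_rule_pref(&ops);
}
```

With all three sample rules present the model returns 32765; with one
rule or none it returns 0.
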
 include/net/fib_rules.h |    1 +
 net/core/fib_rules.c    |   18 ++++++++++++++++++
 net/decnet/dn_rules.c   |   19 +------------------
 net/ipv4/fib_rules.c    |   19 +------------------
 4 files changed, 21 insertions(+), 36 deletions(-)

diff --git a/include/net/fib_rules.h b/include/net/fib_rules.h
index c49086d..52bd9e6 100644
--- a/include/net/fib_rules.h
+++ b/include/net/fib_rules.h
@@ -114,4 +114,5 @@ extern int			fib_rules_lookup(struct fib_rules_ops *,
 extern int			fib_default_rule_add(struct fib_rules_ops *,
 						     u32 pref, u32 table,
 						     u32 flags);
+extern u32			fib_default_rule_pref(struct fib_rules_ops *ops);
 #endif
diff --git a/net/core/fib_rules.c b/net/core/fib_rules.c
index 05cce4e..1eb3227 100644
--- a/net/core/fib_rules.c
+++ b/net/core/fib_rules.c
@@ -39,6 +39,24 @@ int fib_default_rule_add(struct fib_rules_ops *ops,
 }
 EXPORT_SYMBOL(fib_default_rule_add);
 
+u32 fib_default_rule_pref(struct fib_rules_ops *ops)
+{
+	struct list_head *pos;
+	struct fib_rule *rule;
+
+	if (!list_empty(&ops->rules_list)) {
+		pos = ops->rules_list.next;
+		if (pos->next != &ops->rules_list) {
+			rule = list_entry(pos->next, struct fib_rule, list);
+			if (rule->pref)
+				return rule->pref - 1;
+		}
+	}
+
+	return 0;
+}
+EXPORT_SYMBOL(fib_default_rule_pref);
+
 static void notify_rule_change(int event, struct fib_rule *rule,
 			       struct fib_rules_ops *ops, struct nlmsghdr *nlh,
 			       u32 pid);
diff --git a/net/decnet/dn_rules.c b/net/decnet/dn_rules.c
index 7466c54..2d14093 100644
--- a/net/decnet/dn_rules.c
+++ b/net/decnet/dn_rules.c
@@ -212,23 +212,6 @@ nla_put_failure:
 	return -ENOBUFS;
 }
 
-static u32 dn_fib_rule_default_pref(struct fib_rules_ops *ops)
-{
-	struct list_head *pos;
-	struct fib_rule *rule;
-
-	if (!list_empty(&dn_fib_rules_ops->rules_list)) {
-		pos = dn_fib_rules_ops->rules_list.next;
-		if (pos->next != &dn_fib_rules_ops->rules_list) {
-			rule = list_entry(pos->next, struct fib_rule, list);
-			if (rule->pref)
-				return rule->pref - 1;
-		}
-	}
-
-	return 0;
-}
-
 static void dn_fib_rule_flush_cache(struct fib_rules_ops *ops)
 {
 	dn_rt_cache_flush(-1);
@@ -243,7 +226,7 @@ static struct fib_rules_ops dn_fib_rules_ops_template = {
 	.configure	= dn_fib_rule_configure,
 	.compare	= dn_fib_rule_compare,
 	.fill		= dn_fib_rule_fill,
-	.default_pref	= dn_fib_rule_default_pref,
+	.default_pref	= fib_default_rule_pref,
 	.flush_cache	= dn_fib_rule_flush_cache,
 	.nlgroup	= RTNLGRP_DECnet_RULE,
 	.policy		= dn_fib_rule_policy,
diff --git a/net/ipv4/fib_rules.c b/net/ipv4/fib_rules.c
index ca2d07b..73b6784 100644
--- a/net/ipv4/fib_rules.c
+++ b/net/ipv4/fib_rules.c
@@ -234,23 +234,6 @@ nla_put_failure:
 	return -ENOBUFS;
 }
 
-static u32 fib4_rule_default_pref(struct fib_rules_ops *ops)
-{
-	struct list_head *pos;
-	struct fib_rule *rule;
-
-	if (!list_empty(&ops->rules_list)) {
-		pos = ops->rules_list.next;
-		if (pos->next != &ops->rules_list) {
-			rule = list_entry(pos->next, struct fib_rule, list);
-			if (rule->pref)
-				return rule->pref - 1;
-		}
-	}
-
-	return 0;
-}
-
 static size_t fib4_rule_nlmsg_payload(struct fib_rule *rule)
 {
 	return nla_total_size(4) /* dst */
@@ -272,7 +255,7 @@ static struct fib_rules_ops fib4_rules_ops_template = {
 	.configure	= fib4_rule_configure,
 	.compare	= fib4_rule_compare,
 	.fill		= fib4_rule_fill,
-	.default_pref	= fib4_rule_default_pref,
+	.default_pref	= fib_default_rule_pref,
 	.nlmsg_payload	= fib4_rule_nlmsg_payload,
 	.flush_cache	= fib4_rule_flush_cache,
 	.nlgroup	= RTNLGRP_IPV4_RULE,
-- 
1.7.0.4


^ permalink raw reply related

* [PATCH 4/9] ipv4: raw: move struct raw_sock and raw_sk() to include/net/raw.h
From: Patrick McHardy @ 2010-04-13 15:03 UTC (permalink / raw)
  To: davem; +Cc: netdev
In-Reply-To: <1271171003-11901-1-git-send-email-kaber@trash.net>

A following patch will use struct raw_sock to store state for ipmr,
so having the definitions in icmp.h doesn't fit very well anymore.

Signed-off-by: Patrick McHardy <kaber@trash.net>
---
 include/net/icmp.h |   11 -----------
 include/net/raw.h  |   12 ++++++++++++
 2 files changed, 12 insertions(+), 11 deletions(-)

diff --git a/include/net/icmp.h b/include/net/icmp.h
index 15b3dfe..6e991e0 100644
--- a/include/net/icmp.h
+++ b/include/net/icmp.h
@@ -48,15 +48,4 @@ extern void	icmp_out_count(struct net *net, unsigned char type);
 /* Move into dst.h ? */
 extern int 	xrlim_allow(struct dst_entry *dst, int timeout);
 
-struct raw_sock {
-	/* inet_sock has to be the first member */
-	struct inet_sock   inet;
-	struct icmp_filter filter;
-};
-
-static inline struct raw_sock *raw_sk(const struct sock *sk)
-{
-	return (struct raw_sock *)sk;
-}
-
 #endif	/* _ICMP_H */
diff --git a/include/net/raw.h b/include/net/raw.h
index 6c14a65..67cc643 100644
--- a/include/net/raw.h
+++ b/include/net/raw.h
@@ -19,6 +19,7 @@
 
 
 #include <net/protocol.h>
+#include <linux/icmp.h>
 
 extern struct proto raw_prot;
 
@@ -56,4 +57,15 @@ int raw_seq_open(struct inode *ino, struct file *file,
 void raw_hash_sk(struct sock *sk);
 void raw_unhash_sk(struct sock *sk);
 
+struct raw_sock {
+	/* inet_sock has to be the first member */
+	struct inet_sock   inet;
+	struct icmp_filter filter;
+};
+
+static inline struct raw_sock *raw_sk(const struct sock *sk)
+{
+	return (struct raw_sock *)sk;
+}
+
 #endif	/* _RAW_H */
-- 
1.7.0.4


^ permalink raw reply related

* [PATCH 0/9] net: support multiple independant multicast routing instances
From: Patrick McHardy @ 2010-04-13 15:03 UTC (permalink / raw)
  To: davem; +Cc: netdev

Hi Dave,

this is an updated patchset of my patches to support multiple independent
multicast routing instances. Changes since the last posting are:

- rebase to the current net-next-2.6.git tree
- fix up patch subjects to consistently refer to "ipv4: ipmr:"
- fix up list_head conversion patch to add new elements at the head of
  the list instead of at the tail

Please apply or pull from:

git://git.kernel.org/pub/scm/linux/kernel/git/kaber/ipmr-2.6.git master

Thanks!


^ permalink raw reply

* [PATCH 3/9] net: fib_rules: decouple address families from real address families
From: Patrick McHardy @ 2010-04-13 15:03 UTC (permalink / raw)
  To: davem; +Cc: netdev
In-Reply-To: <1271171003-11901-1-git-send-email-kaber@trash.net>

Decouple the address family values used for fib_rules from the real
address families in socket.h. This allows fib_rules to be used for
code that is not a real address family without increasing AF_MAX/NPROTO.

Values up to 127 are reserved for real address families and map directly
to the corresponding AF value; values starting from 128 are for other
uses. rtnetlink is changed to invoke the AF_UNSPEC dumpit/doit handlers
for these families.

Signed-off-by: Patrick McHardy <kaber@trash.net>
---
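Note for readers: the rtnetlink part boils down to a bounds-checked
table lookup with a fallback. The sketch below is a simplified userspace
model of that idea, not the kernel code. It keeps one handler per family
instead of a per-message table, MODEL_NPROTO is an assumed table size,
and every model_* name is hypothetical. The extra check for negative
values is an addition for the sketch.

```c
#include <stddef.h>

#define MODEL_NPROTO	36	/* assumed table size for this sketch */
#define MODEL_PF_UNSPEC	0
#define MODEL_AF_INET	2

typedef int (*model_doit_func)(void);

/* Each handler returns the family it was registered for. */
static int model_unspec_doit(void)
{
	return MODEL_PF_UNSPEC;
}

static int model_inet_doit(void)
{
	return MODEL_AF_INET;
}

/* Families without an entry are left as NULL. */
static model_doit_func model_handlers[MODEL_NPROTO] = {
	[MODEL_PF_UNSPEC] = model_unspec_doit,
	[MODEL_AF_INET]   = model_inet_doit,
};

/*
 * Look up the handler for a family. Values outside the table
 * (for example 128 and up) and families with no handler of their
 * own both fall back to the unspecified-family handler.
 */
static model_doit_func model_get_doit(int protocol)
{
	model_doit_func fn = NULL;

	if (protocol >= 0 && protocol < MODEL_NPROTO)
		fn = model_handlers[protocol];
	if (fn == NULL)
		fn = model_handlers[MODEL_PF_UNSPEC];
	return fn;
}
```

In this model, family 2 gets its own handler, while an unregistered
family such as 5 and an out-of-range family such as 128 both resolve to
the unspecified-family handler instead of being rejected.
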
 include/linux/fib_rules.h |    7 +++++++
 net/core/rtnetlink.c      |   15 ++++++++++-----
 net/decnet/dn_rules.c     |    2 +-
 net/ipv4/fib_rules.c      |    2 +-
 net/ipv6/fib6_rules.c     |    2 +-
 5 files changed, 20 insertions(+), 8 deletions(-)

diff --git a/include/linux/fib_rules.h b/include/linux/fib_rules.h
index 51da65b..405e411 100644
--- a/include/linux/fib_rules.h
+++ b/include/linux/fib_rules.h
@@ -15,6 +15,13 @@
 /* try to find source address in routing lookups */
 #define FIB_RULE_FIND_SADDR	0x00010000
 
+/* fib_rules families. values up to 127 are reserved for real address
+ * families, values above 128 may be used arbitrarily.
+ */
+#define FIB_RULES_IPV4		AF_INET
+#define FIB_RULES_IPV6		AF_INET6
+#define FIB_RULES_DECNET	AF_DECnet
+
 struct fib_rule_hdr {
 	__u8		family;
 	__u8		dst_len;
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index bf919b6..78c8598 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -118,7 +118,11 @@ static rtnl_doit_func rtnl_get_doit(int protocol, int msgindex)
 {
 	struct rtnl_link *tab;
 
-	tab = rtnl_msg_handlers[protocol];
+	if (protocol < NPROTO)
+		tab = rtnl_msg_handlers[protocol];
+	else
+		tab = NULL;
+
 	if (tab == NULL || tab[msgindex].doit == NULL)
 		tab = rtnl_msg_handlers[PF_UNSPEC];
 
@@ -129,7 +133,11 @@ static rtnl_dumpit_func rtnl_get_dumpit(int protocol, int msgindex)
 {
 	struct rtnl_link *tab;
 
-	tab = rtnl_msg_handlers[protocol];
+	if (protocol < NPROTO)
+		tab = rtnl_msg_handlers[protocol];
+	else
+		tab = NULL;
+
 	if (tab == NULL || tab[msgindex].dumpit == NULL)
 		tab = rtnl_msg_handlers[PF_UNSPEC];
 
@@ -1444,9 +1452,6 @@ static int rtnetlink_rcv_msg(struct sk_buff *skb, struct nlmsghdr *nlh)
 		return 0;
 
 	family = ((struct rtgenmsg *)NLMSG_DATA(nlh))->rtgen_family;
-	if (family >= NPROTO)
-		return -EAFNOSUPPORT;
-
 	sz_idx = type>>2;
 	kind = type&3;
 
diff --git a/net/decnet/dn_rules.c b/net/decnet/dn_rules.c
index 1c8cc6d..af28dcc 100644
--- a/net/decnet/dn_rules.c
+++ b/net/decnet/dn_rules.c
@@ -217,7 +217,7 @@ static void dn_fib_rule_flush_cache(struct fib_rules_ops *ops)
 }
 
 static struct fib_rules_ops dn_fib_rules_ops_template = {
-	.family		= AF_DECnet,
+	.family		= FIB_RULES_DECNET,
 	.rule_size	= sizeof(struct dn_fib_rule),
 	.addr_size	= sizeof(u16),
 	.action		= dn_fib_rule_action,
diff --git a/net/ipv4/fib_rules.c b/net/ipv4/fib_rules.c
index a18355e..3ec84fe 100644
--- a/net/ipv4/fib_rules.c
+++ b/net/ipv4/fib_rules.c
@@ -246,7 +246,7 @@ static void fib4_rule_flush_cache(struct fib_rules_ops *ops)
 }
 
 static struct fib_rules_ops fib4_rules_ops_template = {
-	.family		= AF_INET,
+	.family		= FIB_RULES_IPV4,
 	.rule_size	= sizeof(struct fib4_rule),
 	.addr_size	= sizeof(u32),
 	.action		= fib4_rule_action,
diff --git a/net/ipv6/fib6_rules.c b/net/ipv6/fib6_rules.c
index 92b2b7f..8124f16 100644
--- a/net/ipv6/fib6_rules.c
+++ b/net/ipv6/fib6_rules.c
@@ -238,7 +238,7 @@ static size_t fib6_rule_nlmsg_payload(struct fib_rule *rule)
 }
 
 static struct fib_rules_ops fib6_rules_ops_template = {
-	.family			= AF_INET6,
+	.family			= FIB_RULES_IPV6,
 	.rule_size		= sizeof(struct fib6_rule),
 	.addr_size		= sizeof(struct in6_addr),
 	.action			= fib6_rule_action,
-- 
1.7.0.4


^ permalink raw reply related

* [PATCH 2/9] net: fib_rules: set family in fib_rule_hdr centrally
From: Patrick McHardy @ 2010-04-13 15:03 UTC (permalink / raw)
  To: davem; +Cc: netdev
In-Reply-To: <1271171003-11901-1-git-send-email-kaber@trash.net>

All fib_rules implementations need to set the family in their ->fill()
functions. Since the value is available to the generic fib_nl_fill_rule()
function, set it there.

Signed-off-by: Patrick McHardy <kaber@trash.net>
---
 net/core/fib_rules.c  |    1 +
 net/decnet/dn_rules.c |    1 -
 net/ipv4/fib_rules.c  |    1 -
 net/ipv6/fib6_rules.c |    1 -
 4 files changed, 1 insertions(+), 3 deletions(-)

diff --git a/net/core/fib_rules.c b/net/core/fib_rules.c
index 1eb3227..1bc6659 100644
--- a/net/core/fib_rules.c
+++ b/net/core/fib_rules.c
@@ -535,6 +535,7 @@ static int fib_nl_fill_rule(struct sk_buff *skb, struct fib_rule *rule,
 		return -EMSGSIZE;
 
 	frh = nlmsg_data(nlh);
+	frh->family = ops->family;
 	frh->table = rule->table;
 	NLA_PUT_U32(skb, FRA_TABLE, rule->table);
 	frh->res1 = 0;
diff --git a/net/decnet/dn_rules.c b/net/decnet/dn_rules.c
index 2d14093..1c8cc6d 100644
--- a/net/decnet/dn_rules.c
+++ b/net/decnet/dn_rules.c
@@ -196,7 +196,6 @@ static int dn_fib_rule_fill(struct fib_rule *rule, struct sk_buff *skb,
 {
 	struct dn_fib_rule *r = (struct dn_fib_rule *)rule;
 
-	frh->family = AF_DECnet;
 	frh->dst_len = r->dst_len;
 	frh->src_len = r->src_len;
 	frh->tos = 0;
diff --git a/net/ipv4/fib_rules.c b/net/ipv4/fib_rules.c
index 73b6784..a18355e 100644
--- a/net/ipv4/fib_rules.c
+++ b/net/ipv4/fib_rules.c
@@ -213,7 +213,6 @@ static int fib4_rule_fill(struct fib_rule *rule, struct sk_buff *skb,
 {
 	struct fib4_rule *rule4 = (struct fib4_rule *) rule;
 
-	frh->family = AF_INET;
 	frh->dst_len = rule4->dst_len;
 	frh->src_len = rule4->src_len;
 	frh->tos = rule4->tos;
diff --git a/net/ipv6/fib6_rules.c b/net/ipv6/fib6_rules.c
index 5e463c4..92b2b7f 100644
--- a/net/ipv6/fib6_rules.c
+++ b/net/ipv6/fib6_rules.c
@@ -208,7 +208,6 @@ static int fib6_rule_fill(struct fib_rule *rule, struct sk_buff *skb,
 {
 	struct fib6_rule *rule6 = (struct fib6_rule *) rule;
 
-	frh->family = AF_INET6;
 	frh->dst_len = rule6->dst.plen;
 	frh->src_len = rule6->src.plen;
 	frh->tos = rule6->tclass;
-- 
1.7.0.4


^ permalink raw reply related
