Netdev List


* Re: ipv4 regression in 2.6.31 ?
From: Eric Dumazet @ 2009-09-14 16:10 UTC
  To: Stephan von Krawczynski
  Cc: Stephen Hemminger, linux-kernel, davem, Linux Netdev List
In-Reply-To: <20090914175505.a3f132ee.skraw@ithnet.com>

Stephan von Krawczynski wrote:
> On Mon, 14 Sep 2009 15:57:03 +0200
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
>> Stephan von Krawczynski wrote:
>>> Hello all,
>>>
>>> today we experienced some sort of regression in 2.6.31 ipv4 implementation, or
>>> at least some incompatibility with former 2.6.30.X kernels.
>>>
>>> We have the following situation:
>>>
>>>                                        ---------- vlan1@eth0 192.168.2.1/24
>>>                                       /
>>> host A 192.168.1.1/24 eth0  -------<router>            host B
>>>                                       \
>>>                                        ---------- eth1 192.168.3.1/24
>>>
>>>
>>> Now, if you route 192.168.1.0/24 via interface vlan1@eth0 on host B and let
>>> host A ping 192.168.2.1 everything works. But if you route 192.168.1.0/24 via
>>> interface eth1 on host B and let host A ping 192.168.2.1 you get no reply.
>>> With tcpdump we see the icmp packets arrive at vlan1@eth0, but no icmp echo
>>> reply is generated on either vlan1 or eth1.
>>> Kernels 2.6.30.X and below do not show this behaviour.
>>> Is this intended? Do we need to reconfigure something to restore the old
>>> behaviour?
>>>
>> Asymmetric routing?
>>
>> Check your rp_filter settings
>>
>> grep . `find /proc/sys/net -name rp_filter`
>>
>> rp_filter - INTEGER
>>         0 - No source validation.
>>         1 - Strict mode as defined in RFC3704 Strict Reverse Path
>>             Each incoming packet is tested against the FIB and if the interface
>>             is not the best reverse path the packet check will fail.
>>             By default failed packets are discarded.
>>         2 - Loose mode as defined in RFC3704 Loose Reverse Path
>>             Each incoming packet's source address is also tested against the FIB
>>             and if the source address is not reachable via any interface
>>             the packet check will fail.
>>
>>         Current recommended practice in RFC3704 is to enable strict mode
>>         to prevent IP spoofing from DDos attacks. If using asymmetric routing
>>         or other complicated routing, then loose mode is recommended.
>>
>>         conf/all/rp_filter must also be set to non-zero to do source validation
>>         on the interface
>>
>>         Default value is 0. Note that some distributions enable it
>>         in startup scripts.
> 
> Ok, here you can see 2.6.31 values from the discussed box:
> (remember, no ping reply in this setup)
> 
> /proc/sys/net/ipv4/conf/all/rp_filter:1
> /proc/sys/net/ipv4/conf/default/rp_filter:0
> /proc/sys/net/ipv4/conf/lo/rp_filter:0
> /proc/sys/net/ipv4/conf/eth2/rp_filter:0
> /proc/sys/net/ipv4/conf/eth0/rp_filter:0
> /proc/sys/net/ipv4/conf/eth1/rp_filter:0
> /proc/sys/net/ipv4/conf/vlan1/rp_filter:0
> 
> 
> And these are from the same box with 2.6.30.5:
> (ping reply works)
> 
> /proc/sys/net/ipv4/conf/all/rp_filter:1
> /proc/sys/net/ipv4/conf/default/rp_filter:0
> /proc/sys/net/ipv4/conf/lo/rp_filter:0
> /proc/sys/net/ipv4/conf/eth2/rp_filter:0
> /proc/sys/net/ipv4/conf/eth0/rp_filter:0
> /proc/sys/net/ipv4/conf/eth1/rp_filter:0
> /proc/sys/net/ipv4/conf/vlan1/rp_filter:0
> 
> As you can see they're all the same. Does this mean that rp_filter never
> really worked as intended before 2.6.31 ? Or does it mean that rp_filter=0
> (eth1 and vlan1) gets overridden by all/rp_filter=1 in 2.6.31 and not before?
>

Yes, previous kernels ignored the /proc/sys/net/ipv4/conf/all/rp_filter value; that was a bug.

commit 27fed4175acf81ddd91d9a4ee2fd298981f60295
Author: Stephen Hemminger <shemminger@vyatta.com>
Date:   Mon Jul 27 18:39:45 2009 -0700

    ip: fix logic of reverse path filter sysctl

    Even though reverse path filter was changed from simple boolean to
    trinary control, the loose mode only works if both all and device are
    configured because of this logic error.

    Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>


In your case, you *need*
echo 0 >/proc/sys/net/ipv4/conf/all/rp_filter
or
echo 2 >/proc/sys/net/ipv4/conf/all/rp_filter
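
To make the effect of the quoted commit concrete, here is a minimal userspace sketch (not kernel code) of how the "all" and per-interface values could combine. It assumes that before the fix both values had to be non-zero for filtering to happen, and that after the fix the larger of the two is used; both function names are made up for illustration.

```c
/* Hypothetical helper: behaviour before the fix, assuming a logical AND
 * of conf/all/rp_filter and the per-interface value. */
int rp_filter_effective_old(int all, int dev)
{
	return (all && dev) ? 1 : 0;
}

/* Hypothetical helper: behaviour after the fix, assuming the larger of
 * the two values wins (0 = off, 1 = strict, 2 = loose). */
int rp_filter_effective_new(int all, int dev)
{
	return all > dev ? all : dev;
}
```

With the reported settings (all=1, vlan1=0), the first helper yields 0 and the second yields 1, which would match replies working on 2.6.30.5 and being dropped on 2.6.31. Writing 0 to conf/all/rp_filter gives an effective 0 as long as the interface value stays 0, and writing 2 gives loose mode.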


* Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
From: Gregory Haskins @ 2009-09-14 16:08 UTC
  To: Michael S. Tsirkin
  Cc: Ira W. Snyder, netdev, virtualization, kvm, linux-kernel, mingo,
	linux-mm, akpm, hpa, Rusty Russell, s.hetze
In-Reply-To: <20090913120140.GA31218@redhat.com>

Michael S. Tsirkin wrote:
> On Fri, Sep 11, 2009 at 12:00:21PM -0400, Gregory Haskins wrote:
>> FWIW: VBUS handles this situation via the "memctx" abstraction.  IOW,
>> the memory is not assumed to be a userspace address.  Rather, it is a
>> memctx-specific address, which can be userspace, or any other type
>> (including hardware, dma-engine, etc).  As long as the memctx knows how
>> to translate it, it will work.
> 
> How would permissions be handled?

Same as anything else, really.  Read on for details.

> it's easy to allow an app to pass in virtual addresses in its own address space.

Agreed, and this is what I do.

The guest always passes its own physical addresses (using things like
__pa() in Linux).  The address passed is memctx-specific, but generally
would fall into the category of "virtual addresses" from the host's
perspective.

For a KVM/AlacrityVM guest example, the addresses are GPAs, accessed
internally to the context via a gfn_to_hva conversion (you can see this
occurring in the citation links I sent).

For Ira's example, the addresses would represent a physical address on
the PCI boards, and would follow any kind of relevant rules for
converting a "GPA" to a host accessible address (even if indirectly, via
a dma controller).

>  But we can't let the guest specify physical addresses.

Agreed.  Neither your proposal nor mine operate this way afaict.

HTH

Kind Regards,
-Greg

* Re: ipv4 regression in 2.6.31 ?
From: Stephan von Krawczynski @ 2009-09-14 15:55 UTC
  To: Eric Dumazet; +Cc: linux-kernel, davem, Linux Netdev List
In-Reply-To: <4AAE4BAF.2010406@gmail.com>

On Mon, 14 Sep 2009 15:57:03 +0200
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> Stephan von Krawczynski wrote:
> > Hello all,
> > 
> > today we experienced some sort of regression in 2.6.31 ipv4 implementation, or
> > at least some incompatibility with former 2.6.30.X kernels.
> > 
> > We have the following situation:
> > 
> >                                        ---------- vlan1@eth0 192.168.2.1/24
> >                                       /
> > host A 192.168.1.1/24 eth0  -------<router>            host B
> >                                       \
> >                                        ---------- eth1 192.168.3.1/24
> > 
> > 
> > Now, if you route 192.168.1.0/24 via interface vlan1@eth0 on host B and let
> > host A ping 192.168.2.1 everything works. But if you route 192.168.1.0/24 via
> > interface eth1 on host B and let host A ping 192.168.2.1 you get no reply.
> > With tcpdump we see the icmp packets arrive at vlan1@eth0, but no icmp echo
> > reply is generated on either vlan1 or eth1.
> > Kernels 2.6.30.X and below do not show this behaviour.
> > Is this intended? Do we need to reconfigure something to restore the old
> > behaviour?
> > 
> 
> Asymmetric routing?
> 
> Check your rp_filter settings
> 
> grep . `find /proc/sys/net -name rp_filter`
> 
> rp_filter - INTEGER
>         0 - No source validation.
>         1 - Strict mode as defined in RFC3704 Strict Reverse Path
>             Each incoming packet is tested against the FIB and if the interface
>             is not the best reverse path the packet check will fail.
>             By default failed packets are discarded.
>         2 - Loose mode as defined in RFC3704 Loose Reverse Path
>             Each incoming packet's source address is also tested against the FIB
>             and if the source address is not reachable via any interface
>             the packet check will fail.
> 
>         Current recommended practice in RFC3704 is to enable strict mode
>         to prevent IP spoofing from DDos attacks. If using asymmetric routing
>         or other complicated routing, then loose mode is recommended.
> 
>         conf/all/rp_filter must also be set to non-zero to do source validation
>         on the interface
> 
>         Default value is 0. Note that some distributions enable it
>         in startup scripts.

Ok, here you can see 2.6.31 values from the discussed box:
(remember, no ping reply in this setup)

/proc/sys/net/ipv4/conf/all/rp_filter:1
/proc/sys/net/ipv4/conf/default/rp_filter:0
/proc/sys/net/ipv4/conf/lo/rp_filter:0
/proc/sys/net/ipv4/conf/eth2/rp_filter:0
/proc/sys/net/ipv4/conf/eth0/rp_filter:0
/proc/sys/net/ipv4/conf/eth1/rp_filter:0
/proc/sys/net/ipv4/conf/vlan1/rp_filter:0


And these are from the same box with 2.6.30.5:
(ping reply works)

/proc/sys/net/ipv4/conf/all/rp_filter:1
/proc/sys/net/ipv4/conf/default/rp_filter:0
/proc/sys/net/ipv4/conf/lo/rp_filter:0
/proc/sys/net/ipv4/conf/eth2/rp_filter:0
/proc/sys/net/ipv4/conf/eth0/rp_filter:0
/proc/sys/net/ipv4/conf/eth1/rp_filter:0
/proc/sys/net/ipv4/conf/vlan1/rp_filter:0

As you can see they're all the same. Does this mean that rp_filter never
really worked as intended before 2.6.31 ? Or does it mean that rp_filter=0
(eth1 and vlan1) gets overridden by all/rp_filter=1 in 2.6.31 and not before?

--
Regards,
Stephan

* Re: [PATCH RFC] tun: export underlying socket
From: Michael S. Tsirkin @ 2009-09-14 15:40 UTC
  To: Or Gerlitz; +Cc: David Miller, netdev, herbert
In-Reply-To: <4AAE4DFC.9080500@voltaire.com>

On Mon, Sep 14, 2009 at 05:06:52PM +0300, Or Gerlitz wrote:
> Michael S. Tsirkin wrote:
>>> what would the use case with vhost look like?
>> - Configure bridge and tun using existing scripts
>> - pass tun fd to vhost via an ioctl
>> - vhost calls tun_get_socket
>> - from this point, guest networking just goes faster
>
> let me see if I am with you:
>
> 1. vhost gets from user space through ioctl packet socket fd OR tun fd -  
> but never both

Right

> 2. for packet socket fd
> VM.TX is translated by vhost to sendmsg which goes through the NIC
> NIC RX  makes the fd poll to signal and then recvmsg is called on the  
> fd, then vhost places the packet in a virtq
>
> 3. for tun fd
> VM.TX is translated by vhost to sendmsg which is translated by tun to  
> netif_rx which is then handled by the bridge
> NIC RX  goes to the bridge which xmits the packet to a tun interface, now
> what makes tun provide this packet to vhost and how is it done?

Same as above. vhost polls tun and calls recvmsg on the socket.
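
As a rough userspace analogy only (vhost does this inside the kernel, and none of this is vhost code), the poll-then-recvmsg pattern described above can be sketched as follows. The function name `wait_and_receive` is made up for illustration.

```c
#include <poll.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Wait until fd is readable, then fetch one datagram with recvmsg().
 * Returns the byte count, 0 on timeout, or -1 on error. */
ssize_t wait_and_receive(int fd, void *buf, size_t len, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	struct iovec iov = { .iov_base = buf, .iov_len = len };
	struct msghdr msg;
	int ready;

	ready = poll(&pfd, 1, timeout_ms);
	if (ready <= 0)
		return ready;	/* 0: timed out, -1: poll() failed */
	if (!(pfd.revents & POLLIN))
		return -1;	/* error or hang-up without data */

	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	return recvmsg(fd, &msg, 0);
}
```

The same call works for any descriptor that supports poll() and recvmsg(), which is the property the thread relies on when treating a packet socket and the socket behind tun alike.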

>
>> A lot of people have asked for tun support in vhost, because qemu
>> currently uses tun.  With this scheme existing code and scripts can
>> be used to configure both tun and bridge.  You also can utilize
>> virtualization-specific features in tun.

( broken too-long lines up. please do not merge them. )

> Tun has code to support some virtualization-specific features, however,  
> it has also some inherent problems, I think, for example, you don't know  
> over which NIC eventually a packet will be sent and as such, the feature  
> advertising to the guest (virtio-net) NIC is problematic,
> for example,  
> TSO. With vhost, since you are directly attached to a NIC and assuming  
> it's a PF or VF NIC and not something like macvlan/veth you can actually  
> know what features are supported by this NIC.
>
> Or.

Herbert addressed the TSO example.

Generally, feature negotiation does become more complicated in bridged
configurations, but some users require bridging. So with vhost, feature
negotiation is mostly done in userspace (e.g. vhost does not expose a
TSO capability, devices do this already); vhost itself only cares about
virtio features such as mergeable buffers.
Policy decisions, including whether to use packet socket or
tun+bridge, are up to the user.

-- 
MST

* Re: ipv4 regression in 2.6.31 ?
From: Eric Dumazet @ 2009-09-14 15:21 UTC
  To: Stephan von Krawczynski; +Cc: linux-kernel, davem, Linux Netdev List
In-Reply-To: <20090914171001.47371b3d.skraw@ithnet.com>

Stephan von Krawczynski wrote:
> On Mon, 14 Sep 2009 15:57:03 +0200
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
>> Stephan von Krawczynski wrote:
>>> Hello all,
>>>
>>> today we experienced some sort of regression in 2.6.31 ipv4 implementation, or
>>> at least some incompatibility with former 2.6.30.X kernels.
>>>
>>> We have the following situation:
>>>
>>>                                        ---------- vlan1@eth0 192.168.2.1/24
>>>                                       /
>>> host A 192.168.1.1/24 eth0  -------<router>            host B
>>>                                       \
>>>                                        ---------- eth1 192.168.3.1/24
>>>
>>>
>>> Now, if you route 192.168.1.0/24 via interface vlan1@eth0 on host B and let
>>> host A ping 192.168.2.1 everything works. But if you route 192.168.1.0/24 via
>>> interface eth1 on host B and let host A ping 192.168.2.1 you get no reply.
>>> With tcpdump we see the icmp packets arrive at vlan1@eth0, but no icmp echo
>>> reply is generated on either vlan1 or eth1.
>>> Kernels 2.6.30.X and below do not show this behaviour.
>>> Is this intended? Do we need to reconfigure something to restore the old
>>> behaviour?
>>>
>> Asymmetric routing?
>>
>> Check your rp_filter settings
>>
>> grep . `find /proc/sys/net -name rp_filter`
>>
>> rp_filter - INTEGER
>>         0 - No source validation.
>>         1 - Strict mode as defined in RFC3704 Strict Reverse Path
>>             Each incoming packet is tested against the FIB and if the interface
>>             is not the best reverse path the packet check will fail.
>>             By default failed packets are discarded.
>>         2 - Loose mode as defined in RFC3704 Loose Reverse Path
>>             Each incoming packet's source address is also tested against the FIB
>>             and if the source address is not reachable via any interface
>>             the packet check will fail.
>>
>>         Current recommended practice in RFC3704 is to enable strict mode
>>         to prevent IP spoofing from DDos attacks. If using asymmetric routing
>>         or other complicated routing, then loose mode is recommended.
>>
>>         conf/all/rp_filter must also be set to non-zero to do source validation
>>         on the interface
>>
>>         Default value is 0. Note that some distributions enable it
>>         in startup scripts.
> 
> Problem is this:
> Kernel 2.6.30.X and below work flawlessly in this setup, only kernel 2.6.31
> acts differently. Is this an intended change in policy?
> 

Here, it depends only on the rp_filter settings, whether the kernel is
2.6.30 or 2.6.31.

Please send your settings for all hosts involved so we can investigate further.

* Re: ipv4 regression in 2.6.31 ?
From: Stephan von Krawczynski @ 2009-09-14 15:10 UTC
  To: Eric Dumazet; +Cc: linux-kernel, davem, Linux Netdev List
In-Reply-To: <4AAE4BAF.2010406@gmail.com>

On Mon, 14 Sep 2009 15:57:03 +0200
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> Stephan von Krawczynski wrote:
> > Hello all,
> > 
> > today we experienced some sort of regression in 2.6.31 ipv4 implementation, or
> > at least some incompatibility with former 2.6.30.X kernels.
> > 
> > We have the following situation:
> > 
> >                                        ---------- vlan1@eth0 192.168.2.1/24
> >                                       /
> > host A 192.168.1.1/24 eth0  -------<router>            host B
> >                                       \
> >                                        ---------- eth1 192.168.3.1/24
> > 
> > 
> > Now, if you route 192.168.1.0/24 via interface vlan1@eth0 on host B and let
> > host A ping 192.168.2.1 everything works. But if you route 192.168.1.0/24 via
> > interface eth1 on host B and let host A ping 192.168.2.1 you get no reply.
> > With tcpdump we see the icmp packets arrive at vlan1@eth0, but no icmp echo
> > reply is generated on either vlan1 or eth1.
> > Kernels 2.6.30.X and below do not show this behaviour.
> > Is this intended? Do we need to reconfigure something to restore the old
> > behaviour?
> > 
> 
> Asymmetric routing?
> 
> Check your rp_filter settings
> 
> grep . `find /proc/sys/net -name rp_filter`
> 
> rp_filter - INTEGER
>         0 - No source validation.
>         1 - Strict mode as defined in RFC3704 Strict Reverse Path
>             Each incoming packet is tested against the FIB and if the interface
>             is not the best reverse path the packet check will fail.
>             By default failed packets are discarded.
>         2 - Loose mode as defined in RFC3704 Loose Reverse Path
>             Each incoming packet's source address is also tested against the FIB
>             and if the source address is not reachable via any interface
>             the packet check will fail.
> 
>         Current recommended practice in RFC3704 is to enable strict mode
>         to prevent IP spoofing from DDos attacks. If using asymmetric routing
>         or other complicated routing, then loose mode is recommended.
> 
>         conf/all/rp_filter must also be set to non-zero to do source validation
>         on the interface
> 
>         Default value is 0. Note that some distributions enable it
>         in startup scripts.

Problem is this:
Kernel 2.6.30.X and below work flawlessly in this setup, only kernel 2.6.31
acts differently. Is this an intended change in policy?

-- 
Regards,
Stephan

* Re: [PATCH RFC] tun: export underlying socket
From: Herbert Xu @ 2009-09-14 15:03 UTC
  To: Or Gerlitz; +Cc: Michael S. Tsirkin, David Miller, netdev
In-Reply-To: <4AAE4DFC.9080500@voltaire.com>

On Mon, Sep 14, 2009 at 05:06:52PM +0300, Or Gerlitz wrote:
>
>> A lot of people have asked for tun support in vhost, because qemu currently uses tun.  With this scheme existing code and scripts can be used to configure both tun and bridge.  You also can utilize virtualization-specific features in tun.
> Tun has code to support some virtualization-specific features, however,  
> it has also some inherent problems, I think, for example, you don't know  
> over which NIC eventually a packet will be sent and as such, the feature  
> advertising to the guest (virtio-net) NIC is problematic, for example,  
> TSO. With vhost, since you are directly attached to a NIC and assuming  
> it's a PF or VF NIC and not something like macvlan/veth you can actually  
> know what features are supported by this NIC.

TSO is not a problem because we provide a software fallback when
the hardware does not support it.  So guests should always enable
TSO if they support it and not worry about the physical NIC.
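
As a conceptual illustration of what a software fallback has to do, the sketch below splits one large payload into MSS-sized pieces. It is a userspace toy, not the kernel's segmentation code (which also has to rebuild headers and checksums for each segment), and the function name `segment_lengths` is made up.

```c
#include <stddef.h>

/* Split payload_len bytes into chunks of at most mss bytes.
 * Writes each chunk length to out[] (up to max_out entries) and
 * returns the number of chunks written. */
size_t segment_lengths(size_t payload_len, size_t mss,
		       size_t *out, size_t max_out)
{
	size_t n = 0;

	if (mss == 0)
		return 0;	/* nothing sensible to do without an MSS */

	while (payload_len > 0 && n < max_out) {
		size_t chunk = payload_len < mss ? payload_len : mss;

		out[n++] = chunk;
		payload_len -= chunk;
	}
	return n;
}
```

For a 4000-byte payload and an MSS of 1460, this produces three segments of 1460, 1460 and 1080 bytes, so the sender can hand over one large buffer whether or not the NIC can segment it in hardware.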

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

* RE: [PATCH 29/29] ioat2, 3: cacheline align software descriptor allocations
From: Sosnowski, Maciej @ 2009-09-14 15:02 UTC
  To: Williams, Dan J
  Cc: linux-kernel@vger.kernel.org, linux-raid@vger.kernel.org,
	netdev@vger.kernel.org
In-Reply-To: <20090904023257.32667.53926.stgit@dwillia2-linux.ch.intel.com>

Williams, Dan J wrote:
> All the necessary fields for handling an ioat2,3 ring entry can fit into
> one cacheline.  Move ->len prior to ->txd in struct ioat_ring_ent, and
> move allocation of these entries to a hw-cache-aligned kmem cache to
> reduce the number of cachelines dirtied for descriptor management.
> 
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---

Signed-off-by: Maciej Sosnowski <maciej.sosnowski@intel.com>

* RE: [PATCH 28/29] dmaengine: kill tx_list
From: Sosnowski, Maciej @ 2009-09-14 15:01 UTC
  To: Williams, Dan J
  Cc: linux-kernel@vger.kernel.org, linux-raid@vger.kernel.org,
	netdev@vger.kernel.org
In-Reply-To: <20090904023252.32667.51136.stgit@dwillia2-linux.ch.intel.com>

Williams, Dan J wrote:
> The tx_list attribute of struct dma_async_tx_descriptor is common to
> most, but not all dma driver implementations.  None of the upper level
> code (dmaengine/async_tx) uses it, so allow drivers to implement it
> locally if they need it.  This saves sizeof(struct list_head) bytes for
> drivers that do not manage descriptors with a linked list (e.g.: ioatdma
> v2,3).
> 
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---

Signed-off-by: Maciej Sosnowski <maciej.sosnowski@intel.com>

* RE: [PATCH 26/29] ioat: implement a private tx_list
From: Sosnowski, Maciej @ 2009-09-14 15:01 UTC
  To: Williams, Dan J
  Cc: linux-kernel@vger.kernel.org, linux-raid@vger.kernel.org,
	netdev@vger.kernel.org
In-Reply-To: <20090904023242.32667.27473.stgit@dwillia2-linux.ch.intel.com>

Williams, Dan J wrote:
> Drop ioatdma's use of tx_list from struct dma_async_tx_descriptor in
> preparation for removal of this field.
> 
> Cc: Maciej Sosnowski <maciej.sosnowski@intel.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---

Signed-off-by: Maciej Sosnowski <maciej.sosnowski@intel.com>

* RE: [PATCH 22/29] net_dma: poll for a descriptor after allocation failure
From: Sosnowski, Maciej @ 2009-09-14 15:00 UTC
  To: Williams, Dan J
  Cc: linux-kernel@vger.kernel.org, linux-raid@vger.kernel.org,
	netdev@vger.kernel.org
In-Reply-To: <20090904023221.32667.70000.stgit@dwillia2-linux.ch.intel.com>

Williams, Dan J wrote:
> Handle descriptor allocation failures by polling for a descriptor.  The
> driver will force forward progress when polled.  In the best case this
> polling interval will be the time it takes for one dma memcpy
> transaction to complete.  In the worst case, channel hang, we will need
> to wait 100ms for the cleanup watchdog to fire (ioatdma driver).
> 
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---

Signed-off-by: Maciej Sosnowski <maciej.sosnowski@intel.com>

* RE: [PATCH 21/29] ioat2,3: dynamically resize descriptor ring
From: Sosnowski, Maciej @ 2009-09-14 15:00 UTC
  To: Williams, Dan J
  Cc: linux-kernel@vger.kernel.org, linux-raid@vger.kernel.org,
	netdev@vger.kernel.org
In-Reply-To: <20090904023216.32667.55942.stgit@dwillia2-linux.ch.intel.com>

Williams, Dan J wrote:
> Increment the allocation order of the descriptor ring every time we run
> out of descriptors up to a maximum of allocation order specified by the
> module parameter 'ioat_max_alloc_order'.  After each idle period
> decrement the allocation order to a minimum order of
> 'ioat_ring_alloc_order' (i.e. the default ring size, tunable as a module
> parameter).
> 
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---

Signed-off-by: Maciej Sosnowski <maciej.sosnowski@intel.com>

Just one thing:

> +static int ioat_ring_max_alloc_order = IOAT_MAX_ORDER;
> +module_param(ioat_ring_max_alloc_order, int, 0644);
> +MODULE_PARM_DESC(ioat_ring_max_alloc_order,
> +		 "ioat2+: upper limit for dynamic ring resizing (default: n=16)");
[...]
> --- a/drivers/dma/ioat/dma_v2.h
> +++ b/drivers/dma/ioat/dma_v2.h
> @@ -37,6 +37,8 @@ extern int ioat_pending_level;
>  #define IOAT_MAX_ORDER 16
>  #define ioat_get_alloc_order() \
>  	(min(ioat_ring_alloc_order, IOAT_MAX_ORDER))
> +#define ioat_get_max_alloc_order() \
> +	(min(ioat_ring_max_alloc_order, IOAT_MAX_ORDER))

Making max_alloc_order a module parameter gives the impression
that a user can modify it, including making it larger than the default.
However, the default is also its maximum value, which may be confusing.
Why not use the parameter only as the upper limit?
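
The behaviour in question can be seen in a small userspace sketch of the clamp in the quoted macro. Only the IOAT_MAX_ORDER constant and the min() come from the patch; the function name `clamp_max_alloc_order` is made up for illustration.

```c
#define IOAT_MAX_ORDER 16

/* Mirrors the min() in the quoted ioat_get_max_alloc_order() macro:
 * any requested value above IOAT_MAX_ORDER is silently reduced to it. */
int clamp_max_alloc_order(int requested)
{
	return requested < IOAT_MAX_ORDER ? requested : IOAT_MAX_ORDER;
}
```

Under this sketch, a user who sets the parameter to 20 still gets 16, while 12 is honoured, so the parameter can only lower the limit.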

Thanks,
Maciej

* RE: [PATCH 20/29] ioat: switch watchdog and reset handler from workqueue to timer
From: Sosnowski, Maciej @ 2009-09-14 14:59 UTC
  To: Williams, Dan J
  Cc: linux-kernel@vger.kernel.org, linux-raid@vger.kernel.org,
	netdev@vger.kernel.org
In-Reply-To: <20090904023211.32667.37259.stgit@dwillia2-linux.ch.intel.com>

Williams, Dan J wrote:
> In order to support dynamic resizing of the descriptor ring or polling
> for a descriptor in the presence of a hung channel the reset handler
> needs to make progress while in a non-preemptible context.  The current
> workqueue implementation precludes polling channel reset completion
> under spin_lock().
> 
> This conversion also allows us to return to opportunistic cleanup in the
> ioat2 case as the timer implementation guarantees at least one cleanup
> after every descriptor is submitted.  This means the worst case
> completion latency becomes the timer frequency (for exceptional
> circumstances), but with the benefit of avoiding busy waiting when the
> lock is contended.
> 
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---

[...]
> --- a/drivers/dma/ioat/dma_v2.c
> +++ b/drivers/dma/ioat/dma_v2.c
> @@ -49,7 +49,7 @@ static void __ioat2_issue_pending(struct ioat2_dma_chan *ioat)
>  	void * __iomem reg_base = ioat->base.reg_base;
> 
>  	ioat->pending = 0;
> -	ioat->dmacount += ioat2_ring_pending(ioat);
> +	ioat->dmacount += ioat2_ring_pending(ioat);;
double semicolon

Signed-off-by: Maciej Sosnowski <maciej.sosnowski@intel.com>

* RE: [PATCH 19/29] ioat1: trim ioat_dma_desc_sw
From: Sosnowski, Maciej @ 2009-09-14 14:55 UTC
  To: Williams, Dan J
  Cc: linux-kernel@vger.kernel.org, linux-raid@vger.kernel.org,
	netdev@vger.kernel.org
In-Reply-To: <20090904023206.32667.35974.stgit@dwillia2-linux.ch.intel.com>

Williams, Dan J wrote:
> Save 4 bytes per software descriptor by transmitting tx_cnt in an unused
> portion of the hardware descriptor.
> 
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---

Signed-off-by: Maciej Sosnowski <maciej.sosnowski@intel.com>

* Re: [PATCH 4/4] bonding: add sysfs files to display tlb and alb hash table contents
From: Andy Gospodarek @ 2009-09-14 14:45 UTC
  To: Jay Vosburgh; +Cc: Andy Gospodarek, netdev, bonding-devel
In-Reply-To: <26430.1252705697@death.nxdomain.ibm.com>

On Fri, Sep 11, 2009 at 02:48:17PM -0700, Jay Vosburgh wrote:
> Andy Gospodarek <andy@greyhouse.net> wrote:
> 
> >bonding: add sysfs files to display tlb and alb hash table contents
> 
> 	Isn't it considered bad form to have sysfs files that kick out
> large amounts of data like this?  Not that I think this is a bad
> facility to have, just checking on the mechanism.
> 

I'm not aware of such a restriction -- though I'm sure at least one
person out there doesn't like it.

If that's the case, there are certainly a few files that should be
cleaned up:

# find -type f -exec wc -l {} 2> /dev/null \; | sort -r -n | head -10
1657 ./firmware/acpi/tables/SSDT
132 ./firmware/acpi/tables/dynamic/SSDT2
128 ./devices/pci0000:00/0000:00:1c.5/0000:3f:00.0/vpd
27 ./devices/system/node/node0/meminfo
24 ./devices/pnp0/00:08/options
24 ./devices/pnp0/00:07/options
12 ./devices/pci0000:00/0000:00:1e.0/resource
12 ./devices/pci0000:00/0000:00:1c.5/resource
12 ./devices/pci0000:00/0000:00:1c.4/resource
12 ./devices/pci0000:00/0000:00:1c.0/resource


> >While debugging some problems with alb (mode 6) bonding I realized that
> >being able to output the contents of both hash tables would be helpful.
> >This is what the output looks like for the two files:
> >
> >device  load
> >eth1    491
> >eth2    491
> >hash device   last device   tx bytes       load        next previous
> >2    eth1     eth1          2254           491         0    0
> >3    eth2     eth2          2744           491         0    0
> >6             eth2          0              488         0    0
> >8             eth2          0              461698      0    0
> >1b            eth2          0              249         0    0
> >eb            eth2          0              21          0    0
> >ff            eth2          0              22          0    0
> >
> >hash ip_src          ip_dst          mac_dst           slave assign ntt
> >2    10.0.3.2        10.0.3.11       00:e0:81:71:ee:a9 eth1  1      0
> >3    10.0.3.2        10.0.3.10       00:e0:81:71:ee:a9 eth2  1      0
> >8    10.0.3.2        10.0.3.1        00:e0:81:71:ee:a9 eth2  1      0
> >
> >These were a great help debugging the fixes I have just posted and they
> >might be helpful for others, so I decided to include them in my
> >patchset.
> >
> >Signed-off-by: Andy Gospodarek <andy@greyhouse.net>
> >
> >---
> > drivers/net/bonding/bond_alb.c   |   61 ++++++++++++++++++++++++++++++++++++++
> > drivers/net/bonding/bond_alb.h   |    2 +
> > drivers/net/bonding/bond_sysfs.c |   40 +++++++++++++++++++++++++
> > 3 files changed, 103 insertions(+), 0 deletions(-)
> >
> >diff --git a/drivers/net/bonding/bond_alb.c b/drivers/net/bonding/bond_alb.c
> >index 7db8835..4e930e3 100644
> >--- a/drivers/net/bonding/bond_alb.c
> >+++ b/drivers/net/bonding/bond_alb.c
> >@@ -778,6 +778,67 @@ static struct slave *rlb_arp_xmit(struct sk_buff *skb, struct bonding *bond)
> > 	return tx_slave;
> > }
> >
> >+int rlb_print_rx_hashtbl(struct bonding *bond, char *buf)
> >+{
> >+	struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond));
> >+	struct rlb_client_info *client_info;
> >+	u32 hash_index;
> >+	u32 count = 0;
> >+	
> >+	_lock_rx_hashtbl(bond);
> >+
> >+	count = sprintf(buf, "hash ip_src          ip_dst          mac_dst           slave assign ntt\n");
> >+	hash_index = bond_info->rx_hashtbl_head;
> >+	for (; hash_index != RLB_NULL_INDEX; hash_index = client_info->next) {
> >+		client_info = &(bond_info->rx_hashtbl[hash_index]);
> >+		count += sprintf(buf + count,"%-4x %-15pi4 %-15pi4 %pM %-5s %-6d %d\n",
> >+				 hash_index,
> >+				 &client_info->ip_src,
> >+				 &client_info->ip_dst,
> >+				 client_info->mac_dst,
> >+				 client_info->slave->dev->name,
> >+				 client_info->assigned,
> >+				 client_info->ntt);
> >+	}
> >+
> >+	_unlock_rx_hashtbl(bond);
> >+	return count;
> >+}
> >+
> >+int tlb_print_tx_hashtbl(struct bonding *bond, char *buf)
> >+{
> >+	struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond));
> >+	u32 hash_index;
> >+	u32 count = 0;
> >+	struct slave *slave;
> >+	int i;
> >+	
> >+	_lock_tx_hashtbl(bond);
> >+
> >+	count += sprintf(buf, "device  load\n");
> >+	bond_for_each_slave(bond, slave, i) {
> >+		struct tlb_slave_info *slave_info = &(SLAVE_TLB_INFO(slave));
> >+		count += sprintf(buf + count,"%-7s %d\n",slave->dev->name,slave_info->load);
> >+	}
> >+	count += sprintf(buf + count, "hash device   last device   tx bytes       load        next previous\n");
> >+	for (hash_index = 0; hash_index < TLB_HASH_TABLE_SIZE; hash_index++) {
> >+		struct tlb_client_info *client_info = &(bond_info->tx_hashtbl[hash_index]);
> >+		if (client_info->tx_slave || client_info->last_slave) {
> >+			count += sprintf(buf + count,"%-4x %-8s %-13s %-14d %-11d %-4x %d\n",
> >+					 hash_index,
> >+					 (client_info->tx_slave) ? client_info->tx_slave->dev->name : "",
> >+					 (client_info->last_slave) ? client_info->last_slave->dev->name : "",
> >+					 client_info->tx_bytes,
> >+					 client_info->load_history,
> >+					 (client_info->next != TLB_NULL_INDEX) ? client_info->next : 0,
> >+					 (client_info->prev != TLB_NULL_INDEX) ? client_info->prev : 0);
> >+		}
> >+	}
> >+
> >+	_unlock_tx_hashtbl(bond);
> >+	return count;
> >+}
> >+
> > /* Caller must hold rx_hashtbl lock */
> > static void rlb_init_table_entry(struct rlb_client_info *entry)
> > {
> >diff --git a/drivers/net/bonding/bond_alb.h b/drivers/net/bonding/bond_alb.h
> >index b65fd29..8543447 100644
> >--- a/drivers/net/bonding/bond_alb.h
> >+++ b/drivers/net/bonding/bond_alb.h
> >@@ -132,5 +132,7 @@ int bond_alb_xmit(struct sk_buff *skb, struct net_device *bond_dev);
> > void bond_alb_monitor(struct work_struct *);
> > int bond_alb_set_mac_address(struct net_device *bond_dev, void *addr);
> > void bond_alb_clear_vlan(struct bonding *bond, unsigned short vlan_id);
> >+int rlb_print_rx_hashtbl(struct bonding *bond, char *buf);
> >+int tlb_print_tx_hashtbl(struct bonding *bond, char *buf);
> > #endif /* __BOND_ALB_H__ */
> >
> >diff --git a/drivers/net/bonding/bond_sysfs.c b/drivers/net/bonding/bond_sysfs.c
> >index 55bf34f..1123e1f 100644
> >--- a/drivers/net/bonding/bond_sysfs.c
> >+++ b/drivers/net/bonding/bond_sysfs.c
> >@@ -1480,6 +1480,44 @@ static ssize_t bonding_show_ad_partner_mac(struct device *d,
> > static DEVICE_ATTR(ad_partner_mac, S_IRUGO, bonding_show_ad_partner_mac, NULL);
> >
> >
> >+/*
> >+ * Show current tlb/alb tx hash table.
> >+ */
> >+static ssize_t bonding_show_tlb_tx_hash(struct device *d,
> >+					   struct device_attribute *attr,
> >+					   char *buf)
> >+{
> >+	int count = 0;
> >+	struct bonding *bond = to_bond(d);
> >+
> >+	if (bond->params.mode == BOND_MODE_ALB ||
> >+	    bond->params.mode == BOND_MODE_TLB) {
> >+		count = tlb_print_tx_hashtbl(bond, buf);
> >+	}
> >+
> >+	return count;
> >+}
> >+static DEVICE_ATTR(tlb_tx_hash, S_IRUGO, bonding_show_tlb_tx_hash, NULL);
> 
> 	Should the mode here be S_IRUSR (0400, instead of 0444)?
> Otherwise, a nefarious user could "while 1 cat /sys/.../tlb_tx_hash" and
> keep the hash table lock fairly busy.  Since the lock is acquired for
> every packet on tx, that's probably a bad thing.
> 
> >+
> >+/*
> >+ * Show current alb rx hash table.
> >+ */
> >+static ssize_t bonding_show_alb_rx_hash(struct device *d,
> >+					   struct device_attribute *attr,
> >+					   char *buf)
> >+{
> >+	int count = 0;
> >+	struct bonding *bond = to_bond(d);
> >+
> >+	if (bond->params.mode == BOND_MODE_ALB) {
> >+		count = rlb_print_rx_hashtbl(bond, buf);
> >+	}
> >+
> >+	return count;
> >+}
> >+static DEVICE_ATTR(alb_rx_hash, S_IRUGO, bonding_show_alb_rx_hash, NULL);
> 
> 	Same comment as for the mode of the tlb_tx_hash, although the rx
> hash table lock is much more lightly used, so it might not be a real
> problem.
> 
> >
> > static struct attribute *per_bond_attrs[] = {
> > 	&dev_attr_slaves.attr,
> >@@ -1505,6 +1543,8 @@ static struct attribute *per_bond_attrs[] = {
> > 	&dev_attr_ad_actor_key.attr,
> > 	&dev_attr_ad_partner_key.attr,
> > 	&dev_attr_ad_partner_mac.attr,
> >+	&dev_attr_alb_rx_hash.attr,
> >+	&dev_attr_tlb_tx_hash.attr,
> > 	NULL,
> > };
> >
> >-- 
> >1.5.5.6
> >
> 
> 	-J
> 
> ---
> 	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com

^ permalink raw reply

* [PATCH] tcp: fix ssthresh u16 leftover
From: Ilpo Järvinen @ 2009-09-14 14:09 UTC (permalink / raw)
  To: David Miller; +Cc: Netdev

Once upon a time snd_ssthresh was a 16-bit quantity. That has not
been true for a long time. I ran across some ancient comparisons
which still seem to rely on that legacy. Put all that magic into a
single place; hopefully I found all of them.

Compile tested, though linking an allyesconfig build is ridiculous
nowadays, it seems.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>

---
  include/net/tcp.h        |    7 +++++++
  net/ipv4/tcp.c           |    2 +-
  net/ipv4/tcp_input.c     |    2 +-
  net/ipv4/tcp_ipv4.c      |    4 ++--
  net/ipv4/tcp_minisocks.c |    2 +-
  net/ipv6/tcp_ipv6.c      |    5 +++--
  6 files changed, 15 insertions(+), 7 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index b71a446..56b7602 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -793,6 +793,13 @@ static inline unsigned int tcp_packets_in_flight(const struct tcp_sock *tp)
  	return tp->packets_out - tcp_left_out(tp) + tp->retrans_out;
  }

+#define TCP_INFINITE_SSTHRESH	0x7fffffff
+
+static inline bool tcp_in_initial_slowstart(const struct tcp_sock *tp)
+{
+	return tp->snd_ssthresh >= TCP_INFINITE_SSTHRESH;
+}
+
  /* If cwnd > ssthresh, we may raise ssthresh to be half-way to cwnd.
   * The exception is rate halving phase, when cwnd is decreasing towards
   * ssthresh.
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index edeea06..19a0612 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2012,7 +2012,7 @@ int tcp_disconnect(struct sock *sk, int flags)
  	tp->snd_cwnd = 2;
  	icsk->icsk_probes_out = 0;
  	tp->packets_out = 0;
-	tp->snd_ssthresh = 0x7fffffff;
+	tp->snd_ssthresh = TCP_INFINITE_SSTHRESH;
  	tp->snd_cwnd_cnt = 0;
  	tp->bytes_acked = 0;
  	tcp_set_ca_state(sk, TCP_CA_Open);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index af6d6fa..d86784b 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -761,7 +761,7 @@ void tcp_update_metrics(struct sock *sk)
  			set_dst_metric_rtt(dst, RTAX_RTTVAR, var);
  		}

-		if (tp->snd_ssthresh >= 0xFFFF) {
+		if (tcp_in_initial_slowstart(tp)) {
  			/* Slow start still did not finish. */
  			if (dst_metric(dst, RTAX_SSTHRESH) &&
  			    !dst_metric_locked(dst, RTAX_SSTHRESH) &&
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 0543561..7cda24b 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1808,7 +1808,7 @@ static int tcp_v4_init_sock(struct sock *sk)
  	/* See draft-stevens-tcpca-spec-01 for discussion of the
  	 * initialization of these values.
  	 */
-	tp->snd_ssthresh = 0x7fffffff;	/* Infinity */
+	tp->snd_ssthresh = TCP_INFINITE_SSTHRESH;
  	tp->snd_cwnd_clamp = ~0;
  	tp->mss_cache = 536;

@@ -2284,7 +2284,7 @@ static void get_tcp4_sock(struct sock *sk, struct seq_file *f, int i, int *len)
  		jiffies_to_clock_t(icsk->icsk_ack.ato),
  		(icsk->icsk_ack.quick << 1) | icsk->icsk_ack.pingpong,
  		tp->snd_cwnd,
-		tp->snd_ssthresh >= 0xFFFF ? -1 : tp->snd_ssthresh,
+		tcp_in_initial_slowstart(tp) ? -1 : tp->snd_ssthresh,
  		len);
  }

diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index e48c37d..045bcfd 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -410,7 +410,7 @@ struct sock *tcp_create_openreq_child(struct sock *sk, struct request_sock *req,
  		newtp->retrans_out = 0;
  		newtp->sacked_out = 0;
  		newtp->fackets_out = 0;
-		newtp->snd_ssthresh = 0x7fffffff;
+		newtp->snd_ssthresh = TCP_INFINITE_SSTHRESH;

  		/* So many TCP implementations out there (incorrectly) count the
  		 * initial SYN frame in their delayed-ACK and congestion control
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 3aae0f2..6e3f0dc 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1846,7 +1846,7 @@ static int tcp_v6_init_sock(struct sock *sk)
  	/* See draft-stevens-tcpca-spec-01 for discussion of the
  	 * initialization of these values.
  	 */
-	tp->snd_ssthresh = 0x7fffffff;
+	tp->snd_ssthresh = TCP_INFINITE_SSTHRESH;
  	tp->snd_cwnd_clamp = ~0;
  	tp->mss_cache = 536;

@@ -1969,7 +1969,8 @@ static void get_tcp6_sock(struct seq_file *seq, struct sock *sp, int i)
  		   jiffies_to_clock_t(icsk->icsk_rto),
  		   jiffies_to_clock_t(icsk->icsk_ack.ato),
  		   (icsk->icsk_ack.quick << 1 ) | icsk->icsk_ack.pingpong,
-		   tp->snd_cwnd, tp->snd_ssthresh>=0xFFFF?-1:tp->snd_ssthresh
+		   tp->snd_cwnd,
+		   tcp_in_initial_slowstart(tp) ? -1 : tp->snd_ssthresh
  		   );
  }

-- 
tg: (13af7a6..) fix/ssthresh (depends on: origin/master)
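
For readers skimming the patch, the effect of the changed comparison can be
reproduced in userspace. This is only a minimal sketch: the function names
legacy_in_initial_slowstart() and in_initial_slowstart() are made up for
illustration, and plain uint32_t stands in for the snd_ssthresh field.

```c
#include <stdbool.h>
#include <stdint.h>

/* Value the patch uses to mean "no threshold has been set yet". */
#define TCP_INFINITE_SSTHRESH 0x7fffffff

/* Old comparison: anything from 0xFFFF upwards counts as "infinite". */
bool legacy_in_initial_slowstart(uint32_t snd_ssthresh)
{
	return snd_ssthresh >= 0xFFFF;
}

/* New comparison, mirroring tcp_in_initial_slowstart() from the patch. */
bool in_initial_slowstart(uint32_t snd_ssthresh)
{
	return snd_ssthresh >= TCP_INFINITE_SSTHRESH;
}
```

With a finite but large threshold such as 100000, the old comparison still
reports initial slow start while the new one does not; both agree for small
values and for TCP_INFINITE_SSTHRESH itself.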

^ permalink raw reply related

* Re: [PATCH 1/8] networking/fanotify: declare fanotify socket numbers
From: Evgeniy Polyakov @ 2009-09-14 14:07 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Eric Paris, David Miller, linux-kernel, linux-fsdevel, netdev,
	viro, alan, hch
In-Reply-To: <20090914001759.GB30621@shareable.org>

On Mon, Sep 14, 2009 at 01:17:59AM +0100, Jamie Lokier (jamie@shareable.org) wrote:
> > When queue is full or you do not have enough RAM. Both are reported at
> > 'sending' time.
> 
> Can you ->poll() and wait reliably until the queue will accept an skb?
> (A few spurious EAGAINs/ENOBUFs is ok, as long as it's not the norm).

It is not that simple, and for a memory allocation error you just can't.

There is no direct access to the remote peer sockets, i.e. the userspace
ones, so one would have to lock the netlink table, run over the listeners,
check whether they can accept the message, and wait/poll for the queue
size to become big enough. The netlink table and its locking are not
exported, so effectively there is no simple way to do this.

-- 
	Evgeniy Polyakov

^ permalink raw reply

* Re: [PATCH RFC] tun: export underlying socket
From: Or Gerlitz @ 2009-09-14 14:06 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: David Miller, netdev, herbert
In-Reply-To: <20090914101012.GA14176@redhat.com>

Michael S. Tsirkin wrote:
>> what would the use case with vhost look like?
> - Configure bridge and tun using existing scripts
> - pass tun fd to vhost via an ioctl
> - vhost calls tun_get_socket
> - from this point, guest networking just goes faster

Let me see if I am with you:

1. vhost gets either a packet socket fd or a tun fd from user space
through an ioctl, but never both.

2. For a packet socket fd:
VM.TX is translated by vhost to sendmsg, which goes through the NIC.
NIC RX makes the fd poll signal, then recvmsg is called on the fd and
vhost places the packet in a virtq.

3. For a tun fd:
VM.TX is translated by vhost to sendmsg, which tun translates to
netif_rx, which is then handled by the bridge.
NIC RX goes to the bridge, which xmits the packet to a tun interface.
Now what makes tun provide this packet to vhost, and how is it done?

> A lot of people have asked for tun support in vhost, because qemu
> currently uses tun.  With this scheme existing code and scripts can be
> used to configure both tun and bridge.  You also can utilize
> virtualization-specific features in tun.

Tun has code to support some virtualization-specific features. However,
I think it also has some inherent problems. For example, you don't know
over which NIC a packet will eventually be sent, so advertising features
such as TSO to the guest (virtio-net) NIC is problematic. With vhost,
since you are directly attached to a NIC, and assuming it's a PF or VF
NIC and not something like macvlan/veth, you can actually know what
features this NIC supports.

Or.

^ permalink raw reply

* Re: ipv4 regression in 2.6.31 ?
From: Eric Dumazet @ 2009-09-14 13:57 UTC (permalink / raw)
  To: Stephan von Krawczynski; +Cc: linux-kernel, davem, Linux Netdev List
In-Reply-To: <20090914150935.cc895a3c.skraw@ithnet.com>

Stephan von Krawczynski wrote:
> Hello all,
> 
> today we experienced some sort of regression in 2.6.31 ipv4 implementation, or
> at least some incompatibility with former 2.6.30.X kernels.
> 
> We have the following situation:
> 
>                                        ---------- vlan1@eth0 192.168.2.1/24
>                                       /
> host A 192.168.1.1/24 eth0  -------<router>            host B
>                                       \
>                                        ---------- eth1 192.168.3.1/24
> 
> 
> Now, if you route 192.168.1.0/24 via interface vlan1@eth0 on host B and let
> host A ping 192.168.2.1 everything works. But if you route 192.168.1.0/24 via
> interface eth1 on host B and let host A ping 192.168.2.1 you get no reply.
> With tcpdump we see the icmp packets arrive at vlan1@eth0, but no icmp echo
> reply being generated neither on vlan1 nor eth1.
> Kernels 2.6.30.X and below do not show this behaviour.
> Is this intended? Do we need to reconfigure something to restore the old
> behaviour?
> 

Asymmetric routing?

Check your rp_filter settings

grep . `find /proc/sys/net -name rp_filter`

rp_filter - INTEGER
        0 - No source validation.
        1 - Strict mode as defined in RFC3704 Strict Reverse Path
            Each incoming packet is tested against the FIB and if the interface
            is not the best reverse path the packet check will fail.
            By default failed packets are discarded.
        2 - Loose mode as defined in RFC3704 Loose Reverse Path
            Each incoming packet's source address is also tested against the FIB
            and if the source address is not reachable via any interface
            the packet check will fail.

        Current recommended practice in RFC3704 is to enable strict mode
        to prevent IP spoofing from DDos attacks. If using asymmetric routing
        or other complicated routing, then loose mode is recommended.

        conf/all/rp_filter must also be set to non-zero to do source validation
        on the interface

        Default value is 0. Note that some distributions enable it
        in startup scripts.
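
To make the difference between the modes concrete, here is a toy userspace
model of the check described above. It is a sketch under stated
assumptions, not the kernel's FIB code: struct toy_route, toy_rp_check()
and the RP_* names are hypothetical, prefixes are assumed not to overlap
(so the first match stands in for the "best" route), and addresses are
plain host-order integers.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical routing entry; not a kernel structure. */
struct toy_route {
	uint32_t prefix;	/* network address, host byte order */
	uint32_t mask;		/* netmask, host byte order */
	const char *oif;	/* interface the route points to */
};

enum { RP_OFF = 0, RP_STRICT = 1, RP_LOOSE = 2 };

/* Returns 1 if a packet from src arriving on iif passes the check, else 0. */
int toy_rp_check(const struct toy_route *tbl, int n, uint32_t src,
		 const char *iif, int mode)
{
	int i;

	if (mode == RP_OFF)
		return 1;	/* 0: no source validation */

	for (i = 0; i < n; i++) {
		if ((src & tbl[i].mask) != tbl[i].prefix)
			continue;
		/* Loose: any route back to the source is enough. */
		if (mode == RP_LOOSE)
			return 1;
		/* Strict: the route back must use the arrival interface. */
		return strcmp(tbl[i].oif, iif) == 0;
	}
	return 0;		/* source not reachable via any interface */
}
```

With a table that routes 192.168.1.0/24 via eth1, a packet from 192.168.1.1
arriving on vlan1 fails in strict mode and passes in loose mode, which
matches the symptom in the report if rp_filter is indeed the cause.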



^ permalink raw reply

* Re: [iproute2] tc action mirred    question
From: Xiaofei Wu @ 2009-09-14 13:44 UTC (permalink / raw)
  To: hadi; +Cc: linux netdev
In-Reply-To: <1252704524.25158.42.camel@dogo.mojatatu.com>

>> 

>> How to do this. Could you show me the example commands?   Thank you.
>> 
>Add the rule to mirror on lo
>Add the rule to pedit for mirrored packet on eth0

I did two experiments. One works. The result of the other is not what I expected, and I don't know why.

(1)
 A
| |
 C

A: eth0  192.168.1.242/24
   wlan1 192.168.4.5/24

C: wlan1 192.168.4.202/24
   eth0  192.168.1.215/24
On node A, I mirrored packets to wlan1 (eth0 -> wlan1) and modified the dst and src MAC addresses (to transmit to wlan1 of node C).
When I run 'ping 192.168.1.215' on node A, each request gets two replies. That's OK.

(2)
 A
/ |
B |
\ |
 C

A: eth0  192.168.1.242/24
   wlan1 192.168.2.5/24

B: wlan1 192.168.2.11/24
   wlan2 192.168.4.11/24

C: wlan1 192.168.4.202/24
   eth0  192.168.1.215/24

On node A, I run this to mirror and pedit packets.
---
#tc qdisc add dev eth0 handle 1: root prio
#tc filter add dev eth0 parent 1: protocol ip prio 10 u32 \
match ip src 192.168.1.0/24 flowid 1:16 \
action mirred egress mirror dev wlan1

#tc qdisc add dev wlan1 handle 1: root prio
#tc filter add dev wlan1 parent 1: protocol ip prio 10 u32 \
match ip src 192.168.1.0/24 flowid 1:16 \
action pedit munge offset -14 u16 set 0x0023 \
munge offset -12 u32 set 0xcdafecda \
munge offset -8 u32 set 0x0023cdaf \
munge offset -4 u32 set 0xd0740800
---
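
For reference, the byte layout those four munge keys should produce can be
checked in userspace. This is a sketch only: it assumes each value is
written in network byte order at the given offset relative to the IP
header, and the names TOY_ETH_HLEN, put_be16(), put_be32() and
munge_eth_header() are made up for illustration.

```c
#include <stdint.h>

#define TOY_ETH_HLEN 14	/* dst MAC (6) + src MAC (6) + EtherType (2) */

/* Store v at p in network (big-endian) byte order. */
void put_be16(uint8_t *p, uint16_t v)
{
	p[0] = (uint8_t)(v >> 8);
	p[1] = (uint8_t)(v & 0xff);
}

void put_be32(uint8_t *p, uint32_t v)
{
	put_be16(p, (uint16_t)(v >> 16));
	put_be16(p + 2, (uint16_t)(v & 0xffff));
}

/*
 * Apply the four munge keys to a 14-byte Ethernet header. The offsets are
 * taken relative to the IP header, so "offset -14" is byte 0 of hdr.
 */
void munge_eth_header(uint8_t hdr[TOY_ETH_HLEN])
{
	uint8_t *ip = hdr + TOY_ETH_HLEN; /* where the IP header would start */

	put_be16(ip - 14, 0x0023);	/* dst MAC bytes 0-1 */
	put_be32(ip - 12, 0xcdafecdaU);	/* dst MAC bytes 2-5 */
	put_be32(ip - 8, 0x0023cdafU);	/* src MAC bytes 0-3 */
	put_be32(ip - 4, 0xd0740800U);	/* src MAC bytes 4-5 + EtherType */
}
```

Under these assumptions the header ends up with dst MAC 00:23:cd:af:ec:da,
src MAC 00:23:cd:af:d0:74 and EtherType 0x0800 (IPv4).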

the routing table of node B
---
#route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
192.168.4.0     0.0.0.0         255.255.255.0   U     0      0        0 wlan2
192.168.2.0     0.0.0.0         255.255.255.0   U     0      0        0 wlan1
0.0.0.0         192.168.4.202     0.0.0.0       UG    0      0        0 wlan2

#cat /proc/sys/net/ipv4/ip_forward
1
---

When I run 'ping 192.168.1.215' (the IP address of node C's eth0) on node A, each request gets 'only' one reply. It's strange.
On node B,
window1:  'tcpdump -i wlan1 -n -e', I can see the mirrored packets.
window2:  'tcpdump -i wlan2 -n -e', I see nothing.
It seems that node B didn't forward the mirrored packets. So I did another experiment to check it.
I am sure node B can forward packets. But it didn't forward the mirrored packets. Why? (Is something wrong with the mirrored packets?)

regards,
wu

^ permalink raw reply

* Re: [PATCH 1/8] networking/fanotify: declare fanotify socket numbers
From: jamal @ 2009-09-14 13:15 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Eric Paris, David Miller, linux-kernel, linux-fsdevel, netdev,
	viro, alan, hch, balbir
In-Reply-To: <20090914000303.GA30621@shareable.org>

On Mon, 2009-09-14 at 01:03 +0100, Jamie Lokier wrote:

> If you have enough memory to remember _what_ to retransmit, then you
> have enough memory to buffer a fixed-size message.  It just depends on
> how you do the buffering.  To say netlink drops the message and you
> can retry is just saying that the buffering is happening one step
> earlier, before netlink.  

It is the receiver that drops the message because of overruns,
e.g. when the receiver doesn't keep up.

> That's what I mean by netlink being a
> pointless complication for this, because you can just as easily write
> code which gets to the message to userspace without going through
> netlink and with no chance of it being dropped.
> 

Sure you can do that with netlink too. Whether it is overcomplicated
needs to be weighed out.

> Yes.  It uses positive acknowledge and flow control, because these
> match naturally with what fanotify does at the next higher level.
> 
> The process generating the multicast (e.g. trying to write a file) is
> blocked until the receiver gets the message, handles it and
> acknowledges with a "yes you can" or "no you can't" response.
> 
> That's part of fanotify's design.  The pattern conveniently has no
> issues with using unbounded memory for message, because the sending
> process is blocked.
> 

Ok, I understand better I think ;-> So it is a synchronous type of
operation, whereas in netlink-type multicast the optimization is to
make the operation async.

> True you only need one skb.  But netlink doesn't handle waiting for
> positive acknowledge responses from every receiver, and combining
> their value, does it?  

It is not netlink per se. It is how you use netlink for your app.
Classical one-to-many operations have the sender (mostly the kernel)
do async sends to the listeners. It is up to the listener to catch
up if there are any holes. But this does not seem to be what you want
for fanotify.

> You can't really take advantage of netlink's
> built in multicast, because to known when it has all the responses,
> the fanotify layer has to track the subscriber list itself anyway.

True, given my understanding so far, fanotify has to track the subscriber
list, i.e. something along the lines of:
- send a single multicast message to a set of listeners
- wait for a response from all subscribers
- if not all subscribers have responded within a given timeout, retransmit,
up to a maximum number of retransmissions

The chance of losing a message in such a case is zero if the socket
buffer on each listener/receiver is larger than one fanotify event
message. You still have to alloc the message - and that may fail.

> What I'm saying is perhaps skbs are useful for fanotify, but I don't
> know that netlink's multicasting is useful.  But storing the messages
> in skbs for transmission, and using parts of netlink to manage them,
> and to provide some of the API, that might be useful.

The multicast-a-single-skb part is useful. Of course, its usefulness
diminishes as the number of listeners per subtree goes down (because it
reduces to one skb per listener). So all this depends on how fanotify is
going to be used.
The one thing I am not sure of is how you map a multicast group to a
subtree. In netlink, the groups to which multiple listeners subscribe
are 32-bit identifiers. I suppose one approach could be to register
for the event of interest, get an ID, then use the ID to listen to a
multicast group of that ID. This way whoever is issuing the ID can also
factor in permissions and subtree overlap of the listener (and whether
the events are already being listened to in a known ID).
Alternatively, to your statement above, if fanotify is keeping track of
all subscribers then it can replicast a single event instead and just
bump the refcount on the skb for each sent-to-user (and still use one
skb).

> You do get nothing unless you register interest.  The problem is
> there's no way to register interest on just a subtree, so the fanotify
> approach is let you register for events on the whole filesystem, and
> let the userspace daemon filter paths.  At least it's decisions can be
> cached, although I'm not sure how that works when multiple processes
> want to monitor overlapping parts of the filesystem.

I guess if the non-optimal part happens only once and subsequent cached
filters happen faster, then one could look at that as the cost of setup.
I think, given that you are capable of creating such a cache, it would
be cheaper to make such a decision at registration time.

> It doesn't sound scalable to me, either, and that's why I don't like
> this part, and described a solution to monitoring subtrees - which
> would also solve the problem for inotify.  (Both use fsnotify under
> the hood, and that's where subtree notification would go).
>
> Eric's mentioned interest in a way to monitor subtrees, but that
> hasn't gone anywhere as far as I know.  He doesn't seem convinced by
> my solution - or even that scalability will be an issue.  I think
> there's a bit of vision lacking here, and I'll admit I'm more
> interested in the inotify uses of fsnotify (being able to detect
> changes) than the fanotify uses (being able to _block_ or _modify_
> changes).  I think both inotify and fanotify ought to benefit from the
> same improvements to file monitoring.
> 

The subtree overlap problem seems to invoke some well-known computer
science algorithms, no? I.e. tell me, oracle: given an event on nodeX of
this tree, which subscribers need to be notified?
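
As a rough illustration of that oracle question, the naive answer is a
path-prefix test per subscriber. This is a sketch only, not fsnotify or
fanotify code: path_in_subtree() and subscribers_for() are hypothetical
names, paths are assumed to be absolute and normalized, and the linear
scan would need a trie or similar to scale.

```c
#include <string.h>

/* Returns 1 if path lies at or below the directory dir, else 0. */
int path_in_subtree(const char *dir, const char *path)
{
	size_t n = strlen(dir);

	/* Ignore a trailing '/' so "/srv/" and "/srv" behave the same. */
	if (n > 1 && dir[n - 1] == '/')
		n--;
	if (strncmp(dir, path, n) != 0)
		return 0;
	/* The root directory covers every absolute path. */
	if (n == 1 && dir[0] == '/')
		return 1;
	/* Otherwise the match must end on a path component boundary. */
	return path[n] == '\0' || path[n] == '/';
}

/* Bitmask of subscribers (at most 32) whose subtree covers path. */
unsigned int subscribers_for(const char *const subs[], int nsubs,
			     const char *path)
{
	unsigned int mask = 0;
	int i;

	for (i = 0; i < nsubs && i < 32; i++)
		if (path_in_subtree(subs[i], path))
			mask |= 1u << i;
	return mask;
}
```

With subscribers on /home, /home/alice and /srv, an event on
/home/alice/notes.txt selects the first two (overlapping subtrees), while
an event on /homework selects none.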

> I believe it would cause 10000 events, yes, even if they are files
> that userspace policy is not interested in.  Eric, is that right?
> 
> However I believe after the first grep, subsequent greps' decisions
> would be cached by marking the inodes.  I'm not sure what happens if
> two fanotify monitors both try marking the inodes.
>
> Arguably if a fanotify monitor is running before those files are in
> page cache anyway, then I/O may dominate, and when the files are
> cached, fanotify has already cached it's decisions in the kernel.
> However fanotify is synchronous: each new file access involves a round
> trip to the fanotify userspace and back before it can proceed, so
> there's quite a lot of IPC and scheduling too.  Without testing, it's
> hard to guess how it'll really perform.
> 

So if you can mark inodes, why not do it at register time?

> > > While skbs and netlink aren't that slow, I suspect they're an order of
> > > magnitude or two slower than, say, epoll or inotify at passing events
> > > around.
> > 
> > not familiar with inotify.
> 
> inotify is like dnotify, and like a signal or epoll: a message that
> something happened.  You register interest in individual files or
> directories only, and inotify does not (yet) provide a way to monitor
> the whole filesystem or a subtree.
> 
> fanotify is different: it provides access control, and can _refuse_
> attempts to read file X, or even modify the file before permitting the
> file to be read.
> 

Ok, I think I understand more about fanotify now. It is more of an
access control scheme than a mass notification scheme (which is what I
thought earlier).
Hrm, it does sound like something closer to selinux if it is simple
enough to require answers to simple questions like "should this
operation continue?"

> > There's a difference between events which are abbreviated in the form
> > "hey some read happened on fd you are listening on" vs "hey a read
> > of file X for 16 bytes at offset 200 by process Y just occurred while
> > at the same time process Z was writing at offset 2000". The latter
> > (which netlink will give you) includes a lot more attribute details
> > which could be filtered or can be extended to include a lot
> > more. The former (what epoll will give you) is merely a signal.
> 
> Firstly, it's really hard to retain the ordering of userspace events
> like that in a useful way, given the non-deterministic parallelism
> going on with multiple processes doing I/O to the same file :-)
> 

Bad example ;->
That was not meant to be anything clever - rather to demonstrate that
netlink allows you to send many attributes with events and that you can
add as many as you want over a period of time (instead of hardcoding it
at design/coding time).

On a tangent: I would love to get more than simple events
(read/write/exception) on a file. Probably more on the writes
than on reads; example "offset X, length Y has been deleted" etc.
I would still love the option to exercise my rights to simple
events like read/write/exception

cheers,
jamal

^ permalink raw reply

* [PATCH] Phonet: Netlink event for autoconfigured addresses
From: Rémi Denis-Courmont @ 2009-09-14 13:10 UTC (permalink / raw)
  To: netdev; +Cc: Rémi Denis-Courmont

From: Rémi Denis-Courmont <remi.denis-courmont@nokia.com>

Signed-off-by: Rémi Denis-Courmont <remi.denis-courmont@nokia.com>
---
 net/phonet/pn_dev.c |    9 ++++++++-
 1 files changed, 8 insertions(+), 1 deletions(-)

diff --git a/net/phonet/pn_dev.c b/net/phonet/pn_dev.c
index 2f65dca..5f42f30 100644
--- a/net/phonet/pn_dev.c
+++ b/net/phonet/pn_dev.c
@@ -209,7 +209,14 @@ static int phonet_device_autoconf(struct net_device *dev)
 						SIOCPNGAUTOCONF);
 	if (ret < 0)
 		return ret;
-	return phonet_address_add(dev, req.ifr_phonet_autoconf.device);
+
+	ASSERT_RTNL();
+	ret = phonet_address_add(dev, req.ifr_phonet_autoconf.device);
+	if (ret)
+		return ret;
+	phonet_address_notify(RTM_NEWADDR, dev,
+				req.ifr_phonet_autoconf.device);
+	return 0;
 }
 
 /* notify Phonet of device events */
-- 
1.6.0.4


^ permalink raw reply related

* [PATCH] cdc-phonet: remove noisy debug statement
From: Rémi Denis-Courmont @ 2009-09-14 13:10 UTC (permalink / raw)
  To: netdev; +Cc: Rémi Denis-Courmont
In-Reply-To: <1252933829-12442-1-git-send-email-remi@remlab.net>

From: Rémi Denis-Courmont <remi.denis-courmont@nokia.com>

Signed-off-by: Rémi Denis-Courmont <remi.denis-courmont@nokia.com>
---
 drivers/net/usb/cdc-phonet.c |    1 -
 1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/drivers/net/usb/cdc-phonet.c b/drivers/net/usb/cdc-phonet.c
index 97e54d9..33d5c57 100644
--- a/drivers/net/usb/cdc-phonet.c
+++ b/drivers/net/usb/cdc-phonet.c
@@ -264,7 +264,6 @@ static int usbpn_ioctl(struct net_device *dev, struct ifreq *ifr, int cmd)
 	switch (cmd) {
 	case SIOCPNGAUTOCONF:
 		req->ifr_phonet_autoconf.device = PN_DEV_PC;
-		printk(KERN_CRIT"device is PN_DEV_PC\n");
 		return 0;
 	}
 	return -ENOIOCTLCMD;
-- 
1.6.0.4


^ permalink raw reply related

* [PATCH] pkt_sched: Fix tx queue selection in tc_modify_qdisc
From: Jarek Poplawski @ 2009-09-14 12:22 UTC (permalink / raw)
  To: David Miller; +Cc: Patrick McHardy, netdev

After the recent mq change there is the new select_queue qdisc class
method used in tc_modify_qdisc, but it works OK only for direct child
qdiscs of mq qdisc. Grandchildren always get the first tx queue, which
would give wrong qdisc_root etc. results (e.g. for sch_htb as child of
sch_prio). This patch fixes it by using parent's dev_queue for such
grandchildren qdiscs. The select_queue method is replaced BTW with the
static qdisc_select_tx_queue function (it's used only in one place).

Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
---

 include/net/sch_generic.h |    1 -
 net/sched/sch_api.c       |   29 +++++++++++++++++++++--------
 net/sched/sch_mq.c        |   10 ----------
 3 files changed, 21 insertions(+), 19 deletions(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 88eb9de..865120c 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -81,7 +81,6 @@ struct Qdisc
 struct Qdisc_class_ops
 {
 	/* Child qdisc manipulation */
-	unsigned int		(*select_queue)(struct Qdisc *, struct tcmsg *);
 	int			(*graft)(struct Qdisc *, unsigned long cl,
 					struct Qdisc *, struct Qdisc **);
 	struct Qdisc *		(*leaf)(struct Qdisc *, unsigned long cl);
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index 3af1061..223a6bc 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -990,6 +990,24 @@ static int tc_get_qdisc(struct sk_buff *skb, struct nlmsghdr *n, void *arg)
 	return 0;
 }
 
+static struct netdev_queue *qdisc_select_tx_queue(struct net_device *dev,
+						  struct Qdisc *p, u32 clid)
+{
+	unsigned long ntx;
+
+	if (!p)
+		return netdev_get_tx_queue(dev, 0);
+
+	if (!(p->flags & TCQ_F_MQROOT))
+		return p->dev_queue;
+
+	ntx = TC_H_MIN(clid) - 1;
+	if (ntx >= dev->num_tx_queues)
+		ntx = 0;
+
+	return netdev_get_tx_queue(dev, ntx);
+}
+
 /*
    Create/change qdisc.
  */
@@ -1110,16 +1128,11 @@ create_n_graft:
 		q = qdisc_create(dev, &dev->rx_queue, p,
 				 tcm->tcm_parent, tcm->tcm_parent,
 				 tca, &err);
-	else {
-		unsigned int ntx = 0;
-
-		if (p && p->ops->cl_ops && p->ops->cl_ops->select_queue)
-			ntx = p->ops->cl_ops->select_queue(p, tcm);
-
-		q = qdisc_create(dev, netdev_get_tx_queue(dev, ntx), p,
+	else
+		q = qdisc_create(dev, qdisc_select_tx_queue(dev, p, clid), p,
 				 tcm->tcm_parent, tcm->tcm_handle,
 				 tca, &err);
-	}
+
 	if (q == NULL) {
 		if (err == -EAGAIN)
 			goto replay;
diff --git a/net/sched/sch_mq.c b/net/sched/sch_mq.c
index dd5ee02..4ad949b 100644
--- a/net/sched/sch_mq.c
+++ b/net/sched/sch_mq.c
@@ -125,15 +125,6 @@ static struct netdev_queue *mq_queue_get(struct Qdisc *sch, unsigned long cl)
 	return netdev_get_tx_queue(dev, ntx);
 }
 
-static unsigned int mq_select_queue(struct Qdisc *sch, struct tcmsg *tcm)
-{
-	unsigned int ntx = TC_H_MIN(tcm->tcm_parent);
-
-	if (!mq_queue_get(sch, ntx))
-		return 0;
-	return ntx - 1;
-}
-
 static int mq_graft(struct Qdisc *sch, unsigned long cl, struct Qdisc *new,
 		    struct Qdisc **old)
 {
@@ -213,7 +204,6 @@ static void mq_walk(struct Qdisc *sch, struct qdisc_walker *arg)
 }
 
 static const struct Qdisc_class_ops mq_class_ops = {
-	.select_queue	= mq_select_queue,
 	.graft		= mq_graft,
 	.leaf		= mq_leaf,
 	.get		= mq_get,
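
The selection rule added in sch_api.c can be exercised outside the kernel
with a small model. This is a sketch, not the kernel code: struct
toy_qdisc, TOY_F_MQROOT, TOY_H_MIN() and toy_select_tx_queue() are
stand-ins invented for illustration, and queues are represented by their
index rather than by struct netdev_queue pointers.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_F_MQROOT 0x1u		/* stands in for TCQ_F_MQROOT */
#define TOY_H_MIN(h) ((h) & 0xFFFFu)	/* minor part of a handle */

/* Hypothetical, reduced view of a qdisc. */
struct toy_qdisc {
	unsigned int flags;
	unsigned int queue;	/* index of the tx queue it is attached to */
};

/* Tx queue index for a new qdisc with parent p and parent class id clid. */
unsigned int toy_select_tx_queue(unsigned int num_tx_queues,
				 const struct toy_qdisc *p, uint32_t clid)
{
	unsigned int ntx;

	if (p == NULL)
		return 0;		/* no parent: first queue */
	if (!(p->flags & TOY_F_MQROOT))
		return p->queue;	/* grandchild: inherit parent's queue */

	ntx = TOY_H_MIN(clid) - 1;	/* mq class minors are 1-based */
	if (ntx >= num_tx_queues)	/* also catches minor 0 wrapping */
		ntx = 0;
	return ntx;
}
```

A direct child of mq with class 1:3 lands on queue index 2, while a qdisc
under a non-mq parent (e.g. prio) follows whatever queue that parent
already uses, which is the grandchild case described in the changelog.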

^ permalink raw reply related

* Re: more troubles with bridge in netns
From: Daniel Lezcano @ 2009-09-14 11:28 UTC (permalink / raw)
  To: Atis Elsts; +Cc: netdev
In-Reply-To: <200909141419.12330.atis@mikrotik.com>

Atis Elsts wrote:
> On Tuesday 08 September 2009 11:40:44 Daniel Lezcano wrote:
>   
>> Atis Elsts wrote:
>>     
>>> Trying to add bridge interface from userspace program, after moving the
>>> program to a new network namespace, causes kernel to crash. I am using
>>> latest kernel version from git (2.6.31-rc9).
>>> The bug is easy to reproduce - just compile and run the attached C
>>> program.
>>>
>>> I see that bridge interface has NETIF_F_NETNS_LOCAL flag, but as I
>>> understand, this flag simply means that a device cannot be *moved* across
>>> network namespaces, not that it cannot be *created* in other namespaces.
>>>       
>> Yep, very easy to reproduce :/
>> The sysfs has not been disabled for the bridge. I will try to fix it as
>> soon as I can.
>>
>> Thanks
>>   -- Daniel
>>     
>
> Hello,
>
> please let me know when the sysfs patch for bridge is available. At the moment 
> I managed to get it to work by just commenting out all sysfs stuff for bridge 
> module. However, a new problem appears now. After running C program 
> (attached) that creates a bridge in network namespace and attaches an 
> interface to it, I got this message repeatedly:
>  kernel:[  466.758908] unregister_netdevice: waiting for lo to become free. 
> Usage count = 2
>
> It sems pretty unlikely that my kernel changes could have caused this?
>
> The unregister_netdevice message does not appear, however, if I uncomment this 
> line in child.c:
>     system("brctl setfd sim_br0 0");
>   

I was about to send a patch to disable the bridge per namespace, as it
seems it was never tested.
Can you send me your kernel patch?

Thanks.
  -- Daniel

^ permalink raw reply
