Netdev List

Netdev List
 help / color / mirror / Atom feed

* [PATCH 3/3] IPv6: Generic TTL Security Mechanism (unified version)
From: Stephen Hemminger @ 2010-04-03 23:21 UTC (permalink / raw)
  To: davem, Pekka Savola, YOSHIFUJI Hideaki, Nick Hilliard; +Cc: netdev
In-Reply-To: <20100403232103.923025940@vyatta.com>

[-- Attachment #1: gtsm-ipv6.diff --]
[-- Type: text/plain, Size: 3126 bytes --]

This patch is one alternative IPv6 support for RFC5082 Generalized TTL
Security Mechanism. 

This version takes a simplest (but least pure) approach.
It uses the same socket option for IPv6 as IPv4 because
the TCP code has to deal with mapped addresses already.

With this method, the server doesn't have to deal with both IPv4 and IPv6
socket options. But the client still does have to handle the different
options.

On client:
	int ttl = 255;
	getaddrinfo(argv[1], argv[2], &hint, &result);

	for (rp = result; rp != NULL; rp = rp->ai_next) {
		s = socket(rp->ai_family, rp->ai_socktype, rp->ai_protocol);
		if (s < 0) continue;

		if (rp->ai_family == AF_INET) {
			setsockopt(s, IPPROTO_IP, IP_TTL, &ttl, sizeof(ttl));
		} else if (rp->ai_family == AF_INET6) {
			setsockopt(s, IPPROTO_IPV6,  IPV6_UNICAST_HOPS, 
					&ttl, sizeof(ttl)))
		}
			
		if (connect(s, rp->ai_addr, rp->ai_addrlen) == 0) {
		   ...

On server:
	unsigned char minttl = 255 - maxhops;
   
	getaddrinfo(NULL, port, &hints, &result);
	for (rp = result; rp != NULL; rp = rp->ai_next) {
		s = socket(rp->ai_family, rp->ai_socktype, rp->ai_protocol);
		if (s < 0) continue;

		setsockopt(s, IPPROTO_IP, IP_MINTTL, &minttl, sizeof(minttl));
			
		if (bind(s, rp->ai_addr, rp->ai_addrlen) == 0)
			break
..


Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>

---
 net/ipv4/tcp_ipv4.c |   15 +++++++++++----
 net/ipv6/tcp_ipv6.c |   10 ++++++++++
 2 files changed, 21 insertions(+), 4 deletions(-)

--- a/net/ipv6/tcp_ipv6.c	2010-04-02 21:19:39.692013672 -0700
+++ b/net/ipv6/tcp_ipv6.c	2010-04-03 15:55:43.778224848 -0700
@@ -349,6 +349,11 @@ static void tcp_v6_err(struct sk_buff *s
 	if (sk->sk_state == TCP_CLOSE)
 		goto out;
 
+	if (ipv6_hdr(skb)->hop_limit < inet_sk(sk)->min_ttl) {
+		NET_INC_STATS_BH(net, LINUX_MIB_TCPMINTTLDROP);
+		goto out;
+	}
+
 	tp = tcp_sk(sk);
 	seq = ntohl(th->seq);
 	if (sk->sk_state != TCP_LISTEN &&
@@ -1717,6 +1722,11 @@ process:
 	if (sk->sk_state == TCP_TIME_WAIT)
 		goto do_time_wait;
 
+	if (ipv6_hdr(skb)->hop_limit < inet_sk(sk)->min_ttl) {
+		NET_INC_STATS_BH(net, LINUX_MIB_TCPMINTTLDROP);
+		goto discard_and_relse;
+	}
+
 	if (!xfrm6_policy_check(sk, XFRM_POLICY_IN, skb))
 		goto discard_and_relse;
 
--- a/net/ipv4/tcp_ipv4.c	2010-04-02 21:19:39.682014278 -0700
+++ b/net/ipv4/tcp_ipv4.c	2010-04-02 21:20:25.571077252 -0700
@@ -1660,10 +1660,14 @@ process:
 	if (sk->sk_state == TCP_TIME_WAIT)
 		goto do_time_wait;
 
-	if (unlikely(iph->ttl < inet_sk(sk)->min_ttl)) {
-		NET_INC_STATS_BH(net, LINUX_MIB_TCPMINTTLDROP);
-		goto discard_and_relse;
-	}
+#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
+	if (skb->protocol == htons(ETH_P_IPV6)) {
+		if (ipv6_hdr(skb)->hop_limit < inet_sk(sk)->min_ttl)
+			goto min_ttl_discard;
+	} else
+#endif
+	if (iph->ttl < inet_sk(sk)->min_ttl)
+		goto min_ttl_discard;
 
 	if (!xfrm4_policy_check(sk, XFRM_POLICY_IN, skb))
 		goto discard_and_relse;
@@ -1716,6 +1720,9 @@ discard_it:
 	kfree_skb(skb);
 	return 0;
 
+min_ttl_discard:
+	NET_INC_STATS_BH(net, LINUX_MIB_TCPMINTTLDROP);
+
 discard_and_relse:
 	sock_put(sk);
 	goto discard_it;

-- 


^ permalink raw reply

* Re: [PATCH 1/4] flow: virtualize flow cache entry methods
From: Herbert Xu @ 2010-04-04  2:06 UTC (permalink / raw)
  To: Timo Teräs; +Cc: netdev
In-Reply-To: <4BB7A2B8.4040405@iki.fi>

On Sat, Apr 03, 2010 at 11:19:04PM +0300, Timo Teräs wrote:
>
> If someone is then removing a net driver, we still execute
> flush on the 'device down' hook, and all stale bundles
> get flushed.

Not if the bundle belongs to a policy recently deleted.

> But yes, this means that xfrm_policy struct can now be held
> allocated up to ten extra minutes. But it's only memory that
> it's holding, not any extra refs. And it's still reclaimable
> by the GC.

You also hold down the bundle xdst's along with it, which can
hold netdev references preventing modules from being unloaded.

> If this feels troublesome, we could add asynchronous flush
> request that would be called on policy removal. Or even stick
> to the synchronous one.

How about change xfrm_flush_bundles to flush bundles from the
cache instead of xfrm_policy?

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [PATCH 1/4] flow: virtualize flow cache entry methods
From: Timo Teräs @ 2010-04-04  5:50 UTC (permalink / raw)
  To: Herbert Xu; +Cc: netdev
In-Reply-To: <20100404020657.GA8520@gondor.apana.org.au>

Herbert Xu wrote:
> On Sat, Apr 03, 2010 at 11:19:04PM +0300, Timo Teräs wrote:
>> If someone is then removing a net driver, we still execute
>> flush on the 'device down' hook, and all stale bundles
>> get flushed.
> 
> Not if the bundle belongs to a policy recently deleted.
> 
>> But yes, this means that xfrm_policy struct can now be held
>> allocated up to ten extra minutes. But it's only memory that
>> it's holding, not any extra refs. And it's still reclaimable
>> by the GC.
> 
> You also hold down the bundle xdst's along with it, which can
> hold netdev references preventing modules from being unloaded.
> 
>> If this feels troublesome, we could add asynchronous flush
>> request that would be called on policy removal. Or even stick
>> to the synchronous one.
> 
> How about change xfrm_flush_bundles to flush bundles from the
> cache instead of xfrm_policy?

For the common case:

1. Policy deleted; policy->walk.dead set, policy->genid incremented
2. NETDEV_DOWN hook called, calls flow_cache_flush()
3. flow_cache_flush enumerates all policy and bundle refs
   in it's cache
4. for each bundle xfrm_bundle_check_fce() is called, which
   calls stale_bundle()
5. all bundles using stale policy, fail that check because
     xdst->policy_genid != xdst->pols[0]->genid
   (checked in xfrm_bundle_ok)
6. flow cache calls entry's ->delete which is dst_free for bundles
7. flow_cache_flush() returns

flow_cache_flush really frees the bundles in it on flush.

But now that I look my code again. Your statement is true for
per-socket bundles. They would not get deleted in this case.
I'll change NETDEV_DOWN to call garbage collect instead of
flow cache flush which will then also free the per-socket bundles.

^ permalink raw reply

* Re: [PATCH 1/4] flow: virtualize flow cache entry methods
From: Herbert Xu @ 2010-04-04  5:58 UTC (permalink / raw)
  To: Timo Teräs; +Cc: netdev
In-Reply-To: <4BB828B1.5090109@iki.fi>

On Sun, Apr 04, 2010 at 08:50:41AM +0300, Timo Teräs wrote:
>
> For the common case:
>
> 1. Policy deleted; policy->walk.dead set, policy->genid incremented
> 2. NETDEV_DOWN hook called, calls flow_cache_flush()
> 3. flow_cache_flush enumerates all policy and bundle refs
>   in it's cache
> 4. for each bundle xfrm_bundle_check_fce() is called, which
>   calls stale_bundle()
> 5. all bundles using stale policy, fail that check because
>     xdst->policy_genid != xdst->pols[0]->genid
>   (checked in xfrm_bundle_ok)
> 6. flow cache calls entry's ->delete which is dst_free for bundles
> 7. flow_cache_flush() returns

Ah, you're doing it in 2/4.  Can we please have each patch be
a self-contained unit? It should be possible to apply 1/4 and
have a resulting kernel that works properly without having to
apply the rest of your patches.

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [PATCH 1/4] flow: virtualize flow cache entry methods
From: Timo Teräs @ 2010-04-04  6:07 UTC (permalink / raw)
  To: Herbert Xu; +Cc: netdev
In-Reply-To: <20100404055830.GA9484@gondor.apana.org.au>

Herbert Xu wrote:
> On Sun, Apr 04, 2010 at 08:50:41AM +0300, Timo Teräs wrote:
>> For the common case:
>>
>> 1. Policy deleted; policy->walk.dead set, policy->genid incremented
>> 2. NETDEV_DOWN hook called, calls flow_cache_flush()
>> 3. flow_cache_flush enumerates all policy and bundle refs
>>   in it's cache
>> 4. for each bundle xfrm_bundle_check_fce() is called, which
>>   calls stale_bundle()
>> 5. all bundles using stale policy, fail that check because
>>     xdst->policy_genid != xdst->pols[0]->genid
>>   (checked in xfrm_bundle_ok)
>> 6. flow cache calls entry's ->delete which is dst_free for bundles
>> 7. flow_cache_flush() returns
> 
> Ah, you're doing it in 2/4.  Can we please have each patch be
> a self-contained unit? It should be possible to apply 1/4 and
> have a resulting kernel that works properly without having to
> apply the rest of your patches.

With 1/4 only, the bundle deletion is not touched. In that case
the policy GC deletes explicitly the bundles. The bundles get
deleted immediately, and only the struct xfrm_policy might get
held up allocated longer.

The code flow would be:
 1. xfrm_policy_kill() queues to GC
 2. xfrm_policy_gc_kill() called from xfrm_policy_gc_task()
    frees all bundles in that policy
 3. xfrm_policy_gc_kill() releases it's reference
 4. ... time passes (flush, randomization, or flow hit occurs)
 5. flow cache releases it's final reference, calls
    xfrm_policy_destroy() which only frees the xfrm_policy memory

^ permalink raw reply

* Re: [PATCH 1/4] flow: virtualize flow cache entry methods
From: Herbert Xu @ 2010-04-04  6:19 UTC (permalink / raw)
  To: Timo Teräs; +Cc: netdev
In-Reply-To: <4BB82C90.70809@iki.fi>

On Sun, Apr 04, 2010 at 09:07:12AM +0300, Timo Teräs wrote:
>
> With 1/4 only, the bundle deletion is not touched. In that case
> the policy GC deletes explicitly the bundles. The bundles get
> deleted immediately, and only the struct xfrm_policy might get
> held up allocated longer.
>
> The code flow would be:
> 1. xfrm_policy_kill() queues to GC
> 2. xfrm_policy_gc_kill() called from xfrm_policy_gc_task()
>    frees all bundles in that policy
> 3. xfrm_policy_gc_kill() releases it's reference
> 4. ... time passes (flush, randomization, or flow hit occurs)
> 5. flow cache releases it's final reference, calls
>    xfrm_policy_destroy() which only frees the xfrm_policy memory

With 1/4 only, you've removed the flow cache flush when policies
are deleted.  However, you don't add the flow cache flush to the
NETDEV_DOWN case until 2/4.  So with only 1/4, bundles (and hence
netdev objects) may be held for up to 10 minutes.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [PATCH 1/4] flow: virtualize flow cache entry methods
From: Timo Teräs @ 2010-04-04  6:28 UTC (permalink / raw)
  To: Herbert Xu; +Cc: netdev
In-Reply-To: <20100404061941.GA9642@gondor.apana.org.au>

Herbert Xu wrote:
> On Sun, Apr 04, 2010 at 09:07:12AM +0300, Timo Teräs wrote:
>> With 1/4 only, the bundle deletion is not touched. In that case
>> the policy GC deletes explicitly the bundles. The bundles get
>> deleted immediately, and only the struct xfrm_policy might get
>> held up allocated longer.
>>
>> The code flow would be:
>> 1. xfrm_policy_kill() queues to GC
>> 2. xfrm_policy_gc_kill() called from xfrm_policy_gc_task()
>>    frees all bundles in that policy
>> 3. xfrm_policy_gc_kill() releases it's reference
>> 4. ... time passes (flush, randomization, or flow hit occurs)
>> 5. flow cache releases it's final reference, calls
>>    xfrm_policy_destroy() which only frees the xfrm_policy memory
> 
> With 1/4 only, you've removed the flow cache flush when policies
> are deleted.  However, you don't add the flow cache flush to the
> NETDEV_DOWN case until 2/4.  So with only 1/4, bundles (and hence
> netdev objects) may be held for up to 10 minutes.

No. The flow cache flush removal does not prevent bundle deletion.
The flow cache flush is in current code *after* deleting the bundles
from the policy. Freeing bundles and flushing cache are completely
two separate things in current code. Only in 2/4 the bundle deletion
becomes dependent on flow cache flush.

Please, read xfrm_policy_kill() and xfrm_policy_gc_kill() in current
code, and after applying 1/4. The diff of 1/4 is not as informative
as the bundle deletion code is not shown there (since it's not touched).

The only reason why current code does flow cache flush on policy
removal, is to make sure that it's not the flow cache atomic_dec
to drop last reference; if that happened xfrm_policy_destroy would
not get ever called.

^ permalink raw reply

* Re: [PATCHv3] drivers/net/usb: Add new driver ipheth
From: Oliver Neukum @ 2010-04-04  7:24 UTC (permalink / raw)
  To: L. Alberto Giménez
  Cc: linux-kernel, netdev, linux-usb, linville, j.dumon,
	steve.glendinning, davem, gregkh, dgiagio, dborca
In-Reply-To: <4BB63619.6070607@sysvalve.es>

Am Freitag, 2. April 2010 20:23:21 schrieb L. Alberto Giménez:
> On 03/31/2010 10:33 PM, Oliver Neukum wrote:
> > Am Mittwoch, 31. März 2010 21:42:07 schrieb L. Alberto Giménez:
> 
> Hi Oliver,
> 
> Just like with Ben's comments I still have a couple of doubts about your
> comments.
> 
> 
> >> +
> >> +static int ipheth_open(struct net_device *net)
> >> +{
> >> +	struct ipheth_device *dev = netdev_priv(net);
> >> +	struct usb_device *udev = dev->udev;
> >> +	int retval = 0;
> >> +
> >> +	usb_set_interface(udev, IPHETH_INTFNUM, IPHETH_ALT_INTFNUM);
> >> +	usb_clear_halt(udev, usb_rcvbulkpipe(udev, dev->bulk_in));
> >> +	usb_clear_halt(udev, usb_sndbulkpipe(udev, dev->bulk_out));
> > 
> > Is this really needed? If so, please add a comment.
> 
> I understand that usb_clear_halt is only needed when the device has
> transmitted data, and as it is "open" time, we can assume that no
> transmissions ere made, so we don't need to clear anything (aka: remove
> both lines), am I right?

Clearing a halt is necessary only when a device has stalled due to an
error condition. Unless the device is buggy and produces errors for
no good reason you don't need these lines.
 
> >> +
> >> +	retval = ipheth_carrier_set(dev);
> >> +	if (retval)
> >> +		goto error;
> >> +
> >> +	retval = ipheth_rx_submit(dev, GFP_KERNEL);
> >> +	if (retval)
> >> +		goto error;
> >> +
> >> +	schedule_delayed_work(&dev->carrier_work, IPHETH_CARRIER_CHECK_TIMEOUT);
> > 
> > Does it make sense to start rx while you have no carrier?
> 
> Well, I have no clue about this one. I think that upstream developers
> should take a look into this (Dario, Daniel, could you?) since I don't
> have the knowledge to decide what to do about it.
> 
> But I assume that as with the previous one, we have just opened the
> device and we aren't (yet) doing anything with it, so we shouldn't start rx?

Your code as is is correct, I just wondered whether it could be made more
efficient.

> >> +static void ipheth_disconnect(struct usb_interface *intf)
> >> +{
> >> +	struct ipheth_device *dev;
> >> +
> >> +	dev = usb_get_intfdata(intf);
> >> +	if (dev != NULL) {
> > 
> > is this check needed?
> 
> Does usb_get_infdata always return not NULL? I haven't found anything

It returns what you gave it with  usb_set_intfdata().

> about it (just manual pages for the function, but can't spot if it
> cannot return NULL). We disconnected the device, but I understand that
> the kernel still has the information and the allocated memory, so the
> cleanup code is still needed, isn't it?

It is definitely needed.

> >> +static struct usb_driver ipheth_driver = {
> >> +	.name =		"ipheth",
> >> +	.probe =	ipheth_probe,
> >> +	.disconnect =	ipheth_disconnect,
> >> +	.id_table =	ipheth_table,
> >> +	.supports_autosuspend = 0,
> > 
> > redundant
> 
> Why?

0 is the default.

	Regards
		Oliver

^ permalink raw reply

* Re: linux-next: build warning after merge of the tree
From: Stephen Rothwell @ 2010-04-04  7:26 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-next, linux-kernel, kyle
In-Reply-To: <20100328.185707.161194820.davem@davemloft.net>

[-- Attachment #1: Type: text/plain, Size: 342 bytes --]

Hi Dave,

On Sun, 28 Mar 2010 18:57:07 -0700 (PDT) David Miller <davem@davemloft.net> wrote:
>
> Strange, it didn't warn here.
> 
> I just committed the following, let me know if it fixes
> this.

Yeah, that did it, thanks

-- 
Cheers,
Stephen Rothwell                    sfr@canb.auug.org.au
http://www.canb.auug.org.au/~sfr/

[-- Attachment #2: Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply

* Re: [PATCH net-next-2.6 v4 00/14] l2tp: Introduce L2TPv3 support
From: Eric Dumazet @ 2010-04-04  7:54 UTC (permalink / raw)
  To: David Miller; +Cc: jchapman, netdev
In-Reply-To: <20100403.142522.01040744.davem@davemloft.net>

Le samedi 03 avril 2010 à 14:25 -0700, David Miller a écrit :
> From: James Chapman <jchapman@katalix.com>
> Date: Fri, 02 Apr 2010 17:18:23 +0100
> 
> > This patch series adds L2TPv3 support. It splits the existing pppol2tp
> > driver to separate its L2TP and PPP parts, then adds new L2TPv3
> > functionality. The patches implement a new socket family for L2TPv3 IP
> > encapsulation, expose virtual netdevices for each L2TPv3 ethernet
> > pseudowire and add a netlink interface.
> 
> Ok I'm going to toss this into net-next-2.6, we have time to
> fixup any remaining issues people might discover.
> 

[PATCH net-next-2.6] l2tp: unmanaged L2TPv3 tunnels fixes

Followup to commit 789a4a2c 
(l2tp: Add support for static unmanaged L2TPv3 tunnels)

One missing init in l2tp_tunnel_sock_create() could access random kernel
memory, and a bit field should be unsigned.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 net/l2tp/l2tp_core.c |    2 +-
 net/l2tp/l2tp_core.h |    2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/l2tp/l2tp_core.c b/net/l2tp/l2tp_core.c
index 13ed85b..98dfcce 100644
--- a/net/l2tp/l2tp_core.c
+++ b/net/l2tp/l2tp_core.c
@@ -1227,7 +1227,7 @@ static int l2tp_tunnel_sock_create(u32 tunnel_id, u32 peer_tunnel_id, struct l2t
 	int err = -EINVAL;
 	struct sockaddr_in udp_addr;
 	struct sockaddr_l2tpip ip_addr;
-	struct socket *sock;
+	struct socket *sock = NULL;
 
 	switch (cfg->encap) {
 	case L2TP_ENCAPTYPE_UDP:
diff --git a/net/l2tp/l2tp_core.h b/net/l2tp/l2tp_core.h
index 91b1b9c..f0f318e 100644
--- a/net/l2tp/l2tp_core.h
+++ b/net/l2tp/l2tp_core.h
@@ -152,7 +152,7 @@ struct l2tp_tunnel_cfg {
 	struct in_addr		peer_ip;
 	u16			local_udp_port;
 	u16			peer_udp_port;
-	int			use_udp_checksums:1;
+	unsigned int		use_udp_checksums:1;
 };
 
 struct l2tp_tunnel {



^ permalink raw reply related

* Re: [PATCH net-next-2.6 v4 00/14] l2tp: Introduce L2TPv3 support
From: David Miller @ 2010-04-04  8:02 UTC (permalink / raw)
  To: eric.dumazet; +Cc: jchapman, netdev
In-Reply-To: <1270367668.1971.3.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Sun, 04 Apr 2010 09:54:28 +0200

> [PATCH net-next-2.6] l2tp: unmanaged L2TPv3 tunnels fixes
> 
> Followup to commit 789a4a2c 
> (l2tp: Add support for static unmanaged L2TPv3 tunnels)
> 
> One missing init in l2tp_tunnel_sock_create() could access random kernel
> memory, and a bit field should be unsigned.
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Applied, thanks Eric.

^ permalink raw reply

* Re: [PATCH net-next-2.6 v4 00/14] l2tp: Introduce L2TPv3 support
From: Eric Dumazet @ 2010-04-04  8:14 UTC (permalink / raw)
  To: David Miller; +Cc: jchapman, netdev
In-Reply-To: <20100404.010259.161188935.davem@davemloft.net>

Le dimanche 04 avril 2010 à 01:02 -0700, David Miller a écrit :
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Sun, 04 Apr 2010 09:54:28 +0200
> 
> > [PATCH net-next-2.6] l2tp: unmanaged L2TPv3 tunnels fixes
> > 
> > Followup to commit 789a4a2c 
> > (l2tp: Add support for static unmanaged L2TPv3 tunnels)
> > 
> > One missing init in l2tp_tunnel_sock_create() could access random kernel
> > memory, and a bit field should be unsigned.
> > 
> > Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> 
> Applied, thanks Eric.

I am going to work on net/l2tp/l2tp_core.c, since RCU conversion is
wrong (but original code was wong too...)

Example :

There is no real protection in following code, since no refcount is
taken on session before releasing rcu_read_lock :

static struct l2tp_session *l2tp_session_find_2(struct net *net, u32 session_id)
{
        struct l2tp_net *pn = l2tp_pernet(net);
        struct hlist_head *session_list =
                l2tp_session_id_hash_2(pn, session_id);
        struct l2tp_session *session;
        struct hlist_node *walk;

        rcu_read_lock_bh();
        hlist_for_each_entry_rcu(session, walk, session_list, global_hlist) {
                if (session->session_id == session_id) {
                        rcu_read_unlock_bh();
                        return session;
                }
        }
        rcu_read_unlock_bh();

        return NULL;
}



^ permalink raw reply

* RE: [PATCH] bnx2x: use the dma state API instead of the pci equivalents
From: Vladislav Zolotarov @ 2010-04-04  8:19 UTC (permalink / raw)
  To: FUJITA Tomonori, netdev@vger.kernel.org; +Cc: Eilon Greenstein
In-Reply-To: <20100402115054A.fujita.tomonori@lab.ntt.co.jp>

Why is it preferable? As far as I can see the current patch is not going to introduce any functional change.

Is there a plan to remove pci_map_X()/pci_alloc_consistent() functions family in the future and completely replaced with dma_X() functions? Is it appropriate to use dma_X() in all the places where pci_X() is used? For instance, we do use DAC mode and as far as I understand we should use pci_X() interface in this case. Is this rule not relevant anymore?

So, if we don't need to use pci_X() interface anymore, lets replace pci_X() properly all over the bnx2x with dma_X() functions. And if not, this patch mixes the macros from one API (dma_X) and functions from another (pci_X()) which may hardly be called "preferable"...

Thanks,
vlad

> -----Original Message-----
> From: netdev-owner@vger.kernel.org 
> [mailto:netdev-owner@vger.kernel.org] On Behalf Of FUJITA Tomonori
> Sent: Friday, April 02, 2010 5:57 AM
> To: netdev@vger.kernel.org
> Cc: Eilon Greenstein
> Subject: [PATCH] bnx2x: use the dma state API instead of the 
> pci equivalents
> 
> The DMA API is preferred.
> 
> Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
> ---
>  drivers/net/bnx2x.h      |    4 ++--
>  drivers/net/bnx2x_main.c |   28 ++++++++++++++--------------
>  2 files changed, 16 insertions(+), 16 deletions(-)
> 
> diff --git a/drivers/net/bnx2x.h b/drivers/net/bnx2x.h
> index 3c48a7a..ae9c89e 100644
> --- a/drivers/net/bnx2x.h
> +++ b/drivers/net/bnx2x.h
> @@ -163,7 +163,7 @@ do {					
> 			 \
>  
>  struct sw_rx_bd {
>  	struct sk_buff	*skb;
> -	DECLARE_PCI_UNMAP_ADDR(mapping)
> +	DEFINE_DMA_UNMAP_ADDR(mapping);
>  };
>  
>  struct sw_tx_bd {
> @@ -176,7 +176,7 @@ struct sw_tx_bd {
>  
>  struct sw_rx_page {
>  	struct page	*page;
> -	DECLARE_PCI_UNMAP_ADDR(mapping)
> +	DEFINE_DMA_UNMAP_ADDR(mapping);
>  };
>  
>  union db_prod {
> diff --git a/drivers/net/bnx2x_main.c b/drivers/net/bnx2x_main.c
> index 6c042a7..2a77611 100644
> --- a/drivers/net/bnx2x_main.c
> +++ b/drivers/net/bnx2x_main.c
> @@ -1086,7 +1086,7 @@ static inline void 
> bnx2x_free_rx_sge(struct bnx2x *bp,
>  	if (!page)
>  		return;
>  
> -	pci_unmap_page(bp->pdev, pci_unmap_addr(sw_buf, mapping),
> +	pci_unmap_page(bp->pdev, dma_unmap_addr(sw_buf, mapping),
>  		       SGE_PAGE_SIZE*PAGES_PER_SGE, PCI_DMA_FROMDEVICE);
>  	__free_pages(page, PAGES_PER_SGE_SHIFT);
>  
> @@ -1123,7 +1123,7 @@ static inline int 
> bnx2x_alloc_rx_sge(struct bnx2x *bp,
>  	}
>  
>  	sw_buf->page = page;
> -	pci_unmap_addr_set(sw_buf, mapping, mapping);
> +	dma_unmap_addr_set(sw_buf, mapping, mapping);
>  
>  	sge->addr_hi = cpu_to_le32(U64_HI(mapping));
>  	sge->addr_lo = cpu_to_le32(U64_LO(mapping));
> @@ -1151,7 +1151,7 @@ static inline int 
> bnx2x_alloc_rx_skb(struct bnx2x *bp,
>  	}
>  
>  	rx_buf->skb = skb;
> -	pci_unmap_addr_set(rx_buf, mapping, mapping);
> +	dma_unmap_addr_set(rx_buf, mapping, mapping);
>  
>  	rx_bd->addr_hi = cpu_to_le32(U64_HI(mapping));
>  	rx_bd->addr_lo = cpu_to_le32(U64_LO(mapping));
> @@ -1174,12 +1174,12 @@ static void bnx2x_reuse_rx_skb(struct 
> bnx2x_fastpath *fp,
>  	struct eth_rx_bd *prod_bd = &fp->rx_desc_ring[prod];
>  
>  	pci_dma_sync_single_for_device(bp->pdev,
> -				       
> pci_unmap_addr(cons_rx_buf, mapping),
> +				       
> dma_unmap_addr(cons_rx_buf, mapping),
>  				       RX_COPY_THRESH, 
> PCI_DMA_FROMDEVICE);
>  
>  	prod_rx_buf->skb = cons_rx_buf->skb;
> -	pci_unmap_addr_set(prod_rx_buf, mapping,
> -			   pci_unmap_addr(cons_rx_buf, mapping));
> +	dma_unmap_addr_set(prod_rx_buf, mapping,
> +			   dma_unmap_addr(cons_rx_buf, mapping));
>  	*prod_bd = *cons_bd;
>  }
>  
> @@ -1285,7 +1285,7 @@ static void bnx2x_tpa_start(struct 
> bnx2x_fastpath *fp, u16 queue,
>  	prod_rx_buf->skb = fp->tpa_pool[queue].skb;
>  	mapping = pci_map_single(bp->pdev, 
> fp->tpa_pool[queue].skb->data,
>  				 bp->rx_buf_size, PCI_DMA_FROMDEVICE);
> -	pci_unmap_addr_set(prod_rx_buf, mapping, mapping);
> +	dma_unmap_addr_set(prod_rx_buf, mapping, mapping);
>  
>  	/* move partial skb from cons to pool (don't unmap yet) */
>  	fp->tpa_pool[queue] = *cons_rx_buf;
> @@ -1361,7 +1361,7 @@ static int bnx2x_fill_frag_skb(struct 
> bnx2x *bp, struct bnx2x_fastpath *fp,
>  		}
>  
>  		/* Unmap the page as we r going to pass it to 
> the stack */
> -		pci_unmap_page(bp->pdev, 
> pci_unmap_addr(&old_rx_pg, mapping),
> +		pci_unmap_page(bp->pdev, 
> dma_unmap_addr(&old_rx_pg, mapping),
>  			      SGE_PAGE_SIZE*PAGES_PER_SGE, 
> PCI_DMA_FROMDEVICE);
>  
>  		/* Add one frag and update the appropriate 
> fields in the skb */
> @@ -1389,7 +1389,7 @@ static void bnx2x_tpa_stop(struct bnx2x 
> *bp, struct bnx2x_fastpath *fp,
>  	/* Unmap skb in the pool anyway, as we are going to change
>  	   pool entry status to BNX2X_TPA_STOP even if new skb 
> allocation
>  	   fails. */
> -	pci_unmap_single(bp->pdev, pci_unmap_addr(rx_buf, mapping),
> +	pci_unmap_single(bp->pdev, dma_unmap_addr(rx_buf, mapping),
>  			 bp->rx_buf_size, PCI_DMA_FROMDEVICE);
>  
>  	if (likely(new_skb)) {
> @@ -1621,7 +1621,7 @@ static int bnx2x_rx_int(struct 
> bnx2x_fastpath *fp, int budget)
>  			}
>  
>  			pci_dma_sync_single_for_device(bp->pdev,
> -					pci_unmap_addr(rx_buf, mapping),
> +					dma_unmap_addr(rx_buf, mapping),
>  						       pad + 
> RX_COPY_THRESH,
>  						       
> PCI_DMA_FROMDEVICE);
>  			prefetch(skb);
> @@ -1666,7 +1666,7 @@ static int bnx2x_rx_int(struct 
> bnx2x_fastpath *fp, int budget)
>  			} else
>  			if (likely(bnx2x_alloc_rx_skb(bp, fp, 
> bd_prod) == 0)) {
>  				pci_unmap_single(bp->pdev,
> -					pci_unmap_addr(rx_buf, mapping),
> +					dma_unmap_addr(rx_buf, mapping),
>  						 bp->rx_buf_size,
>  						 PCI_DMA_FROMDEVICE);
>  				skb_reserve(skb, pad);
> @@ -4941,7 +4941,7 @@ static inline void 
> bnx2x_free_tpa_pool(struct bnx2x *bp,
>  
>  		if (fp->tpa_state[i] == BNX2X_TPA_START)
>  			pci_unmap_single(bp->pdev,
> -					 pci_unmap_addr(rx_buf, 
> mapping),
> +					 dma_unmap_addr(rx_buf, 
> mapping),
>  					 bp->rx_buf_size, 
> PCI_DMA_FROMDEVICE);
>  
>  		dev_kfree_skb(skb);
> @@ -4978,7 +4978,7 @@ static void bnx2x_init_rx_rings(struct 
> bnx2x *bp)
>  					fp->disable_tpa = 1;
>  					break;
>  				}
> -				pci_unmap_addr_set((struct sw_rx_bd *)
> +				dma_unmap_addr_set((struct sw_rx_bd *)
>  							
> &bp->fp->tpa_pool[i],
>  						   mapping, 0);
>  				fp->tpa_state[i] = BNX2X_TPA_STOP;
> @@ -6907,7 +6907,7 @@ static void bnx2x_free_rx_skbs(struct bnx2x *bp)
>  				continue;
>  
>  			pci_unmap_single(bp->pdev,
> -					 pci_unmap_addr(rx_buf, 
> mapping),
> +					 dma_unmap_addr(rx_buf, 
> mapping),
>  					 bp->rx_buf_size, 
> PCI_DMA_FROMDEVICE);
>  
>  			rx_buf->skb = NULL;
> -- 
> 1.7.0
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

^ permalink raw reply

* Re: [PATCH 1/4] flow: virtualize flow cache entry methods
From: Herbert Xu @ 2010-04-04  8:35 UTC (permalink / raw)
  To: Timo Teräs; +Cc: netdev
In-Reply-To: <4BB8319B.3040209@iki.fi>

On Sun, Apr 04, 2010 at 09:28:43AM +0300, Timo Teräs wrote:
>
> No. The flow cache flush removal does not prevent bundle deletion.
> The flow cache flush is in current code *after* deleting the bundles
> from the policy. Freeing bundles and flushing cache are completely
> two separate things in current code. Only in 2/4 the bundle deletion
> becomes dependent on flow cache flush.

Ah yes I confused myself.  The problem I thought would occur
doesn't because 1/4 doesn't put the bundles in the flow cache
yet.

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [PATCH] bnx2x: use the dma state API instead of the pci equivalents
From: David Miller @ 2010-04-04  8:39 UTC (permalink / raw)
  To: vladz; +Cc: fujita.tomonori, netdev, eilong
In-Reply-To: <8628FE4E7912BF47A96AE7DD7BAC0AADDDC525ADDB@SJEXCHCCR02.corp.ad.broadcom.com>

From: "Vladislav Zolotarov" <vladz@broadcom.com>
Date: Sun, 4 Apr 2010 01:19:07 -0700

> So, if we don't need to use pci_X() interface anymore, lets replace
> pci_X() properly all over the bnx2x with dma_X() functions.

I think Fujita's plan of gradual and partial transformations is
legitimate, and his changes shouldn't be rejected because he simply
isn't modifying all of the interfaces used by this driver but rather
just a specific subset he is trying to transform across the tree.

Please rescind your objections.

Thanks.

^ permalink raw reply

* Re: [PATCH v2] Add Mergeable RX buffer feature to vhost_net
From: Michael S. Tsirkin @ 2010-04-04  8:55 UTC (permalink / raw)
  To: David Stevens; +Cc: kvm, kvm-owner, netdev, rusty, virtualization
In-Reply-To: <OF9AF90021.62EF0EE4-ON882576F8.00618B3E-882576F8.0064F223@us.ibm.com>

On Thu, Apr 01, 2010 at 11:22:37AM -0700, David Stevens wrote:
> kvm-owner@vger.kernel.org wrote on 04/01/2010 03:54:15 AM:
> 
> > On Wed, Mar 31, 2010 at 03:04:43PM -0700, David Stevens wrote:
> 
> > > 
> > > > > +               head.iov_base = (void 
> *)vhost_get_vq_desc(&net->dev, 
> > > vq,
> > > > > +                       vq->iov, ARRAY_SIZE(vq->iov), &out, &in, 
> NULL, 
> > > 
> > > > > NULL);
> > > > 
> > > > I this casting confusing.
> > > > Is it really expensive to add an array of heads so that
> > > > we do not need to cast?
> > > 
> > >         It needs the heads and the lengths, which looks a lot
> > > like an iovec. I was trying to resist adding a new
> > > struct XXX { unsigned head; unsigned len; } just for this,
> > > but I could make these parallel arrays, one with head index and
> > > the other with length.
> 
>         Michael, on this one, if I add vq->heads as an argument to
> vhost_get_heads (aka vhost_get_desc_n), I'd need the length too.
> Would you rather this 1) remain an iovec (and a single arg added) but
> cast still there, 2) 2 arrays (head and length) and 2 args added, or
> 3) a new struct type of {unsigned,int} to carry for the heads+len
> instead of iovec?
>         My preference would be 1). I agree the casts are ugly, but
> it is essentially an iovec the way we use it; it's just that the
> base isn't a pointer but a descriptor index instead.

I prefer 2 or 3. If you prefer 1 strongly, I think we should
add a detailed comment near the iovec, and
a couple of inline wrappers to store/get data in the iovec.

> > > 
> > >         EAGAIN is not possible after the change, because we don't
> > > even enter the loop unless we have an skb on the read queue; the
> > > other cases bomb out, so I figured the comment for future work is
> > > now done. :-)
> > 
> > Guest could be buggy so we'll get EFAULT.
> > If skb is taken off the rx queue (as below), we might get EAGAIN.
> 
>         We break on any error. If we get EAGAIN because someone read
> on the socket, this code would break the loop, but EAGAIN is a more
> serious problem if it changed since we peeked (because it means
> someone else is reading the socket).
>         But I don't understand -- are you suggesting that the error
> handling be different than that, or that the comment is still
> relevant?
> My intention here is to do the "TODO" from the comment
> so that it can be removed, by handling all error cases. I think
> because of the peek, EAGAIN isn't something to be ignored anymore,
> but the effect is the same whether we break out of the loop or
> not, since we retry the packet next time around. Essentially, we
> ignore every error since we will redo it with the same packet the
> next time around. Maybe we should print something here, but since
> we'll be retrying the packet that's still on the socket, a permanent
> error would spew continuously. Maybe we should shut down entirely
> if we get any negative return value here (including EAGAIN, since
> that tells us someone messed with the socket when we don't want them
> to).
>         If you want the comment still there, ok, but I do think EAGAIN
> isn't a special case per the comment anymore, and is handled as all
> other errors are: by exiting the loop and retrying next time.
> 
>                                                                 +-DLS

Yes, I just think some comment should stay, as you say, because
otherwise we simply retry continuously. Maybe we should trigger vq_err.
It needs to be given some thought which I have not given it yet.

Thinking aloud, EAGAIN means someone reads the socket
together with us, I prefer that this condition is made a fatal
error, we should make sure we are polling the socket
so we see packets if more appear.

-- 
MST

^ permalink raw reply

* RE: [PATCH] bnx2x: use the dma state API instead of the pci equivalents
From: Vladislav Zolotarov @ 2010-04-04  9:15 UTC (permalink / raw)
  To: David Miller
  Cc: fujita.tomonori@lab.ntt.co.jp, netdev@vger.kernel.org,
	Eilon Greenstein
In-Reply-To: <20100404.013920.120463402.davem@davemloft.net>

According to the changes in a PCI-DMA-mapping.txt it sounds like the trend is to stop using the pci_dma_* API and start using the dma_* API instead. Does this mean that using the pci_dma_* API is deprecated?

Thanks,
vlad

> -----Original Message-----
> From: David Miller [mailto:davem@davemloft.net] 
> Sent: Sunday, April 04, 2010 11:39 AM
> To: Vladislav Zolotarov
> Cc: fujita.tomonori@lab.ntt.co.jp; netdev@vger.kernel.org; 
> Eilon Greenstein
> Subject: Re: [PATCH] bnx2x: use the dma state API instead of 
> the pci equivalents
> 
> From: "Vladislav Zolotarov" <vladz@broadcom.com>
> Date: Sun, 4 Apr 2010 01:19:07 -0700
> 
> > So, if we don't need to use pci_X() interface anymore, lets replace
> > pci_X() properly all over the bnx2x with dma_X() functions.
> 
> I think Fujita's plan of gradual and partial transformations is
> legitimate, and his changes shouldn't be rejected because he simply
> isn't modifying all of the interfaces used by this driver but rather
> just a specific subset he is trying to transform across the tree.
> 
> Please rescind your objections.
> 
> Thanks.
> 
> 

^ permalink raw reply

* Re: [BUG] latest net-next-2.6 doesnt fly
From: FUJITA Tomonori @ 2010-04-04  9:16 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, davem, fujita.tomonori
In-Reply-To: <1270202304.1989.14.camel@edumazet-laptop>

On Fri, 02 Apr 2010 11:58:24 +0200
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> diff --git a/net/core/dev.c b/net/core/dev.c
> index e19cdae..c6b5206 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -1801,7 +1801,7 @@ EXPORT_SYMBOL(netdev_rx_csum_fault);
>   * 2. No high memory really exists on this machine.
>   */
>  
> -static inline int illegal_highdma(struct net_device *dev, struct sk_buff *skb)
> +static int illegal_highdma(struct net_device *dev, struct sk_buff *skb)
>  {
>  #ifdef CONFIG_HIGHMEM
>  	int i;
> @@ -1814,6 +1814,8 @@ static inline int illegal_highdma(struct net_device *dev, struct sk_buff *skb)
>  	if (PCI_DMA_BUS_IS_PHYS) {
>  		struct device *pdev = dev->dev.parent;
>  
> +		if (!pdev)
> +			return 0;

Sorry about that and thanks for the fix.

I think, if pdev is null, returning 1 here is safer since the device
doesn't set up dma info properly.

Do you know what device hits this bug? You said that you use bnx2 and
tg3. Both call SET_NETDEV_DEV with pdev->dev. I tested bnx2 and seems
that netdev->dev.parent is set up correctly.

^ permalink raw reply

* Re: [BUG] latest net-next-2.6 doesnt fly
From: Eric Dumazet @ 2010-04-04  9:29 UTC (permalink / raw)
  To: FUJITA Tomonori; +Cc: netdev, davem
In-Reply-To: <20100404181542N.fujita.tomonori@lab.ntt.co.jp>

Le dimanche 04 avril 2010 à 18:16 +0900, FUJITA Tomonori a écrit :
> > +			return 0;
> 
> Sorry about that and thanks for the fix.
> 
> I think, if pdev is null, returning 1 here is safer since the device
> doesn't set up dma info properly.
> 
> Do you know what device hits this bug? You said that you use bnx2 and
> tg3. Both call SET_NETDEV_DEV with pdev->dev. I tested bnx2 and seems
> that netdev->dev.parent is set up correctly.
> --

Might be because of my setup, I suspect I had two reasons to hit the
bug :

A bonding of eth2 (bnx2) and eth3 (tg3)

Then vlans on top of this bond0

When first dev_queue_xmit() was called, it was for a virtual device :)

# ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN 
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP
qlen 1000
    link/ether 00:1e:0b:ec:d3:dc brd ff:ff:ff:ff:ff:ff
3: eth1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq
master bond0 state UP qlen 1000
    link/ether 00:1e:0b:ec:d3:d2 brd ff:ff:ff:ff:ff:ff
4: eth2: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc hfsc
master bond0 state UP qlen 1000
    link/ether 00:1e:0b:ec:d3:d2 brd ff:ff:ff:ff:ff:ff
5: eth3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether 00:1e:0b:92:78:51 brd ff:ff:ff:ff:ff:ff
6: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc
noqueue state UP 
    link/ether 00:1e:0b:ec:d3:d2 brd ff:ff:ff:ff:ff:ff
7: vlan.103@bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500
qdisc pfifo_fast state UP qlen 100
    link/ether 00:1e:0b:ec:d3:d2 brd ff:ff:ff:ff:ff:ff
8: vlan.825@bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500
qdisc pfifo_fast state UP qlen 1000
    link/ether 00:1e:0b:ec:d3:d2 brd ff:ff:ff:ff:ff:ff



^ permalink raw reply

* RE: [PATCH] bnx2x: use the dma state API instead of the pci equivalents
From: FUJITA Tomonori @ 2010-04-04 10:03 UTC (permalink / raw)
  To: vladz; +Cc: davem, fujita.tomonori, netdev, eilong
In-Reply-To: <8628FE4E7912BF47A96AE7DD7BAC0AADDDC525ADDC@SJEXCHCCR02.corp.ad.broadcom.com>

On Sun, 4 Apr 2010 02:15:52 -0700
"Vladislav Zolotarov" <vladz@broadcom.com> wrote:

> According to the changes in a PCI-DMA-mapping.txt it sounds like the
> trend is to stop using the pci_dma_* API and start using the dma_*
> API instead. Does this mean that using the pci_dma_* API is
> deprecated?

Sorry that I didn't put the enough information in the patch
description.

In the long term, I want to remove the pci_dma_* API.

We had the various bus-specific DMA API (pci, sbus, etc). It was the
headache for driver writers that handle multiple bus devices. So we
invented the generic DMA API long ago. Now we have only two bus
specific APIs: pci and ssb so I want to remove them and make sure that
driver writers are always able to use the generic DMA API with any
bus.

http://lwn.net/Articles/374137/

I don't plan to convert the whole tree to use the DMA API over the PCI
DMA API all together. I convert the drivers gradually. I already
removed the PCI DMA API from the docs under Documentation/. It would
be nice if some driver maintainers (or others) convert their drivers.

The patchset convert only the pci state API (such as
DECLARE_PCI_UNMAP_ADDR) because akpm complained of the API and I
promised him to clean up it:

http://lkml.org/lkml/2010/2/12/415

I'll send a patch if you like me to convert bnx2x to use the the DMA
API completely.

^ permalink raw reply

* Re: [BUG] latest net-next-2.6 doesnt fly
From: FUJITA Tomonori @ 2010-04-04 10:19 UTC (permalink / raw)
  To: eric.dumazet; +Cc: fujita.tomonori, netdev, davem
In-Reply-To: <1270373395.1971.13.camel@edumazet-laptop>

On Sun, 04 Apr 2010 11:29:55 +0200
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> > I think, if pdev is null, returning 1 here is safer since the device
> > doesn't set up dma info properly.
> > 
> > Do you know what device hits this bug? You said that you use bnx2 and
> > tg3. Both call SET_NETDEV_DEV with pdev->dev. I tested bnx2 and seems
> > that netdev->dev.parent is set up correctly.
> > --
> 
> Might be because of my setup, I suspect I had two reasons to hit the
> bug :
> 
> A bonding of eth2 (bnx2) and eth3 (tg3)
> 
> Then vlans on top of this bond0
> 
> When first dev_queue_xmit() was called, it was for a virtual device :)

Thanks! So it's due to bond or vlan (or both).

I guess that returning zero here with a null pdev is fine. If we
return 1, probably some people would complain about performance
regression. Like the block layer does, coping the dma restriction info
from the lower devices can solve this problem but I guess that it's
over engineering. My original patch doesn't loosen the DMA restriction
checking so returning zero shouldn't break anything. If when we fix
the usage of NETIF_F_HIGHDMA in each driver and also check the usage
of netdev->dev.parent, everything should be fine.

^ permalink raw reply

* RE: [PATCH] bnx2x: use the dma state API instead of the pci equivalents
From: Vladislav Zolotarov @ 2010-04-04 10:24 UTC (permalink / raw)
  To: FUJITA Tomonori
  Cc: davem@davemloft.net, netdev@vger.kernel.org, Eilon Greenstein
In-Reply-To: <20100404190116N.fujita.tomonori@lab.ntt.co.jp>

Ok. Got it now. Thanks, Fujita. I think we should patch the bnx2x to use the generic model (not just the mapping macros).

One last question: since which kernel version the generic DMA layer may be used instead of PCI DMA layer?

Thanks,
vlad

> -----Original Message-----
> From: netdev-owner@vger.kernel.org 
> [mailto:netdev-owner@vger.kernel.org] On Behalf Of FUJITA Tomonori
> Sent: Sunday, April 04, 2010 1:03 PM
> To: Vladislav Zolotarov
> Cc: davem@davemloft.net; fujita.tomonori@lab.ntt.co.jp; 
> netdev@vger.kernel.org; Eilon Greenstein
> Subject: RE: [PATCH] bnx2x: use the dma state API instead of 
> the pci equivalents
> 
> On Sun, 4 Apr 2010 02:15:52 -0700
> "Vladislav Zolotarov" <vladz@broadcom.com> wrote:
> 
> > According to the changes in a PCI-DMA-mapping.txt it sounds like the
> > trend is to stop using the pci_dma_* API and start using the dma_*
> > API instead. Does this mean that using the pci_dma_* API is
> > deprecated?
> 
> Sorry that I didn't put the enough information in the patch
> description.
> 
> In the long term, I want to remove the pci_dma_* API.
> 
> We had the various bus-specific DMA API (pci, sbus, etc). It was the
> headache for driver writers that handle multiple bus devices. So we
> invented the generic DMA API long ago. Now we have only two bus
> specific APIs: pci and ssb so I want to remove them and make sure that
> driver writers are always able to use the generic DMA API with any
> bus.
> 
> http://lwn.net/Articles/374137/
> 
> 
> I don't plan to convert the whole tree to use the DMA API over the PCI
> DMA API all together. I convert the drivers gradually. I already
> removed the PCI DMA API from the docs under Documentation/. It would
> be nice if some driver maintainers (or others) convert their drivers.
> 
> The patchset convert only the pci state API (such as
> DECLARE_PCI_UNMAP_ADDR) because akpm complained of the API and I
> promised him to clean up it:
> 
> http://lkml.org/lkml/2010/2/12/415
> 
> 
> I'll send a patch if you like me to convert bnx2x to use the the DMA
> API completely.
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

^ permalink raw reply

* Re: UDP path MTU discovery
From: Andi Kleen @ 2010-04-04 10:25 UTC (permalink / raw)
  To: Glen Turner; +Cc: Andi Kleen, Rick Jones, netdev
In-Reply-To: <1270186918.2119.27.camel@ilion>

On Fri, Apr 02, 2010 at 04:11:58PM +1030, Glen Turner wrote:
> On Thu, 2010-04-01 at 02:55 +0200, Andi Kleen wrote:
> > > What we need is an API for an instant notification that a ICMP Packet
> > > Too Big message has arrived concerning the socket.
> > 
> > That already exists of course: IP_RECVERR
> 
> Hi Andi,
> 
> So what should I code?  The suggested EMSGSIZE or your suggestion
> of grabbing all returning ICMP and parsing it?  Noting that the

You don't need to parse any ICMPs, the kernel does that for you. 
See the documentation of IP_RECVERR in ip(7). The MTU is in ee_info

First you need to enable path mtu discovery for the socket
using IP_MTU_DISCOVER.

So you can either keep track of the MTU yourself based on extended
errors coming out of IP_RECVERR, or ask the kernel using IP_MTU when
the socket is connected or simply lower when you see a EMSGSIZE. It's also 
possible to do this with a dummy socket that gets connected/unconnected too.

> second choice is pretty ugly.  That both seem specific to Linux is
> frustrating, but that is life -- adding support for an operating
> system seems to inevitably add #ifdefs for this sort of code.

Well when the other OS see the need they will hopefully add similar
interfaces, with some luck even compatible to the ones in Linux.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply

* Re: [PATCH 1/4] flow: virtualize flow cache entry methods
From: Herbert Xu @ 2010-04-04 10:42 UTC (permalink / raw)
  To: Timo Teras; +Cc: netdev
In-Reply-To: <1270126340-30181-2-git-send-email-timo.teras@iki.fi>

On Thu, Apr 01, 2010 at 03:52:17PM +0300, Timo Teras wrote:
>
> -extern void *flow_cache_lookup(struct net *net, struct flowi *key, u16 family,
> -			       u8 dir, flow_resolve_t resolver);
> +struct flow_cache_entry_ops {
> +	struct flow_cache_entry_ops ** (*get)(struct flow_cache_entry_ops **);
> +	int (*check)(struct flow_cache_entry_ops **);
> +	void (*delete)(struct flow_cache_entry_ops **);
> +};
> +
> +typedef struct flow_cache_entry_ops **(*flow_resolve_t)(
> +		struct net *net, struct flowi *key, u16 family,
> +		u8 dir, struct flow_cache_entry_ops **old_ops, void *ctx);

OK this bit really bugs me.

When I first looked at it, my reaction was why on earth are we
returning an ops pointer? Only after some digging around do I see
the fact that this ops pointer is in fact embedded in xfrm_policy.

How about embedding flow_cache_entry in xfrm_policy instead? Returning
flow_cache_entry * would make a lot more sense than a nested pointer
to flow_cache_entry_ops.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [PATCH 1/4] flow: virtualize flow cache entry methods
From: Timo Teräs @ 2010-04-04 10:50 UTC (permalink / raw)
  To: Herbert Xu; +Cc: netdev
In-Reply-To: <20100404104230.GA10368@gondor.apana.org.au>

Herbert Xu wrote:
> On Thu, Apr 01, 2010 at 03:52:17PM +0300, Timo Teras wrote:
>> -extern void *flow_cache_lookup(struct net *net, struct flowi *key, u16 family,
>> -			       u8 dir, flow_resolve_t resolver);
>> +struct flow_cache_entry_ops {
>> +	struct flow_cache_entry_ops ** (*get)(struct flow_cache_entry_ops **);
>> +	int (*check)(struct flow_cache_entry_ops **);
>> +	void (*delete)(struct flow_cache_entry_ops **);
>> +};
>> +
>> +typedef struct flow_cache_entry_ops **(*flow_resolve_t)(
>> +		struct net *net, struct flowi *key, u16 family,
>> +		u8 dir, struct flow_cache_entry_ops **old_ops, void *ctx);
> 
> OK this bit really bugs me.
> 
> When I first looked at it, my reaction was why on earth are we
> returning an ops pointer? Only after some digging around do I see
> the fact that this ops pointer is in fact embedded in xfrm_policy.
> 
> How about embedding flow_cache_entry in xfrm_policy instead? Returning
> flow_cache_entry * would make a lot more sense than a nested pointer
> to flow_cache_entry_ops.

Because flow_cache_entry is per-cpu, and multiple entries (due to
different flows matching same policies, or same flow having multiple
per-cpu entries) can point to same policy. If we cached "dummy" objects
for even policies, then this would be better approach.

This would make actually sense, since it'd be useful to cache all
policies involved in check path (main + sub policy refs). In which
case we might want to make the ops 'per flow cache instance' instead
of 'per cache entry'.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox