Netdev List


* Re: GSO and IPv4 forwarding
From: Eric Dumazet @ 2010-10-26 20:15 UTC (permalink / raw)
  To: Kevin Wilson; +Cc: netdev
In-Reply-To: <AANLkTikxNYGFAmyZDit9fTp=-M8wfYp3Ew8rejsjHWHb@mail.gmail.com>

On Tuesday 26 October 2010 at 22:02 +0200, Kevin Wilson wrote:
> Hi,
> When we set a netdevice to support forwarding, we disable LRO.
> This is done because we don't want to forward an SKB which has been
> processed by LRO.
> 
> This is done in inet_forward_change() in net/ipv4/devinet.c:
> We call dev_disable_lro(dev) in this method, when setting forwarding
> for the device.
> As a result, in ip_forward(), the packet will be dropped. (because
> skb_warn_if_lro(), called by this method,  returns TRUE)
> 
> My question is:
> dev_disable_lro(dev) disables the LRO feature (NETIF_F_LRO) of the
> device. But suppose I have a device where GRO is enabled (and LRO is
> not), and let's say I set forwarding on this device.
> 
> It seems to me that in such a case, calling dev_disable_lro(dev) in
> inet_forward_change() to disable the LRO feature of the device (which
> is already disabled) is not enough, and GRO packets that are to be
> forwarded will **not** be dropped in ip_forward()
> (since skb_warn_if_lro() will return false in this case).
> 
> Is that so? I am ready to send a patch fixing it, but I am a newbie in
> the kernel, so I want to ask first.

GRO packets can be forwarded just fine.





* Re: GSO and IPv4 forwarding
From: David Miller @ 2010-10-26 20:19 UTC (permalink / raw)
  To: wkevils; +Cc: netdev
In-Reply-To: <AANLkTikxNYGFAmyZDit9fTp=-M8wfYp3Ew8rejsjHWHb@mail.gmail.com>

From: Kevin Wilson <wkevils@gmail.com>
Date: Tue, 26 Oct 2010 22:02:26 +0200

> My question is:
> dev_disable_lro(dev) disables the LRO feature (NETIF_F_LRO) of the
> device. But suppose I have a device where GRO is enabled (and LRO is
> not). And let's say I set forwarding on this device.

GRO is completely different from LRO, and can remain enabled
when forwarding is turned on.


* Re: Unwanted aliasing of UDP checksum failed error counter
From: Jeremy Jackson @ 2010-10-26 20:21 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Jeremy Jackson, netdev
In-Reply-To: <1288124033.2652.24.camel@edumazet-laptop>

> On Tuesday 26 October 2010 at 15:53 -0400, Jeremy Jackson wrote:
>> Trying to find the source of packet loss on an 8-node compute cluster, we
>> find:
>> (not in this example, but on the real cluster)
>>
>> in /proc/net/snmp
>> Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors
>> Udp: 976460 1750 0 986795 0 0
>>
>> InErrors *and* RcvbufErrors both go up when the socket buffer is full, which
>> has made troubleshooting our application more difficult.  We were chasing
>> UDP checksum problems until we checked the Linux source code and found
>> the aliasing.
>>
>> Is this done for assembly code efficiency?  Any reason ENOMEM (i.e. socket
>> buffer full) can't avoid aliasing to UDP checksum failed errors?
>>
>> in linux-source-2.6.32/net/ipv4/udp.c:__udp_queue_rcv_skb()
>> ....
>>                 /* Note that an ENOMEM error is charged twice */
>>                 if (rc == -ENOMEM) {
>>                         UDP_INC_STATS_BH(sock_net(sk),
>>                                          UDP_MIB_RCVBUFERRORS,
>>                                          is_udplite);
>>                         atomic_inc(&sk->sk_drops);
>>                 }
>>                 goto drop;
>> ...
>> drop:
>>         UDP_INC_STATS_BH(sock_net(sk), UDP_MIB_INERRORS, is_udplite);
>>
>
> In MIBS, there is no counter for UDP checksum errors
>
> So we use the standard UDP_MIB_INERRORS

Yes, this part I understand, but what I don't understand is why ENOMEM
errors *and* checksum errors both use the same counter, while ENOMEM already
has its own.

> udpInErrors OBJECT-TYPE
>     SYNTAX     Counter32
>     MAX-ACCESS read-only
>     STATUS     current
>     DESCRIPTION
>            "The number of received UDP datagrams that could not be
>             delivered for reasons other than the lack of an
>             application at the destination port.
>
>
> We could add a LINUX specific MIB entry, eventually...
>
>
>
>




* Re: GSO and IPv4 forwarding
From: Kevin Wilson @ 2010-10-26 20:26 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev
In-Reply-To: <1288124114.2652.25.camel@edumazet-laptop>

Hi,
Thanks a lot for your quick answer, I appreciate it (and did not
expect it to be so quick!)

Can someone please explain in 2-3 short sentences why GRO can be
forwarded and LRO cannot be forwarded?
rgs,
Kevin

On Tue, Oct 26, 2010 at 10:15 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Tuesday 26 October 2010 at 22:02 +0200, Kevin Wilson wrote:
>> Hi,
>> When we set a netdevice to support forwarding, we disable LRO.
>> This is done because we don't want to forward an SKB which has been
>> processed by LRO.
>>
>> This is done in inet_forward_change() in net/ipv4/devinet.c:
>> We call dev_disable_lro(dev) in this method, when setting forwarding
>> for the device.
>> As a result, in ip_forward(), the packet will be dropped. (because
>> skb_warn_if_lro(), called by this method,  returns TRUE)
>>
>> My question is:
>> dev_disable_lro(dev) disables the LRO feature (NETIF_F_LRO) of the
>> device. But suppose I have a device where GRO is enabled (and LRO is
>> not), and let's say I set forwarding on this device.
>>
>> It seems to me that in such a case, calling dev_disable_lro(dev) in
>> inet_forward_change() to disable the LRO feature of the device (which
>> is already disabled) is not enough, and GRO packets that are to be
>> forwarded will **not** be dropped in ip_forward()
>> (since skb_warn_if_lro() will return false in this case).
>>
>> Is that so? I am ready to send a patch fixing it, but I am a newbie in
>> the kernel, so I want to ask first.
>
> GRO packets can be forwarded just fine.
>
>
>
>


* Re: Unwanted aliasing of UDP checksum failed error counter
From: Eric Dumazet @ 2010-10-26 20:27 UTC (permalink / raw)
  To: Jeremy Jackson; +Cc: netdev
In-Reply-To: <a2ae895d871551a9b3ded9ce3874fe28.squirrel@imap.coplanar.net>

On Tuesday 26 October 2010 at 16:21 -0400, Jeremy Jackson wrote:
> > On Tuesday 26 October 2010 at 15:53 -0400, Jeremy Jackson wrote:
> >> Trying to find the source of packet loss on an 8-node compute cluster, we
> >> find:
> >> (not in this example, but on the real cluster)
> >>
> >> in /proc/net/snmp
> >> Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors
> >> Udp: 976460 1750 0 986795 0 0
> >>
> >> InErrors *and* RcvbufErrors both go up when the socket buffer is full, which
> >> has made troubleshooting our application more difficult.  We were chasing
> >> UDP checksum problems until we checked the Linux source code and found
> >> the aliasing.
> >>
> >> Is this done for assembly code efficiency?  Any reason ENOMEM (i.e. socket
> >> buffer full) can't avoid aliasing to UDP checksum failed errors?
> >>
> >> in linux-source-2.6.32/net/ipv4/udp.c:__udp_queue_rcv_skb()
> >> ....
> >>                 /* Note that an ENOMEM error is charged twice */
> >>                 if (rc == -ENOMEM) {
> >>                         UDP_INC_STATS_BH(sock_net(sk),
> >>                                          UDP_MIB_RCVBUFERRORS,
> >>                                          is_udplite);
> >>                         atomic_inc(&sk->sk_drops);
> >>                 }
> >>                 goto drop;
> >> ...
> >> drop:
> >>         UDP_INC_STATS_BH(sock_net(sk), UDP_MIB_INERRORS, is_udplite);
> >>
> >
> > In MIBS, there is no counter for UDP checksum errors
> >
> > So we use the standard UDP_MIB_INERRORS
> 
> Yes, this part I understand, but what I don't understand is why ENOMEM
> errors *and* checksum errors both use the same counter, while ENOMEM already
> has its own.

Because ENOMEM errors got their own counters in commit 81aa646c
("[IPV4]: add the UdpSndbufErrors and UdpRcvbufErrors MIBs"),
but _all_ errors must also be accounted in INERRORS to be RFC
compliant.


If we add a new MIB counter for checksum errors, a bad checksum packet
will increment both this new counter and INERRORS.
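
The double accounting described above can be modelled in a few lines of
userspace C. This is only a sketch: the struct, the enum and both function
names are made up for illustration and are not kernel symbols. It shows why
a full receive buffer bumps two counters, and how the "other" errors can be
recovered by subtraction when reading the Udp: line.

```c
/* Hypothetical drop reasons for this model; not kernel definitions. */
enum drop_reason {
	DROP_NONE,        /* datagram was queued successfully */
	DROP_RCVBUF_FULL, /* socket receive buffer full (ENOMEM in the kernel) */
	DROP_BAD_CSUM     /* UDP checksum did not verify */
};

/* Hypothetical mirror of the two counters discussed in this thread. */
struct udp_mib_model {
	unsigned long in_errors;     /* models InErrors */
	unsigned long rcvbuf_errors; /* models RcvbufErrors */
};

/* Account one receive result the way the thread describes it. */
void account_rx(struct udp_mib_model *mib, enum drop_reason why)
{
	if (why == DROP_NONE)
		return;
	if (why == DROP_RCVBUF_FULL)
		mib->rcvbuf_errors++;  /* specific counter */
	mib->in_errors++;              /* every drop also lands here */
}

/* Errors that were not caused by a full receive buffer. */
unsigned long in_errors_excluding_rcvbuf(const struct udp_mib_model *mib)
{
	return mib->in_errors - mib->rcvbuf_errors;
}
```

With such a model, a troubleshooting script would compare InErrors minus
RcvbufErrors against zero before suspecting checksum problems.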





* Re: netlink stats: Ability to get stats for a single device?
From: Ben Greear @ 2010-10-26 20:29 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev
In-Reply-To: <1288122986.2652.20.camel@edumazet-laptop>

On 10/26/2010 12:56 PM, Eric Dumazet wrote:
> On Tuesday 26 October 2010 at 12:38 -0700, David Miller wrote:
>> From: Ben Greear<greearb@candelatech.com>
>> Date: Tue, 26 Oct 2010 12:31:12 -0700
>>
>>> Am I missing something, or do I just need to write up a patch
>>> to have netlink pay attention to the ifindex?
>>
>> Setting the ->ifi_index or IFLA_IFNAME attribute values appropriately
>> in the getlink request doesn't work?
>>
>> That should give you back, amongst other things, the rtnl_link_stats
>> for the device in the netlink response.
>> --
>
> Yep, it should be easy to change iproute2 to not ask for a full dump
> in ip/ipaddress.c:
>
> if (rtnl_wilddump_request(&rth, preferred_family, RTM_GETLINK) < 0) ...
>
> and instead use rtnl_send() or something like that, if the user provided one
> specific interface name (or index)
>
> ip link show dev eth0

I'm trying to craft my own netlink message...basically:

    memset(&snl, 0, sizeof(snl));
    snl.nl_family = AF_NETLINK;
    snl.nl_pid    = 0;
    snl.nl_groups = 0;

    memset(&buffer, 0, sizeof(buffer));
    nlh->nlmsg_type = msg_type;
    nlh->nlmsg_flags = NLM_F_MATCH|NLM_F_REQUEST;
    static unsigned int nl_seqno = 1;
    nlh->nlmsg_seq = nl_seqno++;
    nlh->nlmsg_pid = nl_pid;



       nlh->nlmsg_len = NLMSG_LENGTH(sizeof(*ifinfomsg));
       ifinfomsg = (struct ifinfomsg*)(NLMSG_DATA(nlh));
       ifinfomsg->ifi_family = AF_UNSPEC;
       ifinfomsg->ifi_type = IFLA_UNSPEC;
       ifinfomsg->ifi_index = if_index;
       ifinfomsg->ifi_flags = 0;
       ifinfomsg->ifi_change = 0xffffffff;


It's possible that I'm somehow messing this up. But, looking at the
static int rtnl_dump_ifinfo(struct sk_buff *skb, struct netlink_callback *cb)
method, I cannot see how it would bail out properly after a single dev
has been processed, either.

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



* Re: netlink stats: Ability to get stats for a single device?
From: Eric Dumazet @ 2010-10-26 20:37 UTC (permalink / raw)
  To: Ben Greear; +Cc: David Miller, netdev
In-Reply-To: <4CC73A1E.5050605@candelatech.com>

On Tuesday 26 October 2010 at 13:29 -0700, Ben Greear wrote:

> I'm trying to craft my own netlink message...basically:
> 
>     memset(&snl, 0, sizeof(snl));
>     snl.nl_family = AF_NETLINK;
>     snl.nl_pid    = 0;
>     snl.nl_groups = 0;
> 
>     memset(&buffer, 0, sizeof(buffer));
>     nlh->nlmsg_type = msg_type;
>     nlh->nlmsg_flags = NLM_F_MATCH|NLM_F_REQUEST;

Don't use NLM_F_MATCH: check net/core/rtnetlink.c

vi +1660 net/core/rtnetlink.c

You _don't_ want to call 'dumpit', so don't use any bit present in
NLM_F_DUMP at all:

        if (kind == 2 && nlh->nlmsg_flags&NLM_F_DUMP) {
                struct sock *rtnl;
                rtnl_dumpit_func dumpit;

                dumpit = rtnl_get_dumpit(family, type);
                if (dumpit == NULL)
                        return -EOPNOTSUPP;

                __rtnl_unlock();
                rtnl = net->rtnl;
                err = netlink_dump_start(rtnl, skb, nlh, dumpit, NULL);
                rtnl_lock();
                return err;
        }

You want instead to call the 'doit' handler  (one device only)

        doit = rtnl_get_doit(family, type);
        if (doit == NULL)
                return -EOPNOTSUPP;

        return doit(skb, nlh, (void *)&rta_buf[0]);
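
On the userspace side, a request that reaches 'doit' rather than 'dumpit'
might be built roughly as below. This is a minimal sketch, not iproute2
code: struct getlink_req and build_getlink_req() are hypothetical names,
while the constants and struct fields come from the Linux UAPI headers.
The only point it makes is that nlmsg_flags carries NLM_F_REQUEST alone,
with neither of the NLM_F_DUMP bits (NLM_F_ROOT, NLM_F_MATCH) set.

```c
#include <string.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>

/* Hypothetical request layout: netlink header followed by ifinfomsg. */
struct getlink_req {
	struct nlmsghdr  nlh;
	struct ifinfomsg ifi;
};

/* Fill in an RTM_GETLINK request for exactly one interface index. */
void build_getlink_req(struct getlink_req *req, int if_index,
		       unsigned int seq)
{
	memset(req, 0, sizeof(*req));

	req->nlh.nlmsg_len   = NLMSG_LENGTH(sizeof(struct ifinfomsg));
	req->nlh.nlmsg_type  = RTM_GETLINK;
	req->nlh.nlmsg_flags = NLM_F_REQUEST; /* no NLM_F_ROOT, no NLM_F_MATCH */
	req->nlh.nlmsg_seq   = seq;

	req->ifi.ifi_family = AF_UNSPEC;
	req->ifi.ifi_index  = if_index;       /* the single device wanted */
}
```

The filled structure would then be sent on a socket opened with
socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE), and a single RTM_NEWLINK
reply is expected in return.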





* Re: GSO and IPv4 forwarding
From: Eric Dumazet @ 2010-10-26 20:43 UTC (permalink / raw)
  To: Kevin Wilson; +Cc: netdev
In-Reply-To: <AANLkTik+s4CHgPuY6uhe-0aD=WJs=UHhoEdok4AyTdYJ@mail.gmail.com>

On Tuesday 26 October 2010 at 22:26 +0200, Kevin Wilson wrote:
> Hi,
> Thanks a lot for your quick answer, I appreciate it (and did not
> expect it to be so quick!)
> 
> Can someone please explain in 2-3 short sentences Why GRO can be
> forwarded and LRO cannot be forwarded ?

Well, GRO is a pure software thing, completely handled in the Linux stack,
not driver specific. Its design included forwarding ability; LRO's did
not.

http://lwn.net/Articles/311357/





* Re: netlink stats: Ability to get stats for a single device?
From: Ben Greear @ 2010-10-26 20:43 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev
In-Reply-To: <1288125432.2652.39.camel@edumazet-laptop>

On 10/26/2010 01:37 PM, Eric Dumazet wrote:
> On Tuesday 26 October 2010 at 13:29 -0700, Ben Greear wrote:
>
>> I'm trying to craft my own netlink message...basically:
>>
>>      memset(&snl, 0, sizeof(snl));
>>      snl.nl_family = AF_NETLINK;
>>      snl.nl_pid    = 0;
>>      snl.nl_groups = 0;
>>
>>      memset(&buffer, 0, sizeof(buffer));
>>      nlh->nlmsg_type = msg_type;
>>      nlh->nlmsg_flags = NLM_F_MATCH|NLM_F_REQUEST;
>
> Don't use NLM_F_MATCH: check net/core/rtnetlink.c
>
> vi +1660 net/core/rtnetlink.c
>
> You _don't_ want to call 'dumpit', so don't use any bit present in
> NLM_F_DUMP at all:

That was exactly my problem.  It works as expected with that NLM_F_MATCH
removed.

Thanks!
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com



* Re: netns patches WAS( Re: [PATCH 8/8] net: Implement socketat.
From: jamal @ 2010-10-26 20:52 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Daniel Lezcano, Pavel Emelyanov, linux-kernel, Linux Containers,
	netdev, netfilter-devel, linux-fsdevel, Linus Torvalds,
	Michael Kerrisk, Ulrich Drepper, Al Viro, David Miller,
	Serge E. Hallyn, Pavel Emelyanov, Ben Greear, Matt Helsley,
	Jonathan Corbet, Sukadev Bhattiprolu, Jan Engelhardt,
	Patrick McHardy
In-Reply-To: <1287145855.3642.30.camel@bigi>

Eric,

Ping?
If you are too busy to push these in, maybe have
someone clueful like Daniel help out with submitting? I think it
would be reasonable to leave out the socketat
patch initially if it is deemed controversial.

cheers,
jamal

On Fri, 2010-10-15 at 08:30 -0400, jamal wrote:
> Eric et al,
> 
> Did these patches make it in? I was looking at
> two Davem net trees and I don't see them.
> 
> cheers,
> jamal
> 




* Re: GSO and IPv4 forwarding
From: Stephen Hemminger @ 2010-10-26 20:52 UTC (permalink / raw)
  To: Kevin Wilson; +Cc: Eric Dumazet, netdev
In-Reply-To: <AANLkTik+s4CHgPuY6uhe-0aD=WJs=UHhoEdok4AyTdYJ@mail.gmail.com>

On Tue, 26 Oct 2010 22:26:07 +0200
Kevin Wilson <wkevils@gmail.com> wrote:

> Hi,
> Thanks a lot for your quick answer, I appreciate it (and did not
> expect it to be so quick!)
> 
> Can someone please explain in 2-3 short sentences Why GRO can be
> forwarded and LRO cannot be forwarded ?
> rgs,
> Kevin

LRO merges packets together, creating one large skb.  This is a layering
violation for forwarding or bridging (it violates the end-to-end principle).

GRO maintains the headers of each packet and passes them as
a cluster.
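
As far as I recall, skb_warn_if_lro() tells the two apart by looking at the
GSO metadata: an LRO-merged skb has a segment size but no segmentation type,
while a GRO-built skb has both. The userspace model below sketches that
test. It is an assumption about the heuristic, not a copy of the kernel
helper, and struct pkt_model and looks_like_lro() are hypothetical names.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the few sk_buff fields the check looks at. */
struct pkt_model {
	bool           nonlinear; /* payload kept in fragments */
	unsigned short gso_size;  /* per-segment payload size, 0 if not merged */
	unsigned int   gso_type;  /* segmentation type, 0 if never filled in */
};

/*
 * Returns true when the packet looks LRO-merged: it carries a segment
 * size, but nothing recorded how to split it again, so it cannot be
 * resegmented for forwarding.
 */
bool looks_like_lro(const struct pkt_model *p)
{
	return p->nonlinear && p->gso_size != 0 && p->gso_type == 0;
}
```

In this model a GRO packet (gso_type set) passes the check and can be
forwarded, which matches the answers given in this thread.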


* Re: [RFC PATCH 5/9] ipvs network name space aware
From: Simon Horman @ 2010-10-26 21:03 UTC (permalink / raw)
  To: Hans Schillstrom
  Cc: lvs-devel@vger.kernel.org, netdev@vger.kernel.org,
	netfilter-devel@vger.kernel.org, ja@ssi.bg, wensong@linux-vs.org,
	daniel.lezcano@free.fr
In-Reply-To: <201010261507.39734.hans.schillstrom@ericsson.com>

On Tue, Oct 26, 2010 at 03:07:38PM +0200, Hans Schillstrom wrote:
> On Friday 22 October 2010 21:05:48 Simon Horman wrote:
> > On Fri, Oct 08, 2010 at 01:17:02PM +0200, Hans Schillstrom wrote:
> > > This patch just contains ip_vs_ctl
> > >
> > > Signed-off-by:Hans Schillstrom <hans.schillstrom@ericsson.com>
> > >
> > > diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
> > > index ca8ec8c..7e99cbc 100644
> > > --- a/net/netfilter/ipvs/ip_vs_ctl.c
> > > +++ b/net/netfilter/ipvs/ip_vs_ctl.c
> >
> > [ snip ]
> > Hi Hans,
> >
> > is there a reason that the order of some of the entries in
> > vs_vars has been switched around?
> >
> Yes there is, when vars will be copied to its own NS it's a lot easier
> when they are in sequence and without a potential insert in the middle
> the #if 0 ...
> 
> have a look at __ip_vs_control_init(struct net *net)

Ok, I suspected something like that.
In that case I think there may be a problem with
the way that I handled sysctl_ip_vs_conntrack.


* Re: [net-next PATCH 1/3] qlge: Restoring the vlan setting.
From: Ron Mercer @ 2010-10-26 20:54 UTC (permalink / raw)
  To: Jesse Gross
  Cc: David Miller, netdev@vger.kernel.org, Jitendra Kalsaria,
	Ying Ping Lok
In-Reply-To: <AANLkTinr5fHjgPNieYjJuegU2EKWiQG00rAeatZV_=3w@mail.gmail.com>

On Mon, Oct 25, 2010 at 05:56:57PM -0700, Jesse Gross wrote:
> 
> Using vlan groups within a driver is now deprecated.  I realize that
> this is just a bug fix but it would be nice if we can avoid introducing
> more code around vlan groups.  Of course, fully switching the driver
> over to use the new vlan model would be even nicer.

I would like this bug fix to be applied though we will schedule switching
to the new vlan model ASAP.




* Re: [PATCH] gianfar: Fix crashes on RX path
From: Jarek Poplawski @ 2010-10-26 21:20 UTC (permalink / raw)
  To: David Miller
  Cc: eric.dumazet, eminak71, akpm, netdev, bugzilla-daemon,
	bugme-daemon, avorontsov, afleming
In-Reply-To: <20101026.104257.245396217.davem@davemloft.net>

On Tue, Oct 26, 2010 at 10:42:57AM -0700, David Miller wrote:
> From: Jarek Poplawski <jarkao2@gmail.com>
> Date: Fri, 22 Oct 2010 08:52:48 +0000
> 
> > On Fri, Oct 22, 2010 at 06:52:31AM +0000, Jarek Poplawski wrote:
> >> On Fri, Oct 22, 2010 at 08:11:57AM +0200, Eric Dumazet wrote:
> > ...
> >> > Gianfar claims to be multiqueue, but only one cpu can run gfar_poll()
> >> > and call gfar_clean_tx_ring() / gfar_clean_rx_ring()
> >> > 
> >> > If not, there would be more bugs than only rx_recycle thing
> >> 
> >> I didn't find what prevents running gfar_poll on many cpus and don't
> >> claim there are no more bugs around.
> > 
> > On the other hand, I don't see your point in the code below either.
> > These're only per gfargrp queues - not per device, aren't they?
> 
> I am still not at the point where I feel comfortable applying this bug
> fix, in fact I am very far from that.
> 
> None of the logic is consistent in what we are saying causes the
> problem.
> 
> Anything that would make the RX recycling code racy and corrupt
> recycling queue of the gianfar driver, would also corrupt all of the
> other RX side and other driver state.
> 
> The NAPI state is unary for gianfar, and inside of that singular
> ->poll() instance it iterates over the queues.

IMHO, the NAPI state is unary only for gfargrp, and multiple ->poll()
instances share a (unary) rx_recycle queue without proper locking.

Thanks,
Jarek P.


* Re: [net-next PATCH 1/3] qlge: Restoring the vlan setting.
From: David Miller @ 2010-10-26 21:21 UTC (permalink / raw)
  To: ron.mercer; +Cc: jesse, netdev, jitendra.kalsaria, ying.lok
In-Reply-To: <20101026205422.GC30008@linux-ox1b.qlogic.org>

From: Ron Mercer <ron.mercer@qlogic.com>
Date: Tue, 26 Oct 2010 13:54:22 -0700

> On Mon, Oct 25, 2010 at 05:56:57PM -0700, Jesse Gross wrote:
>> 
>> Using vlan groups within a driver is now deprecated.  I realize that
>> this is just a bug fix but it would be nice if we can avoid introducing
>> more code around vlan groups.  Of course, fully switching the driver
>> over to use the new vlan model would be even nicer.
> 
> I would like this bug fix to be applied though we will schedule switching
> to the new vlan model ASAP.

Then why did you put "net-next-2.6" in the subject lines?


* Re: [PATCH] gianfar: Fix crashes on RX path
From: David Miller @ 2010-10-26 21:23 UTC (permalink / raw)
  To: jarkao2
  Cc: eric.dumazet, eminak71, akpm, netdev, bugzilla-daemon,
	bugme-daemon, avorontsov, afleming
In-Reply-To: <20101026212042.GA1888@del.dom.local>

From: Jarek Poplawski <jarkao2@gmail.com>
Date: Tue, 26 Oct 2010 23:20:42 +0200

> IMHO, the NAPI state is unary only for gfargrp, and multiple ->poll()
> instances share a (unary) rx_recycle queue without proper locking.

Ok, I see it now, thank you.

For now I'll apply your patch, but long term I think we should just
eradicate the recycling code from the entire tree.


* Re: [RFC][net-next-2.6 PATCH 3/4] ethtool: set hard_header_len using ETH_FLAG_{TX|RX}VLAN
From: John Fastabend @ 2010-10-26 21:58 UTC (permalink / raw)
  To: Jesse Gross; +Cc: Ben Hutchings, netdev@vger.kernel.org
In-Reply-To: <AANLkTikJpsFC=ZsTWtxwSp-aQRDoZ9fy1p-ZxR+v8jVm@mail.gmail.com>

On 10/25/2010 3:45 PM, Jesse Gross wrote:
> On Fri, Oct 22, 2010 at 6:00 AM, Ben Hutchings
> <bhutchings@solarflare.com> wrote:
>> On Thu, 2010-10-21 at 15:10 -0700, John Fastabend wrote:
>>> Toggling the vlan tx|rx hw offloads needs to set the hard_header_len
>>> as well otherwise we end up using LL_RESERVED_SPACE incorrectly.
>>> This results in pskb_expand_head() being used unnecessarily.
>>>
>>> This add a check in ethtool_op_set_flags to catch the ETH_FLAG_TXVLAN
>>> flag and set the header length.
>> [...]
>>
>> Note that not every driver that implements the set_flags operation calls
>> back to ethtool_op_set_flags().
> 
> Currently all of the drivers that support toggling this using ethtool
> call into ethtool_op_set_flags.  Even if they don't, things will
> continue to work correctly, albeit with a performance hit, so it's not
> a catastrophe.
> 
> This does assume that drivers which support offloading will start with
> it enabled.  If they don't and just use the non-vlan header length
> then this will drop the header length down even further when
> offloading is enabled.  All current drivers that support toggling do
> start with offloading enabled, so maybe it's not that big a deal.
> 
> Another issue is that cards that don't support vlan offloading at all
> probably won't take the header into account, so they'll get hit every
> time.
> 

The lower layer driver should not include the vlan tag in
hard_header_len because pkts pushed to the real net device will not add
the vlan tag. The vlan device however should increment/dec the len value
depending on if the underlying net device is offloading the vlan tagging.

> When we are using vlan devices we also manually add the vlan header
> length but it doesn't update if we change the underlying device.  It
> seems a little redundant to have to do it in both places.

Right, I think doing this in vlan_transfer_features() should work.
	
> 
> I like that this is generic and independent of vlan devices.
> Hopefully we can figure out these corner cases (or maybe decide that
> they're not important or this is strictly an improvement).

I'll post an update. Thanks for the comments.

-- John




* [RFC][net-next-2.6 PATCH v2] 8021q: set hard_header_len when VLAN offload features are toggled
From: John Fastabend @ 2010-10-26 21:59 UTC (permalink / raw)
  To: jesse; +Cc: john.r.fastabend, netdev, bhutchings

Toggling the vlan tx|rx hw offloads needs to set the hard_header_len
as well otherwise we end up using LL_RESERVED_SPACE incorrectly.
This results in pskb_expand_head() being used unnecessarily.

This adds a check in vlan_transfer_features() to catch the ETH_FLAG_TXVLAN
flag and set the header length. This requires drivers to add
ETH_FLAG_TXVLAN to vlan_features.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
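
The toggle logic in this patch can be tried out in isolation with a small
userspace model. This is a sketch under stated assumptions: struct
vlandev_model, TXVLAN_OFFLOAD and transfer_txvlan() are hypothetical names
for illustration only, and VLAN_HLEN is taken to be the 4-byte 802.1Q tag.
The header length only changes when the offload bit actually flips.

```c
#define VLAN_HLEN      4     /* size of an 802.1Q tag in bytes */
#define TXVLAN_OFFLOAD 0x1u  /* hypothetical feature bit for this model */

/* Hypothetical stand-in for the vlan net_device fields involved. */
struct vlandev_model {
	unsigned int   features;
	unsigned short hard_header_len;
};

/* Apply new features and adjust the header length only on a real toggle. */
void transfer_txvlan(struct vlandev_model *vd, unsigned int new_features)
{
	unsigned int old_features = vd->features;

	vd->features = new_features;

	/* XOR isolates the bits that changed between old and new. */
	if ((old_features ^ new_features) & TXVLAN_OFFLOAD) {
		if (new_features & TXVLAN_OFFLOAD)
			vd->hard_header_len -= VLAN_HLEN; /* hardware inserts tag */
		else
			vd->hard_header_len += VLAN_HLEN; /* software inserts tag */
	}
}
```

Calling it twice with the same feature set leaves hard_header_len alone,
which is the property the XOR test is there to guarantee.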

 net/8021q/vlan.c |   10 ++++++++++
 1 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/net/8021q/vlan.c b/net/8021q/vlan.c
index 05b867e..825011b 100644
--- a/net/8021q/vlan.c
+++ b/net/8021q/vlan.c
@@ -334,6 +334,16 @@ static void vlan_transfer_features(struct net_device *dev,
 	vlandev->features &= ~dev->vlan_features;
 	vlandev->features |= dev->features & dev->vlan_features;
 	vlandev->gso_max_size = dev->gso_max_size;
+
+	/* is ETH_FLAGS_TXVLAN being toggled */
+	if ((vlandev->features & ETH_FLAG_TXVLAN) ^
+	    (old_features & ETH_FLAG_TXVLAN)) {
+		if (vlandev->features & ETH_FLAG_TXVLAN)
+			vlandev->hard_header_len -= VLAN_HLEN;
+		else
+			vlandev->hard_header_len += VLAN_HLEN;
+	}
+
 #if defined(CONFIG_FCOE) || defined(CONFIG_FCOE_MODULE)
 	vlandev->fcoe_ddp_xid = dev->fcoe_ddp_xid;
 #endif



* Re: [RFC][net-next-2.6 PATCH 4/4] net: remove check for headroom in vlan_dev_create
From: John Fastabend @ 2010-10-26 22:05 UTC (permalink / raw)
  To: Jesse Gross; +Cc: netdev@vger.kernel.org
In-Reply-To: <AANLkTinREMV1bRhoJuiRV805-e1NKdZD6ejyhYczpw_E@mail.gmail.com>

On 10/25/2010 3:45 PM, Jesse Gross wrote:
> On Thu, Oct 21, 2010 at 3:10 PM, John Fastabend
> <john.r.fastabend@intel.com> wrote:
>> It is possible for the headroom to be smaller than the
>> hard_header_len for a short period of time after toggling
>> the vlan offload setting.
>>
>> This is not a hard error and skb_cow_head is called in
>> __vlan_put_tag() to resolve this.
>>
>> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
> 
> How is it possible that the hard_header_len changes on the vlan
> device?  It looks like the header length never gets changed after it
> is initialized.  There's no set_flags method in the vlan device to
> toggle whether it is using offloading or not, it just rides on top of
> the underlying device.

You're right, and I think this is why my previous patch was broken. If we
can toggle the underlying offloads we should set the header length as
well. With the updated patch I just sent this should be true now.

Thanks,
John.

> On the other hand, I agree that this check isn't actually necessary.



* [PATCH] OF device tree: Move of_get_mac_address() to a common source file.
From: David Daney @ 2010-10-26 22:07 UTC (permalink / raw)
  To: linux-mips, ralf, devicetree-discuss, grant.likely, linux-kernel
  Cc: David Daney, Michal Simek, Benjamin Herrenschmidt, Wolfram Sang,
	Paul Mackerras, David S. Miller, Corey Minyard, Pantelis Antoniou,
	Vitaly Bordug, Anatolij Gustschin, John Rigby, Wolfgang Denk,
	Anton Vorontsov, Sandeep Gopalpet, Kumar Gala, Li Yang,
	Sergey Matyukevich, Jiri Pirko, Eric Dumazet, Sean MacLennan,
	Sadanand Mutyala, Andres Salomon, microblaze-ucli

There are two identical implementations of of_get_mac_address(), one
each in arch/powerpc/kernel/prom_parse.c and
arch/microblaze/kernel/prom_parse.c.  Move this function to a new
common file of_net.{c,h} and adjust all the callers to include the new
header.

Signed-off-by: David Daney <ddaney@caviumnetworks.com>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Grant Likely <grant.likely@secretlab.ca>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Wolfram Sang <w.sang@pengutronix.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Corey Minyard <cminyard@mvista.com>
Cc: Pantelis Antoniou <pantelis.antoniou@gmail.com>
Cc: Vitaly Bordug <vbordug@ru.mvista.com>
Cc: Anatolij Gustschin <agust@denx.de>
Cc: John Rigby <jcrigby@gmail.com>
Cc: Wolfgang Denk <wd@denx.de>
Cc: Anton Vorontsov <avorontsov@mvista.com>
Cc: Sandeep Gopalpet <Sandeep.Kumar@freescale.com>
Cc: Kumar Gala <galak@kernel.crashing.org>
Cc: Li Yang <leoli@freescale.com>
Cc: Sergey Matyukevich <geomatsi@gmail.com>
Cc: Jiri Pirko <jpirko@redhat.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Sean MacLennan <smaclennan@pikatech.com>
Cc: Sadanand Mutyala <Sadanand.Mutyala@xilinx.com>
Cc: Andres Salomon <dilinger@queued.net>
Cc: microblaze-uclinux@itee.uq.edu.au
Cc: linux-kernel@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: netdev@vger.kernel.org
Cc: devicetree-discuss@lists.ozlabs.org
---

Note: This seems to work for my MIPS/Octeon development, but has been
tested neither on powerpc nor microblaze targets.
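
For readers who want to exercise the lookup order of of_get_mac_address()
outside the kernel, here is a small userspace model. It is a sketch only:
struct prop_model, mac_is_usable() and pick_mac() are hypothetical names,
and mac_is_usable() assumes that a valid address is one that is neither
multicast nor all zeros. The property names and the 6-byte length check
follow the function being moved.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a device tree property. */
struct prop_model {
	const char          *name;
	size_t               length;
	const unsigned char *value;
};

/* Assumed validity rule: not multicast and not all zeros. */
bool mac_is_usable(const unsigned char *addr)
{
	static const unsigned char zero[6];

	return !(addr[0] & 0x01) && memcmp(addr, zero, 6) != 0;
}

/* Try the property names in priority order; NULL if none qualifies. */
const unsigned char *pick_mac(const struct prop_model *props, size_t n)
{
	static const char *const order[] = {
		"mac-address", "local-mac-address", "address"
	};

	for (size_t i = 0; i < sizeof(order) / sizeof(order[0]); i++)
		for (size_t j = 0; j < n; j++)
			if (strcmp(props[j].name, order[i]) == 0 &&
			    props[j].length == 6 &&
			    mac_is_usable(props[j].value))
				return props[j].value;
	return NULL;
}
```

An all-zero 'mac-address' is skipped in this model, so a valid
'local-mac-address' is returned instead.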

 arch/microblaze/include/asm/prom.h  |    3 --
 arch/microblaze/kernel/prom_parse.c |   38 ---------------------------
 arch/powerpc/include/asm/prom.h     |    3 --
 arch/powerpc/kernel/prom_parse.c    |   38 ---------------------------
 arch/powerpc/sysdev/mv64x60_dev.c   |    1 +
 arch/powerpc/sysdev/tsi108_dev.c    |    1 +
 drivers/net/fs_enet/fs_enet-main.c  |    1 +
 drivers/net/gianfar.c               |    1 +
 drivers/net/ucc_geth.c              |    1 +
 drivers/net/xilinx_emaclite.c       |    1 +
 drivers/of/Kconfig                  |    3 ++
 drivers/of/Makefile                 |    1 +
 drivers/of/of_net.c                 |   48 +++++++++++++++++++++++++++++++++++
 include/linux/of_net.h              |   13 +++++++++
 14 files changed, 71 insertions(+), 82 deletions(-)
 create mode 100644 drivers/of/of_net.c
 create mode 100644 include/linux/of_net.h

diff --git a/arch/microblaze/include/asm/prom.h b/arch/microblaze/include/asm/prom.h
index 101fa09..392b3ad 100644
--- a/arch/microblaze/include/asm/prom.h
+++ b/arch/microblaze/include/asm/prom.h
@@ -63,9 +63,6 @@ extern void kdump_move_device_tree(void);
 /* CPU OF node matching */
 struct device_node *of_get_cpu_node(int cpu, unsigned int *thread);
 
-/* Get the MAC address */
-extern const void *of_get_mac_address(struct device_node *np);
-
 /**
  * of_irq_map_pci - Resolve the interrupt for a PCI device
  * @pdev:	the device whose interrupt is to be resolved
diff --git a/arch/microblaze/kernel/prom_parse.c b/arch/microblaze/kernel/prom_parse.c
index 99d9b61..9ae24f4 100644
--- a/arch/microblaze/kernel/prom_parse.c
+++ b/arch/microblaze/kernel/prom_parse.c
@@ -110,41 +110,3 @@ void of_parse_dma_window(struct device_node *dn, const void *dma_window_prop,
 	cells = prop ? *(u32 *)prop : of_n_size_cells(dn);
 	*size = of_read_number(dma_window, cells);
 }
-
-/**
- * Search the device tree for the best MAC address to use.  'mac-address' is
- * checked first, because that is supposed to contain to "most recent" MAC
- * address. If that isn't set, then 'local-mac-address' is checked next,
- * because that is the default address.  If that isn't set, then the obsolete
- * 'address' is checked, just in case we're using an old device tree.
- *
- * Note that the 'address' property is supposed to contain a virtual address of
- * the register set, but some DTS files have redefined that property to be the
- * MAC address.
- *
- * All-zero MAC addresses are rejected, because those could be properties that
- * exist in the device tree, but were not set by U-Boot.  For example, the
- * DTS could define 'mac-address' and 'local-mac-address', with zero MAC
- * addresses.  Some older U-Boots only initialized 'local-mac-address'.  In
- * this case, the real MAC is in 'local-mac-address', and 'mac-address' exists
- * but is all zeros.
-*/
-const void *of_get_mac_address(struct device_node *np)
-{
-	struct property *pp;
-
-	pp = of_find_property(np, "mac-address", NULL);
-	if (pp && (pp->length == 6) && is_valid_ether_addr(pp->value))
-		return pp->value;
-
-	pp = of_find_property(np, "local-mac-address", NULL);
-	if (pp && (pp->length == 6) && is_valid_ether_addr(pp->value))
-		return pp->value;
-
-	pp = of_find_property(np, "address", NULL);
-	if (pp && (pp->length == 6) && is_valid_ether_addr(pp->value))
-		return pp->value;
-
-	return NULL;
-}
-EXPORT_SYMBOL(of_get_mac_address);
diff --git a/arch/powerpc/include/asm/prom.h b/arch/powerpc/include/asm/prom.h
index ae26f2e..98264bf 100644
--- a/arch/powerpc/include/asm/prom.h
+++ b/arch/powerpc/include/asm/prom.h
@@ -63,9 +63,6 @@ struct device_node *of_get_cpu_node(int cpu, unsigned int *thread);
 /* cache lookup */
 struct device_node *of_find_next_cache_node(struct device_node *np);
 
-/* Get the MAC address */
-extern const void *of_get_mac_address(struct device_node *np);
-
 #ifdef CONFIG_NUMA
 extern int of_node_to_nid(struct device_node *device);
 #else
diff --git a/arch/powerpc/kernel/prom_parse.c b/arch/powerpc/kernel/prom_parse.c
index 88334af..c2b7a07 100644
--- a/arch/powerpc/kernel/prom_parse.c
+++ b/arch/powerpc/kernel/prom_parse.c
@@ -117,41 +117,3 @@ void of_parse_dma_window(struct device_node *dn, const void *dma_window_prop,
 	cells = prop ? *(u32 *)prop : of_n_size_cells(dn);
 	*size = of_read_number(dma_window, cells);
 }
-
-/**
- * Search the device tree for the best MAC address to use.  'mac-address' is
- * checked first, because that is supposed to contain to "most recent" MAC
- * address. If that isn't set, then 'local-mac-address' is checked next,
- * because that is the default address.  If that isn't set, then the obsolete
- * 'address' is checked, just in case we're using an old device tree.
- *
- * Note that the 'address' property is supposed to contain a virtual address of
- * the register set, but some DTS files have redefined that property to be the
- * MAC address.
- *
- * All-zero MAC addresses are rejected, because those could be properties that
- * exist in the device tree, but were not set by U-Boot.  For example, the
- * DTS could define 'mac-address' and 'local-mac-address', with zero MAC
- * addresses.  Some older U-Boots only initialized 'local-mac-address'.  In
- * this case, the real MAC is in 'local-mac-address', and 'mac-address' exists
- * but is all zeros.
-*/
-const void *of_get_mac_address(struct device_node *np)
-{
-	struct property *pp;
-
-	pp = of_find_property(np, "mac-address", NULL);
-	if (pp && (pp->length == 6) && is_valid_ether_addr(pp->value))
-		return pp->value;
-
-	pp = of_find_property(np, "local-mac-address", NULL);
-	if (pp && (pp->length == 6) && is_valid_ether_addr(pp->value))
-		return pp->value;
-
-	pp = of_find_property(np, "address", NULL);
-	if (pp && (pp->length == 6) && is_valid_ether_addr(pp->value))
-		return pp->value;
-
-	return NULL;
-}
-EXPORT_SYMBOL(of_get_mac_address);
diff --git a/arch/powerpc/sysdev/mv64x60_dev.c b/arch/powerpc/sysdev/mv64x60_dev.c
index 1398bc4..feaee40 100644
--- a/arch/powerpc/sysdev/mv64x60_dev.c
+++ b/arch/powerpc/sysdev/mv64x60_dev.c
@@ -16,6 +16,7 @@
 #include <linux/mv643xx.h>
 #include <linux/platform_device.h>
 #include <linux/of_platform.h>
+#include <linux/of_net.h>
 #include <linux/dma-mapping.h>
 
 #include <asm/prom.h>
diff --git a/arch/powerpc/sysdev/tsi108_dev.c b/arch/powerpc/sysdev/tsi108_dev.c
index d4d15aa..c2d675b 100644
--- a/arch/powerpc/sysdev/tsi108_dev.c
+++ b/arch/powerpc/sysdev/tsi108_dev.c
@@ -19,6 +19,7 @@
 #include <linux/module.h>
 #include <linux/device.h>
 #include <linux/platform_device.h>
+#include <linux/of_net.h>
 #include <asm/tsi108.h>
 
 #include <asm/system.h>
diff --git a/drivers/net/fs_enet/fs_enet-main.c b/drivers/net/fs_enet/fs_enet-main.c
index d6e3111..acba64b 100644
--- a/drivers/net/fs_enet/fs_enet-main.c
+++ b/drivers/net/fs_enet/fs_enet-main.c
@@ -40,6 +40,7 @@
 #include <linux/of_mdio.h>
 #include <linux/of_platform.h>
 #include <linux/of_gpio.h>
+#include <linux/of_net.h>
 
 #include <linux/vmalloc.h>
 #include <asm/pgtable.h>
diff --git a/drivers/net/gianfar.c b/drivers/net/gianfar.c
index 4f7c3f3..773909b 100644
--- a/drivers/net/gianfar.c
+++ b/drivers/net/gianfar.c
@@ -95,6 +95,7 @@
 #include <linux/phy.h>
 #include <linux/phy_fixed.h>
 #include <linux/of.h>
+#include <linux/of_net.h>
 
 #include "gianfar.h"
 #include "fsl_pq_mdio.h"
diff --git a/drivers/net/ucc_geth.c b/drivers/net/ucc_geth.c
index a4c3f57..f7e370f 100644
--- a/drivers/net/ucc_geth.c
+++ b/drivers/net/ucc_geth.c
@@ -28,6 +28,7 @@
 #include <linux/phy.h>
 #include <linux/workqueue.h>
 #include <linux/of_mdio.h>
+#include <linux/of_net.h>
 #include <linux/of_platform.h>
 
 #include <asm/uaccess.h>
diff --git a/drivers/net/xilinx_emaclite.c b/drivers/net/xilinx_emaclite.c
index ecbbb68..527e1ea 100644
--- a/drivers/net/xilinx_emaclite.c
+++ b/drivers/net/xilinx_emaclite.c
@@ -24,6 +24,7 @@
 #include <linux/of_device.h>
 #include <linux/of_platform.h>
 #include <linux/of_mdio.h>
+#include <linux/of_net.h>
 #include <linux/phy.h>
 
 #define DRIVER_NAME "xilinx_emaclite"
diff --git a/drivers/of/Kconfig b/drivers/of/Kconfig
index aa675eb..8184778 100644
--- a/drivers/of/Kconfig
+++ b/drivers/of/Kconfig
@@ -61,4 +61,7 @@ config OF_MDIO
 	help
 	  OpenFirmware MDIO bus (Ethernet PHY) accessors
 
+config OF_NET
+	def_bool y
+
 endmenu # OF
diff --git a/drivers/of/Makefile b/drivers/of/Makefile
index 7888155..6854ced 100644
--- a/drivers/of/Makefile
+++ b/drivers/of/Makefile
@@ -8,3 +8,4 @@ obj-$(CONFIG_OF_GPIO)   += gpio.o
 obj-$(CONFIG_OF_I2C)	+= of_i2c.o
 obj-$(CONFIG_OF_SPI)	+= of_spi.o
 obj-$(CONFIG_OF_MDIO)	+= of_mdio.o
+obj-$(CONFIG_OF_NET)	+= of_net.o
diff --git a/drivers/of/of_net.c b/drivers/of/of_net.c
new file mode 100644
index 0000000..86f334a
--- /dev/null
+++ b/drivers/of/of_net.c
@@ -0,0 +1,48 @@
+/*
+ * OF helpers for network devices.
+ *
+ * This file is released under the GPLv2
+ *
+ * Initially copied out of arch/powerpc/kernel/prom_parse.c
+ */
+#include <linux/etherdevice.h>
+#include <linux/kernel.h>
+#include <linux/of_net.h>
+
+/**
+ * Search the device tree for the best MAC address to use.  'mac-address' is
+ * checked first, because that is supposed to contain the "most recent" MAC
+ * address. If that isn't set, then 'local-mac-address' is checked next,
+ * because that is the default address.  If that isn't set, then the obsolete
+ * 'address' is checked, just in case we're using an old device tree.
+ *
+ * Note that the 'address' property is supposed to contain a virtual address of
+ * the register set, but some DTS files have redefined that property to be the
+ * MAC address.
+ *
+ * All-zero MAC addresses are rejected, because those could be properties that
+ * exist in the device tree, but were not set by U-Boot.  For example, the
+ * DTS could define 'mac-address' and 'local-mac-address', with zero MAC
+ * addresses.  Some older U-Boots only initialized 'local-mac-address'.  In
+ * this case, the real MAC is in 'local-mac-address', and 'mac-address' exists
+ * but is all zeros.
+*/
+const void *of_get_mac_address(struct device_node *np)
+{
+	struct property *pp;
+
+	pp = of_find_property(np, "mac-address", NULL);
+	if (pp && (pp->length == 6) && is_valid_ether_addr(pp->value))
+		return pp->value;
+
+	pp = of_find_property(np, "local-mac-address", NULL);
+	if (pp && (pp->length == 6) && is_valid_ether_addr(pp->value))
+		return pp->value;
+
+	pp = of_find_property(np, "address", NULL);
+	if (pp && (pp->length == 6) && is_valid_ether_addr(pp->value))
+		return pp->value;
+
+	return NULL;
+}
+EXPORT_SYMBOL(of_get_mac_address);
diff --git a/include/linux/of_net.h b/include/linux/of_net.h
new file mode 100644
index 0000000..7c773ec
--- /dev/null
+++ b/include/linux/of_net.h
@@ -0,0 +1,13 @@
+/*
+ * OF helpers for network devices.
+ *
+ * This file is released under the GPLv2
+ */
+
+#ifndef __LINUX_OF_NET_H
+#define __LINUX_OF_NET_H
+
+#include <linux/of.h>
+const void *of_get_mac_address(struct device_node *np);
+
+#endif /* __LINUX_OF_NET_H */
-- 
1.7.2.3
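
The lookup order that of_get_mac_address() follows can be modelled in
user space. The sketch below is only an illustration: struct fake_prop,
find_prop() and pick_mac() are hypothetical stand-ins for struct property,
of_find_property() and the kernel helper, and mac_is_valid() merely
approximates is_valid_ether_addr() (non-zero and not multicast).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the kernel's struct property. */
struct fake_prop {
	const char *name;
	int length;
	const unsigned char *value;
};

/* Approximation of is_valid_ether_addr(): reject all-zero and multicast. */
static int mac_is_valid(const unsigned char *a)
{
	static const unsigned char zero[6];

	return memcmp(a, zero, 6) != 0 && !(a[0] & 0x01);
}

/* Hypothetical replacement for of_find_property() on a flat array. */
static const struct fake_prop *find_prop(const struct fake_prop *props,
					 size_t n, const char *name)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (strcmp(props[i].name, name) == 0)
			return &props[i];
	return NULL;
}

/* Try the three property names in the same priority order as the patch. */
static const unsigned char *pick_mac(const struct fake_prop *props, size_t n)
{
	static const char *const order[] = {
		"mac-address", "local-mac-address", "address",
	};
	size_t i;

	assert(props != NULL || n == 0);
	for (i = 0; i < sizeof(order) / sizeof(order[0]); i++) {
		const struct fake_prop *pp = find_prop(props, n, order[i]);

		if (pp && pp->length == 6 && mac_is_valid(pp->value))
			return pp->value;
	}
	return NULL;
}
```

With a node whose 'mac-address' is all zeros and whose 'local-mac-address'
holds a real address, pick_mac() skips the first property and returns the
second, which is the older U-Boot case the comment in of_net.c describes.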



* [RFC PATCH] macvlan: Introduce a PASSTHRU mode to takeover the underlying device
From: Sridhar Samudrala @ 2010-10-26 22:19 UTC (permalink / raw)
  To: kaber, Arnd Bergmann; +Cc: netdev, kvm@vger.kernel.org

With the current default macvtap mode, a KVM guest using virtio with a
macvtap backend has the following limitations:
- cannot change/add a MAC address on the guest virtio-net
- cannot create a vlan device on the guest virtio-net
- cannot enable promiscuous mode on the guest virtio-net

This patch introduces a new mode called 'passthru' for creating a
macvlan device. It allows takeover of the underlying device and
passing it to a guest using virtio with a macvtap backend.

Only one macvlan device is allowed in passthru mode. It inherits
the MAC address from the underlying device and puts that device into
promiscuous mode so that it receives and forwards all packets.
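
The "only one macvlan device" rule can be sketched as a small user-space
model. This is an illustration under assumptions, not the driver code:
struct model_port and model_newlink() are hypothetical names, and the
port keeps a simple counter where the driver keeps a list of macvlans.
The mode values match enum macvlan_mode in the patch below.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Same numeric values as enum macvlan_mode in the patch. */
enum model_mode {
	MODE_PRIVATE  = 1,
	MODE_VEPA     = 2,
	MODE_BRIDGE   = 4,
	MODE_PASSTHRU = 8,
};

/* Hypothetical model of a macvlan port on one lower device. */
struct model_port {
	bool passthru;		/* lower device has been taken over */
	size_t nr_vlans;	/* macvlans currently attached */
};

/* Returns 0 on success or -EINVAL if the passthru rule is violated. */
static int model_newlink(struct model_port *port, enum model_mode mode)
{
	assert(port != NULL);

	/* A port that was taken over accepts no further macvlans. */
	if (port->passthru)
		return -EINVAL;

	/* Passthru must be the first and only macvlan on the port. */
	if (mode == MODE_PASSTHRU) {
		if (port->nr_vlans != 0)
			return -EINVAL;
		port->passthru = true;
	}

	port->nr_vlans++;
	return 0;
}
```

Both directions are rejected: a second device on a passthru port, and a
passthru device on a port that already carries other macvlans.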

Thanks
Sridhar

diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index 0ef0eb0..bca3cb7 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -38,6 +38,7 @@ struct macvlan_port {
 	struct hlist_head	vlan_hash[MACVLAN_HASH_SIZE];
 	struct list_head	vlans;
 	struct rcu_head		rcu;
+	bool 			passthru;
 };
 
 #define macvlan_port_get_rcu(dev) \
@@ -169,6 +170,7 @@ static struct sk_buff *macvlan_handle_frame(struct sk_buff *skb)
 			macvlan_broadcast(skb, port, NULL,
 					  MACVLAN_MODE_PRIVATE |
 					  MACVLAN_MODE_VEPA    |
+					  MACVLAN_MODE_PASSTHRU|
 					  MACVLAN_MODE_BRIDGE);
 		else if (src->mode == MACVLAN_MODE_VEPA)
 			/* flood to everyone except source */
@@ -185,7 +187,10 @@ static struct sk_buff *macvlan_handle_frame(struct sk_buff *skb)
 		return skb;
 	}
 
-	vlan = macvlan_hash_lookup(port, eth->h_dest);
+	if (port->passthru)
+		vlan = list_first_entry(&port->vlans, struct macvlan_dev, list);
+	else
+		vlan = macvlan_hash_lookup(port, eth->h_dest);
 	if (vlan == NULL)
 		return skb;
 
@@ -284,6 +289,11 @@ static int macvlan_open(struct net_device *dev)
 	struct net_device *lowerdev = vlan->lowerdev;
 	int err;
 
+	if (vlan->port->passthru) {
+		dev_set_promiscuity(lowerdev, 1);
+		goto hash_add;
+	}
+
 	err = -EBUSY;
 	if (macvlan_addr_busy(vlan->port, dev->dev_addr))
 		goto out;
@@ -296,6 +306,8 @@ static int macvlan_open(struct net_device *dev)
 		if (err < 0)
 			goto del_unicast;
 	}
+
+hash_add:
 	macvlan_hash_add(vlan);
 	return 0;
 
@@ -310,12 +322,18 @@ static int macvlan_stop(struct net_device *dev)
 	struct macvlan_dev *vlan = netdev_priv(dev);
 	struct net_device *lowerdev = vlan->lowerdev;
 
+	if (vlan->port->passthru) {
+		dev_set_promiscuity(lowerdev, -1);
+		goto hash_del;
+	}
+
 	dev_mc_unsync(lowerdev, dev);
 	if (dev->flags & IFF_ALLMULTI)
 		dev_set_allmulti(lowerdev, -1);
 
 	dev_uc_del(lowerdev, dev->dev_addr);
 
+hash_del:
 	macvlan_hash_del(vlan);
 	return 0;
 }
@@ -549,6 +567,7 @@ static int macvlan_port_create(struct net_device *dev)
 	if (port == NULL)
 		return -ENOMEM;
 
+	port->passthru = false;
 	port->dev = dev;
 	INIT_LIST_HEAD(&port->vlans);
 	for (i = 0; i < MACVLAN_HASH_SIZE; i++)
@@ -593,6 +612,7 @@ static int macvlan_validate(struct nlattr *tb[], struct nlattr *data[])
 		case MACVLAN_MODE_PRIVATE:
 		case MACVLAN_MODE_VEPA:
 		case MACVLAN_MODE_BRIDGE:
+		case MACVLAN_MODE_PASSTHRU:
 			break;
 		default:
 			return -EINVAL;
@@ -661,6 +681,10 @@ int macvlan_common_newlink(struct net *src_net, struct net_device *dev,
 	}
 	port = macvlan_port_get(lowerdev);
 
+	/* Only 1 macvlan device can be created in passthru mode */
+	if (port->passthru)
+		return -EINVAL;
+
 	vlan->lowerdev = lowerdev;
 	vlan->dev      = dev;
 	vlan->port     = port;
@@ -671,6 +695,13 @@ int macvlan_common_newlink(struct net *src_net, struct net_device *dev,
 	if (data && data[IFLA_MACVLAN_MODE])
 		vlan->mode = nla_get_u32(data[IFLA_MACVLAN_MODE]);
 
+	if (vlan->mode == MACVLAN_MODE_PASSTHRU) {
+		if (!list_empty(&port->vlans))
+			return -EINVAL;
+		port->passthru = true;
+		memcpy(dev->dev_addr, lowerdev->dev_addr, ETH_ALEN);
+	}
+
 	err = register_netdevice(dev);
 	if (err < 0)
 		goto destroy_port;
diff --git a/include/linux/if_link.h b/include/linux/if_link.h
index 2fc66dd..8454805 100644
--- a/include/linux/if_link.h
+++ b/include/linux/if_link.h
@@ -232,6 +232,7 @@ enum macvlan_mode {
 	MACVLAN_MODE_PRIVATE = 1, /* don't talk to other macvlans */
 	MACVLAN_MODE_VEPA    = 2, /* talk to other ports through ext bridge */
 	MACVLAN_MODE_BRIDGE  = 4, /* talk to bridge ports directly */
+	MACVLAN_MODE_PASSTHRU = 8,/* take over the underlying device */
 };
 
 /* SR-IOV virtual function management section */




* [PATCH iproute2] Add passthru mode and support 'mode' parameter with macvtap devices
From: Sridhar Samudrala @ 2010-10-26 22:19 UTC (permalink / raw)
  To: kaber, Arnd Bergmann; +Cc: netdev, kvm@vger.kernel.org

Support a new 'passthru' mode with macvlan and a 'mode' parameter
with macvtap devices.
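
The mapping between mode keywords and mode values, which the patch spells
out as strcmp() chains in both the macvlan and macvtap files, could also
be expressed as one shared table. The sketch below is a hypothetical
alternative, not what the patch does: mode_table, mode_from_name() and
mode_to_name() are invented names, and the numeric values are copied from
enum macvlan_mode.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Keyword/value pairs; values mirror enum macvlan_mode in the patch. */
static const struct {
	const char *name;
	unsigned int mode;
} mode_table[] = {
	{ "private",  1 },
	{ "vepa",     2 },
	{ "bridge",   4 },
	{ "passthru", 8 },
};

#define MODE_TABLE_LEN (sizeof(mode_table) / sizeof(mode_table[0]))

/* Returns 0 and stores the mode on a match, -1 for an unknown keyword. */
static int mode_from_name(const char *name, unsigned int *mode)
{
	size_t i;

	assert(name != NULL && mode != NULL);
	for (i = 0; i < MODE_TABLE_LEN; i++) {
		if (strcmp(name, mode_table[i].name) == 0) {
			*mode = mode_table[i].mode;
			return 0;
		}
	}
	return -1;
}

/* Reverse lookup used when printing; falls back to "unknown". */
static const char *mode_to_name(unsigned int mode)
{
	size_t i;

	for (i = 0; i < MODE_TABLE_LEN; i++)
		if (mode_table[i].mode == mode)
			return mode_table[i].name;
	return "unknown";
}
```

A single table keeps the parser, the printer and the usage text from
drifting apart when another mode is added.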

Signed-off-by: Sridhar Samudrala <sri@us.ibm.com>

diff --git a/include/linux/if_link.h b/include/linux/if_link.h
index f5bb2dc..23de79e 100644
--- a/include/linux/if_link.h
+++ b/include/linux/if_link.h
@@ -230,6 +230,7 @@ enum macvlan_mode {
 	MACVLAN_MODE_PRIVATE = 1, /* don't talk to other macvlans */
 	MACVLAN_MODE_VEPA    = 2, /* talk to other ports through ext bridge */
 	MACVLAN_MODE_BRIDGE  = 4, /* talk to bridge ports directly */
+	MACVLAN_MODE_PASSTHRU  = 8, /* take over the underlying device */
 };
 
 /* SR-IOV virtual function management section */
diff --git a/ip/Makefile b/ip/Makefile
index 2f223ca..6054e8a 100644
--- a/ip/Makefile
+++ b/ip/Makefile
@@ -3,7 +3,7 @@ IPOBJ=ip.o ipaddress.o ipaddrlabel.o iproute.o iprule.o \
     ipmaddr.o ipmonitor.o ipmroute.o ipprefix.o iptuntap.o \
     ipxfrm.o xfrm_state.o xfrm_policy.o xfrm_monitor.o \
     iplink_vlan.o link_veth.o link_gre.o iplink_can.o \
-    iplink_macvlan.o
+    iplink_macvlan.o iplink_macvtap.o
 
 RTMONOBJ=rtmon.o
 
diff --git a/ip/iplink_macvlan.c b/ip/iplink_macvlan.c
index a3c78bd..97787f9 100644
--- a/ip/iplink_macvlan.c
+++ b/ip/iplink_macvlan.c
@@ -48,6 +48,8 @@ static int macvlan_parse_opt(struct link_util *lu, int argc, char **argv,
 				mode = MACVLAN_MODE_VEPA;
 			else if (strcmp(*argv, "bridge") == 0)
 				mode = MACVLAN_MODE_BRIDGE;
+			else if (strcmp(*argv, "passthru") == 0)
+				mode = MACVLAN_MODE_PASSTHRU;
 			else
 				return mode_arg();
 
@@ -82,6 +84,7 @@ static void macvlan_print_opt(struct link_util *lu, FILE *f, struct rtattr *tb[]
 		  mode == MACVLAN_MODE_PRIVATE ? "private"
 		: mode == MACVLAN_MODE_VEPA    ? "vepa"
 		: mode == MACVLAN_MODE_BRIDGE  ? "bridge"
+		: mode == MACVLAN_MODE_PASSTHRU  ? "passthru"
 		:				 "unknown");
 }
 
diff --git a/ip/iplink_macvtap.c b/ip/iplink_macvtap.c
new file mode 100644
index 0000000..040cc68
--- /dev/null
+++ b/ip/iplink_macvtap.c
@@ -0,0 +1,93 @@
+/*
+ * iplink_macvtap.c	macvtap device support
+ *
+ *              This program is free software; you can redistribute it and/or
+ *              modify it under the terms of the GNU General Public License
+ *              as published by the Free Software Foundation; either version
+ *              2 of the License, or (at your option) any later version.
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/socket.h>
+#include <linux/if_link.h>
+
+#include "rt_names.h"
+#include "utils.h"
+#include "ip_common.h"
+
+static void explain(void)
+{
+	fprintf(stderr,
+		"Usage: ... macvtap mode { private | vepa | bridge | passthru }\n"
+	);
+}
+
+static int mode_arg(void)
+{
+        fprintf(stderr, "Error: argument of \"mode\" must be \"private\", "
+		"\"vepa\", \"bridge\" or \"passthru\"\n");
+        return -1;
+}
+
+static int macvtap_parse_opt(struct link_util *lu, int argc, char **argv,
+			  struct nlmsghdr *n)
+{
+	while (argc > 0) {
+		if (matches(*argv, "mode") == 0) {
+			__u32 mode = 0;
+			NEXT_ARG();
+
+			if (strcmp(*argv, "private") == 0)
+				mode = MACVLAN_MODE_PRIVATE;
+			else if (strcmp(*argv, "vepa") == 0)
+				mode = MACVLAN_MODE_VEPA;
+			else if (strcmp(*argv, "bridge") == 0)
+				mode = MACVLAN_MODE_BRIDGE;
+			else if (strcmp(*argv, "passthru") == 0)
+				mode = MACVLAN_MODE_PASSTHRU;
+			else
+				return mode_arg();
+
+			addattr32(n, 1024, IFLA_MACVLAN_MODE, mode);
+		} else if (matches(*argv, "help") == 0) {
+			explain();
+			return -1;
+		} else {
+			fprintf(stderr, "macvtap: what is \"%s\"?\n", *argv);
+			explain();
+			return -1;
+		}
+		argc--, argv++;
+	}
+
+	return 0;
+}
+
+static void macvtap_print_opt(struct link_util *lu, FILE *f, struct rtattr *tb[])
+{
+	__u32 mode;
+
+	if (!tb)
+		return;
+
+	if (!tb[IFLA_MACVLAN_MODE] ||
+	    RTA_PAYLOAD(tb[IFLA_MACVLAN_MODE]) < sizeof(__u32))
+		return;
+
+	mode = *(__u32 *)RTA_DATA(tb[IFLA_MACVLAN_MODE]);
+	fprintf(f, " mode %s ",
+		  mode == MACVLAN_MODE_PRIVATE ? "private"
+		: mode == MACVLAN_MODE_VEPA    ? "vepa"
+		: mode == MACVLAN_MODE_BRIDGE  ? "bridge"
+		: mode == MACVLAN_MODE_PASSTHRU  ? "passthru"
+		:				 "unknown");
+}
+
+struct link_util macvtap_link_util = {
+	.id		= "macvtap",
+	.maxattr	= IFLA_MACVLAN_MAX,
+	.parse_opt	= macvtap_parse_opt,
+	.print_opt	= macvtap_print_opt,
+};




* Re: [rfc v2 03/10] ipvs network name space aware: conn
From: Simon Horman @ 2010-10-26 22:35 UTC (permalink / raw)
  To: lvs-devel, netdev, netfilter-devel
  Cc: Hans Schillstrom, Julian Anastasov, Daniel Lezcano, Wensong Zhang
In-Reply-To: <20101022202357.344354364@joe.akashicho.tokyo.vergenet.net>

On Fri, Oct 22, 2010 at 10:09:37PM +0200, Simon Horman wrote:
> 
> This patch just contains ip_vs_conn.c
> and does the normal
>  - moving vars to struct ipvs
>  - adding per netns init and exit
> 
> proc_fs required some extra work with adding/changing private data to get the net ptr.
> 
> Signed-off-by: Hans Schillstrom <hans.schillstrom@ericsson.com>

Sorry, I messed this patch up a bit and will repost.

* I still have not addressed any of the problems beyond the
  original scope of my post, which was to rebase Hans's changes.
  In particular I have not addressed any of the issues that
  Julian raised in response to my patches. Hans, are you planning
  to look into that or should I take another stab at things?


* Re: [rfc v2.1 03/10] ipvs network name space aware: conn
From: Simon Horman @ 2010-10-26 22:36 UTC (permalink / raw)
  To: lvs-devel, netdev, netfilter-devel
  Cc: Hans Schillstrom, Julian Anastasov, Daniel Lezcano, Wensong Zhang

This patch just contains ip_vs_conn.c
and does the normal
 - moving vars to struct ipvs
 - adding per netns init and exit

proc_fs required some extra work with adding/changing private data to get the net ptr.
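
The general shape of "moving vars to struct ipvs" with a per-netns init
and exit can be sketched in user space as follows. The names model_net,
model_ipvs, model_ipvs_init(), model_ipvs_exit() and model_conn_new() are
hypothetical, and plain ints stand in for the kernel's atomic_t counters;
the real structures are struct net and struct netns_ipvs.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, trimmed-down counterpart of struct netns_ipvs. */
struct model_ipvs {
	int conn_count;		/* was a file-scope static counter */
	int conn_no_cport_cnt;	/* was a file-scope static counter */
};

/* Hypothetical stand-in for struct net. */
struct model_net {
	struct model_ipvs *ipvs;
};

/* Per-namespace init: every namespace gets its own zeroed state. */
static int model_ipvs_init(struct model_net *net)
{
	assert(net != NULL);
	net->ipvs = calloc(1, sizeof(*net->ipvs));
	return net->ipvs ? 0 : -1;
}

/* Per-namespace exit: release the state owned by this namespace. */
static void model_ipvs_exit(struct model_net *net)
{
	free(net->ipvs);
	net->ipvs = NULL;
}

/* Callers pass the namespace explicitly instead of touching globals. */
static void model_conn_new(struct model_net *net, int no_cport)
{
	net->ipvs->conn_count++;
	if (no_cport)
		net->ipvs->conn_no_cport_cnt++;
}
```

Connections created in one namespace leave the counters of every other
namespace untouched, which is the property the extra struct net argument
in the functions below is there to provide.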

Signed-off-by: Hans Schillstrom <hans.schillstrom@ericsson.com>

--- 

* v2
  Rebase against current nf-next-2.6 (Simon Horman)

* v2.1
  Fix patch-brokenness in v2 (Simon Horman)

Index: lvs-test-2.6/net/netfilter/ipvs/ip_vs_conn.c
===================================================================
--- lvs-test-2.6.orig/net/netfilter/ipvs/ip_vs_conn.c	2010-10-27 06:05:11.000000000 +0900
+++ lvs-test-2.6/net/netfilter/ipvs/ip_vs_conn.c	2010-10-27 06:15:01.000000000 +0900
@@ -56,23 +56,12 @@ MODULE_PARM_DESC(conn_tab_bits, "Set con
 int ip_vs_conn_tab_size;
 int ip_vs_conn_tab_mask;
 
-/*
- *  Connection hash table: for input and output packets lookups of IPVS
- */
-static struct list_head *ip_vs_conn_tab;
-
-/*  SLAB cache for IPVS connections */
-static struct kmem_cache *ip_vs_conn_cachep __read_mostly;
-
-/*  counter for current IPVS connections */
-static atomic_t ip_vs_conn_count = ATOMIC_INIT(0);
-
-/*  counter for no client port connections */
-static atomic_t ip_vs_conn_no_cport_cnt = ATOMIC_INIT(0);
-
 /* random value for IPVS connection hash */
 static unsigned int ip_vs_conn_rnd;
 
+/* cache name cnt */
+static atomic_t conn_cache_nr = ATOMIC_INIT(0);
+
 /*
  *  Fine locking granularity for big connection hash table
  */
@@ -173,8 +162,8 @@ static unsigned int ip_vs_conn_hashkey_c
 {
 	struct ip_vs_conn_param p;
 
-	ip_vs_conn_fill_param(cp->af, cp->protocol, &cp->caddr, cp->cport,
-			      NULL, 0, &p);
+	ip_vs_conn_fill_param(NULL, cp->af, cp->protocol, &cp->caddr,
+			      cp->cport, NULL, 0, &p);
 
 	if (cp->dest && cp->dest->svc->pe) {
 		p.pe = cp->dest->svc->pe;
@@ -189,7 +178,7 @@ static unsigned int ip_vs_conn_hashkey_c
  *	Hashes ip_vs_conn in ip_vs_conn_tab by proto,addr,port.
  *	returns bool success.
  */
-static inline int ip_vs_conn_hash(struct ip_vs_conn *cp)
+static inline int ip_vs_conn_hash(struct net *net, struct ip_vs_conn *cp)
 {
 	unsigned hash;
 	int ret;
@@ -204,7 +193,7 @@ static inline int ip_vs_conn_hash(struct
 	spin_lock(&cp->lock);
 
 	if (!(cp->flags & IP_VS_CONN_F_HASHED)) {
-		list_add(&cp->c_list, &ip_vs_conn_tab[hash]);
+		list_add(&cp->c_list, &net->ipvs->conn_tab[hash]);
 		cp->flags |= IP_VS_CONN_F_HASHED;
 		atomic_inc(&cp->refcnt);
 		ret = 1;
@@ -262,12 +251,13 @@ __ip_vs_conn_in_get(const struct ip_vs_c
 {
 	unsigned hash;
 	struct ip_vs_conn *cp;
+	struct netns_ipvs *ipvs = p->net->ipvs;
 
 	hash = ip_vs_conn_hashkey_param(p, false);
 
 	ct_read_lock(hash);
 
-	list_for_each_entry(cp, &ip_vs_conn_tab[hash], c_list) {
+	list_for_each_entry(cp, &ipvs->conn_tab[hash], c_list) {
 		if (cp->af == p->af &&
 		    ip_vs_addr_equal(p->af, p->caddr, &cp->caddr) &&
 		    ip_vs_addr_equal(p->af, p->vaddr, &cp->vaddr) &&
@@ -286,12 +276,13 @@ __ip_vs_conn_in_get(const struct ip_vs_c
 	return NULL;
 }
 
-struct ip_vs_conn *ip_vs_conn_in_get(const struct ip_vs_conn_param *p)
+struct ip_vs_conn *
+ip_vs_conn_in_get(const struct ip_vs_conn_param *p)
 {
 	struct ip_vs_conn *cp;
 
 	cp = __ip_vs_conn_in_get(p);
-	if (!cp && atomic_read(&ip_vs_conn_no_cport_cnt)) {
+	if (!cp && atomic_read(&p->net->ipvs->conn_no_cport_cnt)) {
 		struct ip_vs_conn_param cport_zero_p = *p;
 		cport_zero_p.cport = 0;
 		cp = __ip_vs_conn_in_get(&cport_zero_p);
@@ -313,16 +304,19 @@ ip_vs_conn_fill_param_proto(int af, cons
 			    struct ip_vs_conn_param *p)
 {
 	__be16 _ports[2], *pptr;
+	struct net *net = dev_net(skb->dev);
 
 	pptr = skb_header_pointer(skb, proto_off, sizeof(_ports), _ports);
 	if (pptr == NULL)
 		return 1;
 
 	if (likely(!inverse))
-		ip_vs_conn_fill_param(af, iph->protocol, &iph->saddr, pptr[0],
+		ip_vs_conn_fill_param(net, af, iph->protocol,
+				      &iph->saddr, pptr[0],
 				      &iph->daddr, pptr[1], p);
 	else
-		ip_vs_conn_fill_param(af, iph->protocol, &iph->daddr, pptr[1],
+		ip_vs_conn_fill_param(net, af, iph->protocol,
+				      &iph->daddr, pptr[1],
 				      &iph->saddr, pptr[0], p);
 	return 0;
 }
@@ -347,12 +341,13 @@ struct ip_vs_conn *ip_vs_ct_in_get(const
 {
 	unsigned hash;
 	struct ip_vs_conn *cp;
+	struct netns_ipvs *ipvs = p->net->ipvs;
 
 	hash = ip_vs_conn_hashkey_param(p, false);
 
 	ct_read_lock(hash);
 
-	list_for_each_entry(cp, &ip_vs_conn_tab[hash], c_list) {
+	list_for_each_entry(cp, &ipvs->conn_tab[hash], c_list) {
 		if (p->pe_data && p->pe->ct_match) {
 			if (p->pe->ct_match(p, cp))
 				goto out;
@@ -394,6 +389,7 @@ struct ip_vs_conn *ip_vs_conn_out_get(co
 {
 	unsigned hash;
 	struct ip_vs_conn *cp, *ret=NULL;
+	struct netns_ipvs *ipvs = p->net->ipvs;
 
 	/*
 	 *	Check for "full" addressed entries
@@ -402,7 +398,7 @@ struct ip_vs_conn *ip_vs_conn_out_get(co
 
 	ct_read_lock(hash);
 
-	list_for_each_entry(cp, &ip_vs_conn_tab[hash], c_list) {
+	list_for_each_entry(cp, &ipvs->conn_tab[hash], c_list) {
 		if (cp->af == p->af &&
 		    ip_vs_addr_equal(p->af, p->vaddr, &cp->caddr) &&
 		    ip_vs_addr_equal(p->af, p->caddr, &cp->daddr) &&
@@ -457,19 +453,19 @@ void ip_vs_conn_put(struct ip_vs_conn *c
 /*
  *	Fill a no_client_port connection with a client port number
  */
-void ip_vs_conn_fill_cport(struct ip_vs_conn *cp, __be16 cport)
+void ip_vs_conn_fill_cport(struct net *net, struct ip_vs_conn *cp, __be16 cport)
 {
 	if (ip_vs_conn_unhash(cp)) {
 		spin_lock(&cp->lock);
 		if (cp->flags & IP_VS_CONN_F_NO_CPORT) {
-			atomic_dec(&ip_vs_conn_no_cport_cnt);
+			atomic_dec(&net->ipvs->conn_no_cport_cnt);
 			cp->flags &= ~IP_VS_CONN_F_NO_CPORT;
 			cp->cport = cport;
 		}
 		spin_unlock(&cp->lock);
 
 		/* hash on new dport */
-		ip_vs_conn_hash(cp);
+		ip_vs_conn_hash(net, cp);
 	}
 }
 
@@ -606,12 +602,12 @@ ip_vs_bind_dest(struct ip_vs_conn *cp, s
  * Check if there is a destination for the connection, if so
  * bind the connection to the destination.
  */
-struct ip_vs_dest *ip_vs_try_bind_dest(struct ip_vs_conn *cp)
+struct ip_vs_dest *ip_vs_try_bind_dest(struct net *net, struct ip_vs_conn *cp)
 {
 	struct ip_vs_dest *dest;
 
 	if ((cp) && (!cp->dest)) {
-		dest = ip_vs_find_dest(cp->af, &cp->daddr, cp->dport,
+		dest = ip_vs_find_dest(net, cp->af, &cp->daddr, cp->dport,
 				       &cp->vaddr, cp->vport,
 				       cp->protocol);
 		ip_vs_bind_dest(cp, dest);
@@ -683,7 +679,7 @@ static inline void ip_vs_unbind_dest(str
  *	If available, return 1, otherwise invalidate this connection
  *	template and return 0.
  */
-int ip_vs_check_template(struct ip_vs_conn *ct)
+int ip_vs_check_template(struct net *net, struct ip_vs_conn *ct)
 {
 	struct ip_vs_dest *dest = ct->dest;
 
@@ -692,7 +688,7 @@ int ip_vs_check_template(struct ip_vs_co
 	 */
 	if ((dest == NULL) ||
 	    !(dest->flags & IP_VS_DEST_F_AVAILABLE) ||
-	    (sysctl_ip_vs_expire_quiescent_template &&
+	    (net->ipvs->sysctl_expire_quiescent_template &&
 	     (atomic_read(&dest->weight) == 0))) {
 		IP_VS_DBG_BUF(9, "check_template: dest not available for "
 			      "protocol %s s:%s:%d v:%s:%d "
@@ -713,7 +709,7 @@ int ip_vs_check_template(struct ip_vs_co
 				ct->dport = htons(0xffff);
 				ct->vport = htons(0xffff);
 				ct->cport = 0;
-				ip_vs_conn_hash(ct);
+				ip_vs_conn_hash(net, ct);
 			}
 		}
 
@@ -770,15 +766,15 @@ static void ip_vs_conn_expire(unsigned l
 			ip_vs_unbind_app(cp);
 		ip_vs_unbind_dest(cp);
 		if (cp->flags & IP_VS_CONN_F_NO_CPORT)
-			atomic_dec(&ip_vs_conn_no_cport_cnt);
-		atomic_dec(&ip_vs_conn_count);
+			atomic_dec(&cp->net->ipvs->conn_no_cport_cnt);
+		atomic_dec(&cp->net->ipvs->conn_count);
 
-		kmem_cache_free(ip_vs_conn_cachep, cp);
+		kmem_cache_free(cp->net->ipvs->conn_cachep, cp);
 		return;
 	}
 
 	/* hash it back to the table */
-	ip_vs_conn_hash(cp);
+	ip_vs_conn_hash(cp->net, cp);
 
   expire_later:
 	IP_VS_DBG(7, "delayed: conn->refcnt-1=%d conn->n_control=%d\n",
@@ -795,9 +791,9 @@ void ip_vs_conn_expire_now(struct ip_vs_
 		mod_timer(&cp->timer, jiffies);
 }
 
-
 /*
- *	Create a new connection entry and hash it into the ip_vs_conn_tab
+ *	Create a new connection entry and hash it into the ip_vs_conn_tab,
+ *	netns ptr will be stored in ip_vs_conn here.
  */
 struct ip_vs_conn *
 ip_vs_conn_new(const struct ip_vs_conn_param *p,
@@ -805,9 +801,12 @@ ip_vs_conn_new(const struct ip_vs_conn_p
 	       struct ip_vs_dest *dest)
 {
 	struct ip_vs_conn *cp;
-	struct ip_vs_protocol *pp = ip_vs_proto_get(p->protocol);
+	struct ip_vs_proto_data *pd = ip_vs_proto_data_get(p->net,
+							   p->protocol);
+	struct ip_vs_protocol *pp;
+	struct netns_ipvs *ipvs = p->net->ipvs;
 
-	cp = kmem_cache_zalloc(ip_vs_conn_cachep, GFP_ATOMIC);
+	cp = kmem_cache_zalloc(ipvs->conn_cachep, GFP_ATOMIC);
 	if (cp == NULL) {
 		IP_VS_ERR_RL("%s(): no memory\n", __func__);
 		return NULL;
@@ -842,9 +841,9 @@ ip_vs_conn_new(const struct ip_vs_conn_p
 	atomic_set(&cp->n_control, 0);
 	atomic_set(&cp->in_pkts, 0);
 
-	atomic_inc(&ip_vs_conn_count);
+	atomic_inc(&ipvs->conn_count);
 	if (flags & IP_VS_CONN_F_NO_CPORT)
-		atomic_inc(&ip_vs_conn_no_cport_cnt);
+		atomic_inc(&ipvs->conn_no_cport_cnt);
 
 	/* Bind the connection with a destination server */
 	ip_vs_bind_dest(cp, dest);
@@ -861,8 +860,12 @@ ip_vs_conn_new(const struct ip_vs_conn_p
 #endif
 		ip_vs_bind_xmit(cp);
 
-	if (unlikely(pp && atomic_read(&pp->appcnt)))
-		ip_vs_bind_app(cp, pp);
+	cp->net = p->net;	/* netns ptr  needed in timer */
+	if (pd) {
+		pp = pd->pp;
+		if (unlikely(pp && atomic_read(&pd->appcnt)))
+			ip_vs_bind_app(p->net, cp, pp);
+	}
 
 	/*
 	 * Allow conntrack to be preserved. By default, conntrack
@@ -871,15 +874,31 @@ ip_vs_conn_new(const struct ip_vs_conn_p
 	 * IP_VS_CONN_F_ONE_PACKET too.
 	 */
 
-	if (ip_vs_conntrack_enabled())
+	if (ip_vs_conntrack_enabled(p->net))
 		cp->flags |= IP_VS_CONN_F_NFCT;
 
 	/* Hash it in the ip_vs_conn_tab finally */
-	ip_vs_conn_hash(cp);
+	ip_vs_conn_hash(p->net, cp);
 
 	return cp;
 }
 
+struct ipvs_private {
+	struct seq_net_private p;
+	void *private;
+};
+
+static inline void ipvs_seq_priv_set(struct seq_file *seq, void *data)
+{
+	struct ipvs_private *ipriv=(struct ipvs_private *)seq->private;
+	ipriv->private = data;
+}
+
+static inline void *ipvs_seq_priv_get(struct seq_file *seq)
+{
+	return ((struct ipvs_private *)seq->private)->private;
+}
+
 /*
  *	/proc/net/ip_vs_conn entries
  */
@@ -889,13 +908,15 @@ static void *ip_vs_conn_array(struct seq
 {
 	int idx;
 	struct ip_vs_conn *cp;
+	struct net *net = seq_file_net(seq);
+	struct netns_ipvs *ipvs = net->ipvs;
 
 	for (idx = 0; idx < ip_vs_conn_tab_size; idx++) {
 		ct_read_lock_bh(idx);
-		list_for_each_entry(cp, &ip_vs_conn_tab[idx], c_list) {
+		list_for_each_entry(cp, &ipvs->conn_tab[idx], c_list) {
 			if (pos-- == 0) {
-				seq->private = &ip_vs_conn_tab[idx];
-			return cp;
+				ipvs_seq_priv_set(seq, &ipvs->conn_tab[idx]);
+				return cp;
 			}
 		}
 		ct_read_unlock_bh(idx);
@@ -906,15 +927,17 @@ static void *ip_vs_conn_array(struct seq
 
 static void *ip_vs_conn_seq_start(struct seq_file *seq, loff_t *pos)
 {
-	seq->private = NULL;
+	ipvs_seq_priv_set(seq, NULL);
 	return *pos ? ip_vs_conn_array(seq, *pos - 1) :SEQ_START_TOKEN;
 }
-
+ /* netns: conn_tab OK */
 static void *ip_vs_conn_seq_next(struct seq_file *seq, void *v, loff_t *pos)
 {
 	struct ip_vs_conn *cp = v;
-	struct list_head *e, *l = seq->private;
+	struct list_head *e, *l = ipvs_seq_priv_get(seq);
 	int idx;
+	struct net *net = seq_file_net(seq);
+	struct netns_ipvs *ipvs = net->ipvs;
 
 	++*pos;
 	if (v == SEQ_START_TOKEN)
@@ -924,27 +947,28 @@ static void *ip_vs_conn_seq_next(struct
 	if ((e = cp->c_list.next) != l)
 		return list_entry(e, struct ip_vs_conn, c_list);
 
-	idx = l - ip_vs_conn_tab;
+	idx = l - ipvs->conn_tab;
 	ct_read_unlock_bh(idx);
 
 	while (++idx < ip_vs_conn_tab_size) {
 		ct_read_lock_bh(idx);
-		list_for_each_entry(cp, &ip_vs_conn_tab[idx], c_list) {
-			seq->private = &ip_vs_conn_tab[idx];
+		list_for_each_entry(cp, &ipvs->conn_tab[idx], c_list) {
+			ipvs_seq_priv_set(seq, &ipvs->conn_tab[idx]);
 			return cp;
 		}
 		ct_read_unlock_bh(idx);
 	}
-	seq->private = NULL;
+	ipvs_seq_priv_set(seq, NULL);
 	return NULL;
 }
-
+/* netns: conn_tab OK */
 static void ip_vs_conn_seq_stop(struct seq_file *seq, void *v)
 {
-	struct list_head *l = seq->private;
+	struct list_head *l = ipvs_seq_priv_get(seq);
+	struct net *net = seq_file_net(seq);
 
 	if (l)
-		ct_read_unlock_bh(l - ip_vs_conn_tab);
+		ct_read_unlock_bh(l - net->ipvs->conn_tab);
 }
 
 static int ip_vs_conn_seq_show(struct seq_file *seq, void *v)
@@ -1004,7 +1028,16 @@ static const struct seq_operations ip_vs
 
 static int ip_vs_conn_open(struct inode *inode, struct file *file)
 {
-	return seq_open(file, &ip_vs_conn_seq_ops);
+	int ret;
+	struct ipvs_private *priv;
+
+	ret = seq_open_net(inode, file, &ip_vs_conn_seq_ops,
+			   sizeof(struct ipvs_private));
+	if (!ret) {
+		priv = ((struct seq_file *)file->private_data)->private;
+		priv->private = NULL;
+	}
+	return ret;
 }
 
 static const struct file_operations ip_vs_conn_fops = {
@@ -1012,7 +1045,8 @@ static const struct file_operations ip_v
 	.open    = ip_vs_conn_open,
 	.read    = seq_read,
 	.llseek  = seq_lseek,
-	.release = seq_release,
+	.release = seq_release_private,
+
 };
 
 static const char *ip_vs_origin_name(unsigned flags)
@@ -1067,7 +1101,17 @@ static const struct seq_operations ip_vs
 
 static int ip_vs_conn_sync_open(struct inode *inode, struct file *file)
 {
-	return seq_open(file, &ip_vs_conn_sync_seq_ops);
+	int ret;
+	struct ipvs_private *ipriv;
+
+	ret = seq_open_net(inode, file, &ip_vs_conn_sync_seq_ops,
+			   sizeof(struct ipvs_private));
+	if (!ret) {
+		ipriv = ((struct seq_file *)file->private_data)->private;
+		ipriv->private = NULL;
+	}
+	return ret;
+//	return seq_open(file, &ip_vs_conn_sync_seq_ops);
 }
 
 static const struct file_operations ip_vs_conn_sync_fops = {
@@ -1075,7 +1119,7 @@ static const struct file_operations ip_v
 	.open    = ip_vs_conn_sync_open,
 	.read    = seq_read,
 	.llseek  = seq_lseek,
-	.release = seq_release,
+	.release = seq_release_net,
 };
 
 #endif
@@ -1112,11 +1156,14 @@ static inline int todrop_entry(struct ip
 	return 1;
 }
 
-/* Called from keventd and must protect itself from softirqs */
-void ip_vs_random_dropentry(void)
+/* Called from keventd and must protect itself from softirqs
+ * netns: conn_tab OK
+ */
+void ip_vs_random_dropentry(struct net *net)
 {
 	int idx;
 	struct ip_vs_conn *cp;
+	struct netns_ipvs *ipvs = net->ipvs;
 
 	/*
 	 * Randomly scan 1/32 of the whole table every second
@@ -1129,7 +1176,7 @@ void ip_vs_random_dropentry(void)
 		 */
 		ct_write_lock_bh(hash);
 
-		list_for_each_entry(cp, &ip_vs_conn_tab[hash], c_list) {
+		list_for_each_entry(cp, &ipvs->conn_tab[hash], c_list) {
 			if (cp->flags & IP_VS_CONN_F_TEMPLATE)
 				/* connection template */
 				continue;
@@ -1167,11 +1214,13 @@ void ip_vs_random_dropentry(void)
 
 /*
  *      Flush all the connection entries in the ip_vs_conn_tab
+ * netns: conn_tab OK
  */
-static void ip_vs_conn_flush(void)
+static void ip_vs_conn_flush(struct net *net)
 {
 	int idx;
 	struct ip_vs_conn *cp;
+	struct netns_ipvs *ipvs = net->ipvs;
 
   flush_again:
 	for (idx = 0; idx < ip_vs_conn_tab_size; idx++) {
@@ -1180,7 +1229,7 @@ static void ip_vs_conn_flush(void)
 		 */
 		ct_write_lock_bh(idx);
 
-		list_for_each_entry(cp, &ip_vs_conn_tab[idx], c_list) {
+		list_for_each_entry(cp, &ipvs->conn_tab[idx], c_list) {
 
 			IP_VS_DBG(4, "del connection\n");
 			ip_vs_conn_expire_now(cp);
@@ -1194,16 +1243,17 @@ static void ip_vs_conn_flush(void)
 
 	/* the counter may be not NULL, because maybe some conn entries
 	   are run by slow timer handler or unhashed but still referred */
-	if (atomic_read(&ip_vs_conn_count) != 0) {
+	if (atomic_read(&ipvs->conn_count) != 0) {
 		schedule();
 		goto flush_again;
 	}
 }
 
 
-int __init ip_vs_conn_init(void)
+int __net_init __ip_vs_conn_init(struct net *net)
 {
 	int idx;
+	struct netns_ipvs *ipvs = net->ipvs;
 
 	/* Compute size and mask */
 	ip_vs_conn_tab_size = 1 << ip_vs_conn_tab_bits;
@@ -1212,19 +1262,26 @@ int __init ip_vs_conn_init(void)
 	/*
 	 * Allocate the connection hash table and initialize its list heads
 	 */
-	ip_vs_conn_tab = vmalloc(ip_vs_conn_tab_size *
+	ipvs->conn_tab = vmalloc(ip_vs_conn_tab_size *
 				 sizeof(struct list_head));
-	if (!ip_vs_conn_tab)
+	if (!ipvs->conn_tab)
 		return -ENOMEM;
 
 	/* Allocate ip_vs_conn slab cache */
-	ip_vs_conn_cachep = kmem_cache_create("ip_vs_conn",
+	/* TODO: find a better way to name the cache */
+	snprintf(ipvs->conn_cname, sizeof(ipvs->conn_cname),
+			"ipvs_conn_%d", atomic_read(&conn_cache_nr));
+	atomic_inc(&conn_cache_nr);
+
+	ipvs->conn_cachep = kmem_cache_create(ipvs->conn_cname,
 					      sizeof(struct ip_vs_conn), 0,
 					      SLAB_HWCACHE_ALIGN, NULL);
-	if (!ip_vs_conn_cachep) {
-		vfree(ip_vs_conn_tab);
+	if (!ipvs->conn_cachep) {
+		vfree(ipvs->conn_tab);
 		return -ENOMEM;
 	}
+	atomic_set(&ipvs->conn_count, 0);
+	atomic_set(&ipvs->conn_no_cport_cnt, 0);
 
 	pr_info("Connection hash table configured "
 		"(size=%d, memory=%ldKbytes)\n",
@@ -1234,31 +1291,46 @@ int __init ip_vs_conn_init(void)
 		  sizeof(struct ip_vs_conn));
 
 	for (idx = 0; idx < ip_vs_conn_tab_size; idx++) {
-		INIT_LIST_HEAD(&ip_vs_conn_tab[idx]);
+		INIT_LIST_HEAD(&ipvs->conn_tab[idx]);
 	}
 
 	for (idx = 0; idx < CT_LOCKARRAY_SIZE; idx++)  {
 		rwlock_init(&__ip_vs_conntbl_lock_array[idx].l);
 	}
 
-	proc_net_fops_create(&init_net, "ip_vs_conn", 0, &ip_vs_conn_fops);
-	proc_net_fops_create(&init_net, "ip_vs_conn_sync", 0, &ip_vs_conn_sync_fops);
-
-	/* calculate the random value for connection hash */
-	get_random_bytes(&ip_vs_conn_rnd, sizeof(ip_vs_conn_rnd));
+	proc_net_fops_create(net, "ip_vs_conn", 0, &ip_vs_conn_fops);
+	proc_net_fops_create(net, "ip_vs_conn_sync", 0, &ip_vs_conn_sync_fops);
 
 	return 0;
 }
+/* Cleanup and release all netns related ... */
+static void __net_exit __ip_vs_conn_cleanup(struct net *net) {
 
+	/* flush all the connection entries first */
+	ip_vs_conn_flush(net);
+	/* Release the empty cache */
+	kmem_cache_destroy(net->ipvs->conn_cachep);
+	proc_net_remove(net, "ip_vs_conn");
+	proc_net_remove(net, "ip_vs_conn_sync");
+	vfree(net->ipvs->conn_tab);
+}
+static struct pernet_operations ipvs_conn_ops = {
+	.init = __ip_vs_conn_init,
+	.exit = __ip_vs_conn_cleanup,
+};
 
-void ip_vs_conn_cleanup(void)
+int __init ip_vs_conn_init(void)
 {
-	/* flush all the connection entries first */
-	ip_vs_conn_flush();
+	int rv;
 
-	/* Release the empty cache */
-	kmem_cache_destroy(ip_vs_conn_cachep);
-	proc_net_remove(&init_net, "ip_vs_conn");
-	proc_net_remove(&init_net, "ip_vs_conn_sync");
-	vfree(ip_vs_conn_tab);
+	rv = register_pernet_subsys(&ipvs_conn_ops);
+
+	/* calculate the random value for connection hash */
+	get_random_bytes(&ip_vs_conn_rnd, sizeof(ip_vs_conn_rnd));
+	return rv;
+}
+
+void ip_vs_conn_cleanup(void)
+{
+	unregister_pernet_subsys(&ipvs_conn_ops);
 }
Index: lvs-test-2.6/include/net/ip_vs.h
===================================================================
--- lvs-test-2.6.orig/include/net/ip_vs.h	2010-10-27 06:05:30.000000000 +0900
+++ lvs-test-2.6/include/net/ip_vs.h	2010-10-27 07:30:20.000000000 +0900
@@ -1066,9 +1066,9 @@ static inline void ip_vs_notrack(struct
  *      Netfilter connection tracking
  *      (from ip_vs_nfct.c)
  */
-static inline int ip_vs_conntrack_enabled(void)
+static inline int ip_vs_conntrack_enabled(struct net *net)
 {
-	return sysctl_ip_vs_conntrack;
+	return net->ipvs->sysctl_conntrack;
 }
 
 extern void ip_vs_update_conntrack(struct sk_buff *skb, struct ip_vs_conn *cp,
@@ -1081,7 +1081,7 @@ extern void ip_vs_conn_drop_conntrack(st
 
 #else
 
-static inline int ip_vs_conntrack_enabled(void)
+static inline int ip_vs_conntrack_enabled(struct net *net)
 {
 	return 0;
 }

