Netdev List

* Re: [PATCH net-next-2.6] sfc: Don't try to set filters with search depths we know won't work
From: David Miller @ 2010-10-08 17:36 UTC (permalink / raw)
  To: bhutchings; +Cc: netdev, linux-net-drivers
In-Reply-To: <1286473831.2271.26.camel@achroite.uk.solarflarecom.com>

From: Ben Hutchings <bhutchings@solarflare.com>
Date: Thu, 07 Oct 2010 18:50:31 +0100

> The filter engine will time-out and ignore filters beyond
> 200-something hops.  We also need to avoid infinite loops in
> efx_filter_search() when the table is full.
> 
> Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>

Applied, thanks.
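
The two points in the commit message (a hard limit on search depth, and termination when the table is full) can be illustrated with a minimal user-space sketch. This is not the sfc driver's code: `filter_search()` here, the linear probing, `TABLE_SIZE` and `MAX_SEARCH_DEPTH` are all hypothetical stand-ins chosen to keep the example small.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TABLE_SIZE 8u          /* hypothetical table size (power of two) */
#define MAX_SEARCH_DEPTH 5u    /* hypothetical limit the hardware would honour */

struct filter_slot {
	int used;
	uint32_t key;
};

/*
 * Find the slot that holds `key`, or the first free slot for it.
 * Returns the slot index and stores the depth reached, or returns -1
 * if neither is found within MAX_SEARCH_DEPTH probes.  The loop is
 * bounded by the depth limit, so it terminates even when every slot
 * is occupied.
 */
static int filter_search(const struct filter_slot *table, uint32_t key,
			 unsigned int *depth_out)
{
	unsigned int depth;

	assert(table != NULL && depth_out != NULL);

	for (depth = 1; depth <= MAX_SEARCH_DEPTH; depth++) {
		unsigned int idx = (key + depth - 1) & (TABLE_SIZE - 1);

		if (!table[idx].used || table[idx].key == key) {
			*depth_out = depth;
			return (int)idx;
		}
	}
	return -1;
}
```

A caller would refuse to insert a filter when the function returns -1, instead of writing an entry at a depth the lookup engine would never reach.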

* Re: [PATCH net-next-2.6] net: Update kernel-doc for netif_set_real_num_rx_queues()
From: David Miller @ 2010-10-08 17:34 UTC (permalink / raw)
  To: bhutchings; +Cc: netdev, john.r.fastabend, therbert, eric.dumazet
In-Reply-To: <1286456891.2271.1.camel@achroite.uk.solarflarecom.com>

From: Ben Hutchings <bhutchings@solarflare.com>
Date: Thu, 07 Oct 2010 14:08:11 +0100

> Synchronise the comment with the preceding implementation change.
> 
> Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>

Applied, thanks Ben.

* Re: BUG ? ipip unregister_netdevice_many()
From: Eric W. Biederman @ 2010-10-08 17:32 UTC (permalink / raw)
  To: David Miller; +Cc: hans.schillstrom, daniel.lezcano, netdev
In-Reply-To: <20101008.102012.226761665.davem@davemloft.net>

David Miller <davem@davemloft.net> writes:

> From: ebiederm@xmission.com (Eric W. Biederman)
> Date: Fri, 08 Oct 2010 09:45:15 -0700
>
>> My hunch is that we have dst entry problems, as I know those hop network
>> interfaces when we destroy network devices, but I have seen weird issues
>> with the route cache as well.
>
> While we're on this topic, can someone explain to me what the special
> CONFIG_NET_NS code in net/ipv4/route.c:rt_do_flush() is trying to
> accomplish?
>
> If the issue is that there is an implicit ordering of releasing of
> 'dst' entries that must be maintained, we really ought to formalize
> it (f.e. with dependency pointers or something like that).

It is just dealing with not flushing the entire routing cache, just the
routes that have expired.  Which prevents one network namespace from
flushing its routes and DOS'ing another.

The practical consequence is that the hash chains have to be picked
apart with some entries kept and some released based upon
rt_is_expired().

I went through it a year or so ago with a fine tooth comb and it made
sense and seems correct, but it does seem overly convoluted.
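
To make "picking the hash chains apart" concrete, here is a minimal user-space sketch of splitting one singly linked chain into kept and released entries. It is not the code in rt_do_flush(): `struct rt_entry`, `entry_is_expired()` and the generation-id comparison are hypothetical stand-ins for the real route cache entry and rt_is_expired().

```c
#include <assert.h>
#include <stddef.h>

struct rt_entry {               /* hypothetical stand-in for a cached route */
	struct rt_entry *next;
	int genid;              /* generation the entry was created in */
};

/* Hypothetical predicate, loosely modelled on the idea of rt_is_expired(). */
static int entry_is_expired(const struct rt_entry *e, int current_genid)
{
	return e->genid != current_genid;
}

/*
 * Unlink every expired entry from *chain and return them as a separate
 * list.  Live entries stay on the chain in their original order, so
 * another namespace's valid routes are left untouched.
 */
static struct rt_entry *split_expired(struct rt_entry **chain, int current_genid)
{
	struct rt_entry *expired = NULL;
	struct rt_entry **tail = &expired;
	struct rt_entry **pprev = chain;
	struct rt_entry *e;

	assert(chain != NULL);

	while ((e = *pprev) != NULL) {
		if (entry_is_expired(e, current_genid)) {
			*pprev = e->next;   /* unlink from the live chain */
			e->next = NULL;
			*tail = e;          /* append to the expired list */
			tail = &e->next;
		} else {
			pprev = &e->next;
		}
	}
	return expired;
}
```

The returned list can then be freed outside whatever lock protects the chain, which is one reason to collect the expired entries instead of freeing them in place.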

Eric

* Re: [PATCH] net/fec: carrier off initially to avoid root mount failure
From: David Miller @ 2010-10-08 17:31 UTC (permalink / raw)
  To: oskar; +Cc: netdev, dan, bigeasy, hjk
In-Reply-To: <1286454630-7396-1-git-send-email-oskar@linutronix.de>

From: Oskar Schirmer <oskar@linutronix.de>
Date: Thu,  7 Oct 2010 14:30:30 +0200

> with hardware slow in negotiation, the system did freeze
> while trying to mount root on nfs at boot time.
> 
> the link state has not been initialised so network stack
> tried to start transmission right away. this caused instant
> retries, as the driver solely stated business upon link down,
> rendering the system unusable.
> 
> notify carrier off initially to prevent transmission until
> phylib will report link up.
> 
> Signed-off-by: Oskar Schirmer <oskar@linutronix.de>

Maybe fs_enet_open() is a better place for this netif_carrier_off()
call?

On every open the driver probes the PHY and calls phy_start().

* Re: BUG ? ipip unregister_netdevice_many()
From: Daniel Lezcano @ 2010-10-08 17:29 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Hans Schillstrom, netdev@vger.kernel.org
In-Reply-To: <m1k4ls1v1i.fsf@fess.ebiederm.org>

On 10/08/2010 06:58 PM, Eric W. Biederman wrote:
> Daniel Lezcano<daniel.lezcano@free.fr>  writes:
>
>    
>> On 10/08/2010 05:53 PM, Daniel Lezcano wrote:
>>      
>>> On 10/08/2010 02:28 PM, Hans Schillstrom wrote:
>>>        
>>>> Hi Eric,
>>>> Any advice how to trace this down ?
>>>> This rollback_registered_many() seems to have on the lists before...
>>>> All IPv4 and IPv6 tunnels causes this crash, all you have to do is
>>>> load the tunnel module(s)
>>>> enter a new ns and exit from it.
>>>>
>>>> Have not tested any more devices than tunnels,
>>>> I did an "ip link delete" on my macvlans before exiting the ns.
>>>>          
>>> Ah ! I succeed to reproduce it.
>>> It does not appear immediately in fact.
>>>
>>> I am trying to simplify the configuration but I am falling in the bug I
>>> talked about in the previous email.
>>>        
>> Ok, so after investigating, we just need a macvlan and specify an ipv6 address
>> for it (inside a new netns of course), and the loopback is not released. I
>> compiled out the tunnels, so they are not related to this problem I think.
>>
>> That reduces the scope of investigation :)
>>      
> This reproduces the unable to free netdevice problem?  Or the bad
> pointer in macvlan_close problem?
>    

The free netdevice problem.

For the macvlan problem, you have to create 2 macvlans on the *same*
physical interface, move them to the network namespace and then exit
this net_ns. You have to repeat this operation several times because
the problem appears randomly after a few iterations.

* Re: [PATCH v2] Add Qualcomm Gobi 2000 driver.
From: David Miller @ 2010-10-08 17:28 UTC (permalink / raw)
  To: ellyjones; +Cc: netdev, jglasgow, mjg59, msb, olofj
In-Reply-To: <20101006151208.GB1571@google.com>

From: Elly Jones <ellyjones@google.com>
Date: Wed, 6 Oct 2010 11:12:09 -0400

> From: Elizabeth Jones <ellyjones@google.com>
> 
> This driver is a rewrite of the original Qualcomm GPL driver, released as part
> of Qualcomm's "Code Aurora" initiative. The driver has been transformed into
> Linux kernel style and made to use kernel APIs where appropriate; some bugs have
> also been fixed. Note that the device in question requires firmware and a
> firmware loader; the latter has been written by mjg (see
> http://www.codon.org.uk/~mjg59/gobi_loader/).
> 
> Special thanks go to Joe Perches <joe@perches.com> for major cleanup.
> 
> Signed-off-by: Elizabeth Jones <ellyjones@google.com>
> Signed-off-by: Jason Glasgow <jglasgow@google.com>

I really think the firmware handling belongs in the kernel.

I've looked at the gobi_loader code and it's simpler than many
of the gigabit ethernet driver firmware loader sequences in
the tree already.

Requiring udev et al. magic only makes this networking device
that much harder to use from an initrd.

I understand how this might be a bit clumsy since we'd need to make a
dependency on the serial device since that is the mechanism by which
the firmware is uploaded, but really I'd like you to consider it
seriously.

* Re: [patch] isdn: strcpy() => strlcpy()
From: David Miller @ 2010-10-08 17:23 UTC (permalink / raw)
  To: error27; +Cc: viro, isdn, netdev, kernel-janitors
In-Reply-To: <20101006051735.GD5409@bicker>

From: Dan Carpenter <error27@gmail.com>
Date: Wed, 6 Oct 2010 07:17:35 +0200

> setup.phone and setup.eazmsn are 32 character buffers.
> rcvmsg.msg_data.byte_array is a 48 character buffer.
> sc_adapter[card]->channel[rcvmsg.phy_link_no - 1].dn is 50 chars.
> 
> The rcvmsg struct comes from the memcpy_fromio() in receivemessage().
> I guess that means it's data off the wire.  I'm not very familiar with
> this code but I don't see any reason to assume these strings are NULL
> terminated.
> 
> Also it's weird that "dn" in a 50 character buffer but we only seem to
> use 32 characters.  In drivers/isdn/sc/scioc.h, "dn" is only a 49
> character buffer.  So potentially there is still an issue there.
> 
> The important thing for now is to prevent the memory corruption.
> 
> Signed-off-by: Dan Carpenter <error27@gmail.com>

Applied, thanks Dan.
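
For readers following the reasoning in the commit message, a minimal user-space sketch of a bounded copy from a buffer that may lack a terminator. `bounded_copy()` is a hypothetical helper, not the kernel's strlcpy() and not what the patch uses. One caveat that motivates the explicit source length: strlcpy() implementations typically call strlen() on the source to compute their return value, so they protect the destination but can still read past an unterminated source.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy at most src_len bytes from a possibly unterminated buffer into
 * dst, always leaving dst NUL-terminated.  Returns the number of
 * characters copied, excluding the terminator.
 */
static size_t bounded_copy(char *dst, size_t dst_size,
			   const char *src, size_t src_len)
{
	const char *nul;
	size_t n;

	assert(dst != NULL && src != NULL && dst_size > 0);

	/* Never look beyond src_len bytes of the source. */
	nul = memchr(src, '\0', src_len);
	n = nul ? (size_t)(nul - src) : src_len;
	if (n > dst_size - 1)
		n = dst_size - 1;      /* truncate rather than overflow */
	memcpy(dst, src, n);
	dst[n] = '\0';
	return n;
}
```

With a 48-byte source and a 32-byte destination, as in the sizes quoted above, the copy stops at 31 characters plus the terminator.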

* Re: BUG ? ipip unregister_netdevice_many()
From: David Miller @ 2010-10-08 17:20 UTC (permalink / raw)
  To: ebiederm; +Cc: hans.schillstrom, daniel.lezcano, netdev
In-Reply-To: <m1y6a81vms.fsf@fess.ebiederm.org>

From: ebiederm@xmission.com (Eric W. Biederman)
Date: Fri, 08 Oct 2010 09:45:15 -0700

> My hunch is that we have dst entry problems, as I know those hop network
> interfaces when we destroy network devices, but I have seen weird issues
> with the route cache as well.

While we're on this topic, can someone explain to me what the special
CONFIG_NET_NS code in net/ipv4/route.c:rt_do_flush() is trying to
accomplish?

If the issue is that there is an implicit ordering of releasing of
'dst' entries that must be maintained, we really ought to formalize
it (f.e. with dependency pointers or something like that).

* Re: powerpc, fs_enet: scanning PHY after Linux is up
From: Grant Likely @ 2010-10-08 17:06 UTC (permalink / raw)
  To: Holger brunck; +Cc: hs, linuxppc-dev, netdev, devicetree-discuss, Detlev Zundel
In-Reply-To: <4CAEDB6A.70600@keymile.com>

On Fri, Oct 08, 2010 at 10:50:50AM +0200, Holger brunck wrote:
> Hi Grant,
> 
> On 10/06/2010 06:52 PM, Grant Likely wrote:
> > On Wed, Oct 6, 2010 at 3:53 AM, Heiko Schocher <hs@denx.de> wrote:
> >>>> So, the question is, is there a possibility to solve this problem?
> >>>>
> >>>> If there is no standard option, what would be with adding a
> >>>> "scan_phy" file in
> >>>>
> >>>> /proc/device-tree/soc\@f0000000/cpm\@119c0/mdio\@10d40
> >>>> (or better destination?)
> >>>>
> >>>> which with we could rescan a PHY with
> >>>> "echo addr > /proc/device-tree/soc\@f0000000/cpm\@119c0/mdio\@10d40/scan_phy"
> >>>> (so there is no need for using of_find_node_by_path(), as we should
> >>>>  have the associated device node here, and can step through the child
> >>>>  nodes with "for_each_child_of_node(np, child)" and check if reg == addr)
> >>>>
> >>>> or shouldn't be at least, if the phy couldn't be found when opening
> >>>> the port, retrigger a scanning, if the phy now is accessible?
> >>>
> >>> One option would be to still register a phy_device for each phy
> >>> described in the device tree, but defer binding a driver to each phy
> >>> that doesn't respond.  Then at of_phy_find_device() time, if it
> >>
> >> Maybe I didn't get the trick, but the problem is, that
> >> you can't register a phy_device in drivers/of/of_mdio.c
> >> of_mdiobus_register(), if the phy didn't respond with the
> >> phy_id ... and of_phy_find_device() is not (yet) used in fs_enet
> > 
> > I'm suggesting modifying the phy layer so that it is possible to
> > register a phy_device that doesn't (yet) respond.
> > 
> 
> yes this sounds reasonable.
> 
> >>> matches with a phy_device that isn't bound to a driver yet, then
> >>> re-trigger the binding operation.  At which point the phy id can be
> >>> probed and the correct driver can be chosen.  If binding succeeds,
> >>> then return the phy_device handle.  If not, then fail as it currently
> >>> does.
> >>
> >> Wouldn't it be good, just if we need a PHY (on calling fs_enet_open)
> >> to look if there is one?
> >>
> >> Something like that (not tested):
> >>
> >> in drivers/net/fs_enet/fs_enet-main.c in fs_init_phy()
> >> called from fs_enet_open():
> >>
> >> Do first:
> >> phydev =  of_phy_find_device(fep->fpi->phy_node);
> >>
> >> Look if there is a driver (phy_dev->drv == NULL ?)
> >>
> >> If not, call new function
> >> of_mdiobus_register_phy(mii_bus, fep->fpi->phy_node)
> >> see below patch for it.
> >>
> >> If this succeeds, all is OK, and we can use this phy,
> >> else ethernet not work.
> > 
> > I don't like this approach because it muddies the concept of which
> > device is actually responsible for managing the phys on the bus.  Is
> > it managed by the mdio bus device or the Ethernet device?  It also has
> > a potential race condition.  Whereas triggering a late driver bind
> > will be safe.
> > 
> > Alternately, I'd also be okay with a common method to trigger a
> > reprobe of a particular phy from userspace, but I fear that would be a
> > significantly more complex solution.
> > 
> >>
> >> !!just no idea, how to get mii_bus pointer ...
> > 
> > You'd have to get the parent of the phy node, and then loop over all
> > the registered mdio busses looking for a bus that uses that node.
> > 
> 
> you say that you don't like the approach to probe the phy again in fs_enet_open,
> but currently I don't understand what would be the alternate trigger point to
> rescan the mdio bus?

Same trigger point, but different operation.  At fs_enet_open time,
instead of registering the phy_device, the phy layer could sanity
check the already registered phy_device, and refuse to connect to it
if the phy isn't responding.  If it is responding, then it could
re-attempt binding a phy_driver to it (although I just realized that
this has other problems, such as correct module loading.  See below)

> I made a first patch to enhance the phy_device structure and rescan the mdio bus
> at time of fs_enet_open (because I didn't see a better trigger point). The
> advantage is that we got the mii_bus pointer and the phy addr stored in the
> already created phy device structure and is therefore easy to use. See the patch
> below for these modifications. What's currently missing in the patch is to set the
> phy_id if the phy was scanned later after phy_device creation. For the mgcoge
> board it seems to solve our problem, but maybe I miss something important.
> 
> Best regards
> Holger Brunck
> 
> diff --git a/drivers/net/fs_enet/fs_enet-main.c b/drivers/net/fs_enet/fs_enet-main.c
> index ec2f503..6bc117f 100644
> --- a/drivers/net/fs_enet/fs_enet-main.c
> +++ b/drivers/net/fs_enet/fs_enet-main.c
> @@ -775,7 +774,8 @@ static int fs_enet_open(struct net_device *dev)
>  {
>         struct fs_enet_private *fep = netdev_priv(dev);
>         int r;
> -       int err;
> +       int err = 0;
> +       u32 phy_id = 0;
> 
>         /* to initialize the fep->cur_rx,... */
>         /* not doing this, will cause a crash in fs_enet_rx_napi */
> @@ -795,13 +795,23 @@ static int fs_enet_open(struct net_device *dev)
>                 return -EINVAL;
>         }
> 
> -       err = fs_init_phy(dev);
> -       if (err) {
> +       if (fep->phydev == NULL)
> +               err = fs_init_phy(dev);
> +
> +       if (!err && (fep->phydev->available == false))
> +               r = get_phy_id(fep->phydev->bus, fep->phydev->addr, &phy_id);
> +
> +       if (err || (phy_id == 0xffffffff)) {
>                 free_irq(fep->interrupt, dev);
>                 if (fep->fpi->use_napi)
>                         napi_disable(&fep->napi);
> -               return err;
> +               if (err)
> +                       return err;
> +               else
> +                       return -EINVAL;
>         }
> +       else
> +               fep->phydev->available = true;
>         phy_start(fep->phydev);
> 
>         netif_start_queue(dev);
> diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
> index adbc0fd..1f443cb 100644
> --- a/drivers/net/phy/phy_device.c
> +++ b/drivers/net/phy/phy_device.c
> @@ -173,6 +173,10 @@ struct phy_device* phy_device_create(struct mii_bus *bus, int addr, int phy_id)
>         dev->dev.bus = &mdio_bus_type;
>         dev->irq = bus->irq != NULL ? bus->irq[addr] : PHY_POLL;
>         dev_set_name(&dev->dev, PHY_ID_FMT, bus->id, addr);
> +       if (phy_id == 0xffffffff)
> +               dev->available = false;
> +       else
> +               dev->available = true;

This flag shouldn't be necessary.  Just check whether or not
phy_device->phy_id is sane at phy_attach_direct() time.  If it is
mostly f's, then don't attach.
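
A minimal user-space sketch of that check, assuming the same mask the quoted get_phy_device() hunk uses. `struct phy_stub`, `phy_id_is_sane()` and `phy_attach_allowed()` are hypothetical names for illustration only; they are not phylib functions.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

struct phy_stub {               /* hypothetical stand-in for struct phy_device */
	uint32_t phy_id;
};

/*
 * Return nonzero when phy_id looks like it came from a real device.
 * An id whose low 29 bits are all set is treated as "no device",
 * which is the "mostly Fs" test from the quoted hunk.
 */
static int phy_id_is_sane(uint32_t phy_id)
{
	return (phy_id & 0x1fffffffu) != 0x1fffffffu;
}

/* Refuse to attach when the stored id is not sane; no extra flag needed. */
static int phy_attach_allowed(const struct phy_stub *phy)
{
	assert(phy != NULL);
	return phy_id_is_sane(phy->phy_id) ? 0 : -ENODEV;
}
```

Deriving the answer from phy_id at attach time means there is no second piece of state that could drift out of sync with the id.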

> 
>         dev->state = PHY_DOWN;
> 
> @@ -232,13 +236,11 @@ struct phy_device * get_phy_device(struct mii_bus *bus, int addr)
>         int r;
> 
>         r = get_phy_id(bus, addr, &phy_id);
> -       if (r)
> -               return ERR_PTR(r);
> 
>         /* If the phy_id is mostly Fs, there is no device there */
> -       if ((phy_id & 0x1fffffff) == 0x1fffffff)
> -               return NULL;
> -
> +       if (((phy_id & 0x1fffffff) == 0x1fffffff) || r)
> +               phy_id = 0xffffffff;
> +       /* create phy even if the phy is currently not available */
>         dev = phy_device_create(bus, addr, phy_id);

Cannot do it this way because many phylib users probe the bus for phys
instead of the explicit creation used with the device tree.  There
needs to be a method to explicitly skip this test when creating a phy;
possibly by having the device tree code call phy_device_create()
directly.

Hmmm.... I see another problem.  Deferred probing of the phy will
potentially cause problems with module loading.  If the binding is
deferred to phy connect time; then the phy driver may not have time to
get loaded before the phy layer decides there is no driver and binds
it to the generic one.  Blech.

Okay, so it seems like a method of explicitly triggering a phy_device
rebind from userspace is necessary.  This could be done with a
per-phy_device sysfs file I suppose.  Just an empty file that when
read triggers a re-read of the phy id registers, and retries binding a
driver, including the request_module call in phy_device_create().

> 
>         return dev;
> diff --git a/include/linux/phy.h b/include/linux/phy.h
> index 6a7eb40..12dc3e4 100644
> --- a/include/linux/phy.h
> +++ b/include/linux/phy.h
> @@ -303,6 +303,9 @@ struct phy_device {
> 
>         int link_timeout;
> 
> +       /* Flag to support delayed availability */
> +       bool available;
> +
>         /*
>          * Interrupt number for this PHY
>          * -1 means no interrupt
> 

* [PATCH] Tarpit target for the last stable (2.6.35.7)
From: Nicola Padovano @ 2010-10-08 17:01 UTC (permalink / raw)
  To: netfilter-devel, netdev

Kernel module to capture and hold incoming TCP connections by sending a
zero-window packet, slowing down worm propagation.
More information and a newer version at:
http://npadovano.altervista.org/tarpit.html (or in the source code
comments).
Below you can find the patch and the related userspace library
(compile it and put the resulting .so in /lib/xtables).
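
The reply policy the module applies to each incoming segment can be summarised as a small decision function. This is a user-space sketch only: `enum tarpit_reply`, `struct tcp_flags` and `tarpit_classify()` are hypothetical names, and the checksum, fragment, broadcast and rate-limit checks done by the patch are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum tarpit_reply {
	TARPIT_NO_REPLY,      /* RST, FIN, or neither SYN nor ACK set */
	TARPIT_SYNACK,        /* accept the connection with a tiny window */
	TARPIT_RST,           /* SYN+ACK received: reset to prevent spoofing */
	TARPIT_ZERO_WINDOW,   /* plain ACK: answer with a 0 byte window */
};

struct tcp_flags {
	bool syn, ack, rst, fin;
};

/* Map the flags of a received segment to the reply the tarpit sends. */
static enum tarpit_reply tarpit_classify(const struct tcp_flags *f)
{
	assert(f != NULL);

	if (f->rst || f->fin)
		return TARPIT_NO_REPLY;     /* ignore attempts to close */
	if (f->syn && !f->ack)
		return TARPIT_SYNACK;
	if (f->syn && f->ack)
		return TARPIT_RST;
	if (f->ack)
		return TARPIT_ZERO_WINDOW;
	return TARPIT_NO_REPLY;
}
```

Because the decision depends only on the flags of the segment at hand, no per-connection state has to be kept, which is the property the description relies on.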

----------tarpit-2.6.35.7.patch------------
diff -rupN linux-2.6.35.7.orig/drivers/char/random.c linux-2.6.35.7/drivers/char/random.c
--- linux-2.6.35.7.orig/drivers/char/random.c	2010-09-29 03:09:08.000000000 +0200
+++ linux-2.6.35.7/drivers/char/random.c	2010-10-08 02:51:35.000000000 +0200
@@ -1514,6 +1514,8 @@ __u32 secure_ip_id(__be32 daddr)
 	return half_md4_transform(hash, keyptr->secret);
 }

+EXPORT_SYMBOL(secure_ip_id);
+
 #ifdef CONFIG_INET

 __u32 secure_tcp_sequence_number(__be32 saddr, __be32 daddr,
@@ -1551,6 +1553,8 @@ __u32 secure_tcp_sequence_number(__be32
 	return seq;
 }

+EXPORT_SYMBOL(secure_tcp_sequence_number);
+
 /* Generate secure starting point for ephemeral IPV4 transport port search */
 u32 secure_ipv4_port_ephemeral(__be32 saddr, __be32 daddr, __be16 dport)
 {
diff -rupN linux-2.6.35.7.orig/net/netfilter/Kconfig linux-2.6.35.7/net/netfilter/Kconfig
--- linux-2.6.35.7.orig/net/netfilter/Kconfig	2010-09-29 03:09:08.000000000 +0200
+++ linux-2.6.35.7/net/netfilter/Kconfig	2010-10-08 12:51:07.000000000 +0200
@@ -548,6 +548,32 @@ config NETFILTER_XT_TARGET_SECMARK

 	  To compile it as a module, choose M here.  If unsure, say N.

+config NETFILTER_XT_TARGET_TARPIT
+	tristate '"TARPIT" target support'
+	depends on NETFILTER_XTABLES
+	default m if NETFILTER_ADVANCED=n
+	---help---
+	  Adds a TARPIT target to iptables, which captures and holds
+	  incoming TCP connections using no local per-connection resources.
+	  Connections are accepted, but immediately switched to the persist
+	  state (0 byte window), in which the remote side stops sending data
+	  and asks to continue every 60-240 seconds (with a probe packet).
+	  Attempts to close the connection are ignored, forcing the remote
+	  side to time out the connection in 12-24 minutes. Any TCP port that
+	  you would normally DROP or REJECT can instead become a tarpit.
+	  Example:
+
+	  1) iptables -A INPUT -p tcp --dport 1111 -j TARPIT
+	  2) iptables -A FORWARD -p tcp -s x.y.z.k --dport 1111 -j TARPIT
+
+	  In the first example all the incoming TCP packets sent to the port
+	  1111 will be tarpitted. In the second one, we're tarpitting all the
+	  forwarded packets sent by the host x.y.z.k to the port 1111.
+
+	  You can find more information and newer versions at:
+	  <http://npadovano.altervista.org/tarpit.html>. Bug reports and
+	  improvements are welcome.
+
 config NETFILTER_XT_TARGET_TCPMSS
 	tristate '"TCPMSS" target support'
 	depends on (IPV6 || IPV6=n)
diff -rupN linux-2.6.35.7.orig/net/netfilter/Makefile linux-2.6.35.7/net/netfilter/Makefile
--- linux-2.6.35.7.orig/net/netfilter/Makefile	2010-09-29 03:09:08.000000000 +0200
+++ linux-2.6.35.7/net/netfilter/Makefile	2010-10-08 02:53:44.000000000 +0200
@@ -56,6 +56,7 @@ obj-$(CONFIG_NETFILTER_XT_TARGET_NFQUEUE
 obj-$(CONFIG_NETFILTER_XT_TARGET_NOTRACK) += xt_NOTRACK.o
 obj-$(CONFIG_NETFILTER_XT_TARGET_RATEEST) += xt_RATEEST.o
 obj-$(CONFIG_NETFILTER_XT_TARGET_SECMARK) += xt_SECMARK.o
+obj-$(CONFIG_NETFILTER_XT_TARGET_TARPIT) += xt_TARPIT.o
 obj-$(CONFIG_NETFILTER_XT_TARGET_TPROXY) += xt_TPROXY.o
 obj-$(CONFIG_NETFILTER_XT_TARGET_TCPMSS) += xt_TCPMSS.o
 obj-$(CONFIG_NETFILTER_XT_TARGET_TCPOPTSTRIP) += xt_TCPOPTSTRIP.o
diff -rupN linux-2.6.35.7.orig/net/netfilter/xt_TARPIT.c linux-2.6.35.7/net/netfilter/xt_TARPIT.c
--- linux-2.6.35.7.orig/net/netfilter/xt_TARPIT.c	1970-01-01 01:00:00.000000000 +0100
+++ linux-2.6.35.7/net/netfilter/xt_TARPIT.c	2010-10-08 13:27:54.000000000 +0200
@@ -0,0 +1,304 @@
+/*
+ * Kernel module to capture and hold incoming TCP connections using
+ * no local per-connection resources.
+ *
+ * Based on ipt_REJECT.c and offering functionality similar to
+ * LaBrea <http://labrea.sourceforge.net/>.
+ *
+ * Original idea of:
+ *     Aaron Hopkins, <tools@die.net>
+ *
+ * Version 2.6.35.7 kernel release, written by:
+ *     Nicola Padovano, <nicola.padovano@gmail.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+ *
+ *
+ * Goal:
+ * - Allow incoming TCP connections to be established.
+ * - Passing data should result in the connection being switched to the
+ *   persist state (0 byte window), in which the remote side stops sending
+ *   data and asks to continue every 60-240 seconds.
+ * - Prevent spoofing (don't allow the victim to connect with server worm)
+ * - Attempts to shut down the connection should be ignored completely, so
+ *   the remote side ends up having to time it out.
+ *
+ * This means:
+ * - Reply to TCP SYN,!ACK,!RST,!FIN with SYN-ACK, window 5 bytes
+ * - Reply to TCP !SYN,!RST,!FIN with ACK, window 0 bytes, rate-limited
+ * - Reply to TCP SYN,ACK,!RST,!FIN with RST to prevent spoofing
+ * - No reply to TCP RST or FIN
+ *
+ * More information and newer versions at:
+ * <http://npadovano.altervista.org/tarpit.html>
+ */
+
+#include <linux/version.h>
+#include <linux/module.h>
+#include <linux/skbuff.h>
+#include <linux/ip.h>
+#include <net/ip.h>
+#include <net/tcp.h>
+#include <net/icmp.h>
+#include <net/route.h>
+#include <linux/random.h>
+#include <linux/netfilter_ipv4/ip_tables.h>
+
+
+MODULE_DESCRIPTION("TARPIT iptables target, info at: "
+		    "http://npadovano.altervista.org/tarpit.html");
+MODULE_AUTHOR("Nicola Padovano <nicola.padovano@gmail.com>");
+MODULE_LICENSE("GPL");
+MODULE_ALIAS("xt_TARPIT");
+
+/*
+ * send a packet (to L2 output processing function)
+ *
+ * Note: there is also rtable* argument: it's only
+ *       used for debugging reason. Delete it if
+ *       you're sure things work.
+ */
+static int ip_direct_send(struct sk_buff* skb)
+{
+  /* get skb dst_entry pointer */
+  struct dst_entry* dst = skb_dst(skb);
+
+  /* if we already sent a packet to our neighbour */
+  if (dst->hh != NULL)
+    return neigh_hh_output(dst->hh,skb);
+
+  /* if there is no header cache, send our packet with the simple output function */
+  else if (dst->neighbour != NULL)
+    return dst->neighbour->output(skb);
+
+  /* no neighbour, no header cache: it's impossible to send the packet */
+  if (net_ratelimit())
+    printk(KERN_DEBUG "TARPIT> ip_direct_send: no neighbour, no header cache\n");
+
+  kfree_skb(skb);
+
+  return -EINVAL;
+}
+
+/*
+ * tarpit service routine: it accepts a pointer to
+ * sk_buff (the received packet) and its (input)
+ * cache entry.
+ * Note: the received packet it's called "old" in
+ *       the comments below. The packet to send
+ *       it's called "new".
+ */
+static void tarpit_tcp(const struct sk_buff *oskb,
+		       struct rtable *ort)
+{
+  struct iphdr    *oiph;             /* old ip header pointer */
+  struct tcphdr   *otcph;            /* old tcp header pointer */
+  unsigned int    oiphlen;           /* old ip header len */
+  unsigned int    otcplen;           /* old tcp (payload + header) len */
+
+  struct sk_buff  *nskb;             /* new skb buffer */
+  struct iphdr    *niph;             /* new ip header pointer */
+  struct tcphdr   *ntcph;            /* new tcp header pointer */
+  struct rtable   *nrt;              /* new routing table */
+
+  struct flowi fl = {};              /* search key used for the cache lookups */
+
+  u_int32_t tmp32;                   /* swapping variables*/
+  u_int16_t tmp16;
+
+  /* fill old structs */
+  oiph    = ip_hdr(oskb);            /* get old ip header pointer */
+  oiphlen = ip_hdrlen(oskb);
+  otcph   = (void*)oiph + oiphlen;   /* do not use skb_transport_header! */
+  otcplen = oskb->len - oiphlen;
+
+  /* packets to drop */
+  if ( (ort == NULL) ||                                 /* no input routing cache entry (checked before ort is used) */
+       (oskb->len < oiphlen + sizeof(struct tcphdr)) || /* truncated tcp header (first fragment too short) */
+       (otcph->fin || otcph->rst) ||                    /* RST or FIN packet */
+       (!otcph->ack && !otcph->syn) ||                  /* !ACK,!SYN packet */
+       (tcp_v4_check(otcplen, oiph->saddr, oiph->daddr, /* bad tcp checksum (incl. pseudo-header) */
+		    csum_partial((char *)otcph, otcplen, 0))) ||
+       (oiph->frag_off & htons(IP_OFFSET)) ||           /* fragment packets (after the first one) */
+       (!otcph->syn && otcph->ack && !xrlim_allow(&ort->u.dst,HZ)) || /* rate-limit answers to plain ACK packets */
+       (oskb->pkt_type != PACKET_HOST &&                /* no replies to physical multi/broadcast */
+        oskb->pkt_type != PACKET_OTHERHOST) ||
+       (ort->rt_flags & (RTCF_BROADCAST | RTCF_MULTICAST)) ) /* no replies to IP multi/broadcast */
+    return;
+
+
+  if (!(nskb = skb_copy(oskb, GFP_ATOMIC)))      /* start from a copy of the received packet */
+    return;                                      /* return if the copy failed */
+
+  /* This packet will not be the same as the other: clear nf fields */
+  nf_conntrack_put(nskb->nfct);
+  nskb->nfct = NULL;
+#ifdef CONFIG_NETFILTER_DEBUG
+  nskb->nf_debug = 0;
+#endif
+
+  skb_trim(nskb, ip_hdrlen(nskb) + sizeof(struct tcphdr));     /* trim skb size to IP LEN + TCP LEN*/
+  niph  = (struct iphdr*)skb_network_header(nskb);
+  ntcph = (void*)niph + ip_hdrlen(nskb);
+
+  /* set new ip header */
+  tmp32          = niph->saddr;
+  niph->saddr    = niph->daddr;
+  niph->daddr    = tmp32;
+
+  niph->tot_len  = htons(nskb->len);
+  niph->id       = htons(secure_ip_id(niph->daddr));
+
+  niph->frag_off = htons(IP_DF);            /* set don't frag flag */
+  niph->ttl      = (u32)64;
+
+  niph->check    = 0;
+  niph->check    = ip_fast_csum((unsigned char*)niph ,niph->ihl);
+
+  /* set new tcp header */
+  tmp16         = ntcph->source;
+  ntcph->source = ntcph->dest;
+  ntcph->dest   = tmp16;
+
+  ntcph->doff   = sizeof(struct tcphdr)/4;    /* 20bytes, doff is a 4bit variable */
+
+  ((u_int8_t*)ntcph)[13] = 0;                 /* reset all flag */
+  ntcph->urg_ptr = 0;
+
+   /* 1st goal: new tcp connection */
+  if (otcph->syn && !otcph->ack)
+    {
+      ntcph->syn = 1;
+      ntcph->ack = 1;
+      ntcph->seq = htonl(secure_tcp_sequence_number(niph->saddr,
+						    niph->daddr,
+						    ntcph->source,
+						    ntcph->dest));
+      ntcph->ack_seq = htonl(ntohl(otcph->seq) + 1);
+      ntcph->window = htons(5);
+    }
+
+  /* 2nd goal: RST tcp: RST+ACK */
+  else if (otcph->syn && otcph->ack)
+    {
+      ntcph->syn = 0;
+      ntcph->ack = 1;
+      ntcph->rst = 1;
+
+      ntcph->seq = otcph->ack_seq;
+      ntcph->ack_seq = 0;
+    }
+
+  /* 3rd goal: 0 byte window */
+  else if (!otcph->syn && otcph->ack)
+    {
+      ntcph->syn = 0;
+      ntcph->ack = 1;
+      ntcph->window = 0;
+      ntcph->seq = otcph->ack_seq;
+      ntcph->ack_seq = otcph->seq;   /* it's saying: "I have not received anything" */
+    }
+  else                               /* neither SYN nor ACK: free the copy and drop it */
+    goto free_nskb;
+
+  ntcph->check = 0;
+  ntcph->check = tcp_v4_check(sizeof(struct tcphdr),
+			      niph->saddr,
+			      niph->daddr,
+			      csum_partial((char *)ntcph,
+			      sizeof(struct tcphdr), 0));
+
+
+  /* create a search key to find packet route in cache*/
+  fl.nl_u.ip4_u.daddr = niph->daddr;
+  fl.nl_u.ip4_u.saddr = 0;
+  fl.nl_u.ip4_u.tos = RT_TOS(niph->tos) | RTO_CONN;
+  fl.oif = 0;
+
+  if (ip_route_output_key(&init_net, &nrt, &fl))             /* fill nrt struct */
+    goto free_nskb;                                          /* exit if errors in filling*/
+
+  dst_release(skb_dst(nskb));
+  nskb->_skb_refdst = ((unsigned long)&(nrt->u.dst));
+
+  ip_direct_send(nskb);                                      /* send the packet */
+
+  return;
+
+ free_nskb:
+  kfree_skb(nskb);
+}
+
+
+/*
+ * target function, called every time the rule is satisfied
+ * standard behaviour: NF_DROP
+ */
+static unsigned int xt_tarpit_target(struct sk_buff *skb,
+                                     const struct xt_target_param *par)
+{
+  struct rtable *rt = (void *)skb->_skb_refdst;
+  tarpit_tcp(skb,rt);
+  return NF_DROP;
+}
+
+/*
+ * xt_tarpit_check allows only:
+ * 1. raw table & PRE_ROUTING hook or
+ * 2. filter table & (LOCAL_IN or FORWARD) hook
+ * Note: for new kernels version: returns _false_
+ *       if checking is OK, otherwise returns _true_
+ */
+static bool xt_tarpit_check(const struct xt_mtchk_param *par)
+{
+  if (!strcmp(par->table, "raw") &&
+      par->hook_mask == NF_INET_PRE_ROUTING)
+    return false;
+
+  if (strcmp(par->table, "filter"))
+    return true;
+
+  return (par->hook_mask &  ~((1 << NF_INET_LOCAL_IN) |
+	                      (1 << NF_INET_FORWARD)));
+}
+
+
+static struct xt_target xt_tar_reg = {
+  .name       = "TARPIT",            /* target name */
+  .family     = AF_INET,             /* level 3 protocol */
+  .proto      = IPPROTO_TCP,         /* we recognize only tcp protocol */
+  .target     = xt_tarpit_target,    /* pointer to target function */
+  .checkentry = xt_tarpit_check,     /* pointer to check-entry function */
+  .me         = THIS_MODULE,
+};
+
+/*
+ * initing module function
+ */
+static int __init xt_tarpit_init(void)
+{
+  return xt_register_target(&xt_tar_reg);
+}
+
+/*
+ * delete module
+ */
+static void __exit xt_tarpit_exit(void)
+{
+  xt_unregister_target(&xt_tar_reg);
+}
+
+module_init(xt_tarpit_init);
+module_exit(xt_tarpit_exit);


-------------libxt_TARPIT.c------------
#include <stdio.h>
#include <getopt.h>
#include <xtables.h>

static void tarpit_tg_help(void)
{
	printf("TARPIT takes no options\n\n");
}

static int tarpit_tg_parse(int c, char **argv, int invert, unsigned int *flags,
                           const void *entry, struct xt_entry_target **target)
{
	return 0;
}

static void tarpit_tg_check(unsigned int flags)
{
}

static struct xtables_target tarpit_tg_reg = {
	.version       = XTABLES_VERSION,
	.name          = "TARPIT",
	.family        = AF_INET,
	.help          = tarpit_tg_help,
	.parse         = tarpit_tg_parse,
	.final_check   = tarpit_tg_check,
};

static __attribute__((constructor)) void tarpit_tg_ldr(void)
{
	xtables_register_target(&tarpit_tg_reg);
}

-- 
Nicola Padovano
e-mail: nicola.padovano@gmail.com
web: http://npadovano.altervista.org

"My only ambition is not be anything at all; it seems the most
sensible thing" (C. Bukowski)

^ permalink raw reply

* Re: BUG ? ipip unregister_netdevice_many()
From: Eric W. Biederman @ 2010-10-08 16:58 UTC (permalink / raw)
  To: Daniel Lezcano; +Cc: Hans Schillstrom, netdev@vger.kernel.org
In-Reply-To: <4CAF4408.6070001@free.fr>

Daniel Lezcano <daniel.lezcano@free.fr> writes:

> On 10/08/2010 05:53 PM, Daniel Lezcano wrote:
>> On 10/08/2010 02:28 PM, Hans Schillstrom wrote:
>>> Hi Eric,
>>> Any advice how to trace this down ?
>>> This rollback_registered_many() seems to have been on the lists before...
>>> All IPv4 and IPv6 tunnels causes this crash, all you have to do is
>>> load the tunnel module(s)
>>> enter a new ns and exit from it.
>>>
>>> Have not tested any more devices than tunnels,
>>> I did an "ip link delete" on my macvlans before exiting the ns.
>>
>> Ah! I succeeded in reproducing it.
>> In fact, it does not appear immediately.
>>
>> I am trying to simplify the configuration, but I keep running into the bug I
>> talked about in the previous email.
>
> Ok, so after investigating, we just need a macvlan and specify an ipv6 address
> for it (inside a new netns of course), and the loopback is not released. I
> compiled out the tunnels, so they are not related to this problem I think.
>
> That reduces the scope of investigation :)

Does this reproduce the unable-to-free-netdevice problem?  Or the bad
pointer in macvlan_close problem?

Eric

^ permalink raw reply

* Re: BUG ? ipip unregister_netdevice_many()
From: Eric W. Biederman @ 2010-10-08 16:51 UTC (permalink / raw)
  To: Daniel Lezcano; +Cc: Hans Schillstrom, netdev@vger.kernel.org
In-Reply-To: <4CAEFE2C.3010007@free.fr>

Daniel Lezcano <daniel.lezcano@free.fr> writes:

> On 10/07/2010 10:48 AM, Hans Schillstrom wrote:
>> Hello
>> I'm trying to exit a network name space and it doesn't work (or am I doing something wrong?)
>> The only netdevices left are lo and the tunnels ip6tnl0, sit0 and tunl0 when exiting netns.
>>
>> A netns is created by lxc-execute with two interfaces eth0 eth1 (macvlan)
>> (see conf file at the end)
>>
>> Kernel: net-next-2.6 top from 4 october 2010
>>    
>
> Hi Hans,
>
> I tried to reproduce your problem but I just get a big kernel crash when exiting
> the container :/
>
> The stack is different but it may be related to the same problem.

Double ouch.  My guess is that this is more related to the recent macvlan
changes.

It looks like there is plenty of debugging work to do :(

Ouch Ouch Ouch!

Eric

^ permalink raw reply

* Re: BUG ? ipip unregister_netdevice_many()
From: Eric W. Biederman @ 2010-10-08 16:45 UTC (permalink / raw)
  To: Hans Schillstrom; +Cc: Daniel Lezcano, netdev@vger.kernel.org
In-Reply-To: <201010081428.37639.hans.schillstrom@ericsson.com>

Hans Schillstrom <hans.schillstrom@ericsson.com> writes:

> Hi Eric,
> Any advice how to trace this down ?
> This rollback_registered_many() seems to have been on the lists before...

That is just the core piece that does device registration.

> All IPv4 and IPv6 tunnels causes this crash, all you have to do is load the tunnel module(s)
> enter a new ns and exit from it.

Ouch.  

> Have not tested any more devices than tunnels, 
> I did an "ip link delete" on my macvlans before exiting the ns.

My hunch is that we have dst entry problems, as I know those hop network
interfaces when we destroy network devices, but I have seen weird issues
with the route cache as well.

Grumble.

Eric

^ permalink raw reply

* [PATCH net-next] net dst: use a percpu_counter to track entries
From: Eric Dumazet @ 2010-10-08 16:37 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

struct dst_ops tracks the number of allocated dst entries in an atomic_t
field, which is subject to high cache line contention under stress workloads.

Switch to a percpu_counter to reduce the number of times we need to dirty a
central location. Place it on a separate cache line to avoid dirtying
read-only fields.
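
To illustrate the idea for readers who have not used percpu_counter, here is
a single-threaded user-space sketch of a batched counter. It is not the kernel
implementation: the names (batched_counter, bc_add, bc_read_fast,
bc_read_slow) and the NR_SLOTS and BATCH values are hypothetical, and there is
no locking or real per-CPU storage. It only shows why the cheap read is
approximate while the summing read is exact.

```c
#include <assert.h>
#include <stddef.h>

#define NR_SLOTS 4	/* hypothetical number of CPUs */
#define BATCH    32	/* hypothetical flush threshold */

struct batched_counter {
	long global;		/* shared total, touched rarely */
	long local[NR_SLOTS];	/* per-CPU deltas, touched on every update */
};

static void bc_init(struct batched_counter *c)
{
	c->global = 0;
	for (size_t i = 0; i < NR_SLOTS; i++)
		c->local[i] = 0;
}

/* Add val on behalf of one CPU; fold the local delta into the shared
 * total only once its magnitude reaches BATCH. */
static void bc_add(struct batched_counter *c, size_t cpu, long val)
{
	long delta;

	assert(cpu < NR_SLOTS);
	delta = c->local[cpu] + val;
	if (delta >= BATCH || delta <= -BATCH) {
		c->global += delta;
		c->local[cpu] = 0;
	} else {
		c->local[cpu] = delta;
	}
}

/* Cheap, approximate read: shared total only, clamped at zero. */
static long bc_read_fast(const struct batched_counter *c)
{
	return c->global > 0 ? c->global : 0;
}

/* Exact read: shared total plus every local delta, clamped at zero. */
static long bc_read_slow(const struct batched_counter *c)
{
	long sum = c->global;

	for (size_t i = 0; i < NR_SLOTS; i++)
		sum += c->local[i];
	return sum > 0 ? sum : 0;
}
```

In this sketch the fast read can lag the true value by up to
NR_SLOTS * (BATCH - 1) in either direction, which is why a threshold decision
that matters would re-check with the slow read.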

Stress test :

(Sending 160.000.000 UDP frames,
IP route cache disabled, dual E5540 @2.53GHz,
32bit kernel, FIB_TRIE, SLUB/NUMA)

Before:

real    0m51.179s
user    0m15.329s
sys     10m15.942s

After:

real	0m45.570s
user	0m15.525s
sys	9m56.669s

With a small reordering of struct neighbour fields (to separate refcnt from
other read-mostly fields), which is the subject of a following patch:

real	0m41.841s
user	0m15.261s
sys	8m45.949s

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/net/dst_ops.h     |   37 +++++++++++++++++++++++++++++++++++-
 net/bridge/br_netfilter.c |   11 ++++++++--
 net/core/dst.c            |    6 ++---
 net/decnet/dn_route.c     |    3 +-
 net/ipv4/route.c          |   36 +++++++++++++++++++++--------------
 net/ipv4/xfrm4_policy.c   |    4 +--
 net/ipv6/route.c          |   28 +++++++++++++++++++--------
 net/ipv6/xfrm6_policy.c   |   10 +++++----
 8 files changed, 100 insertions(+), 35 deletions(-)

diff --git a/include/net/dst_ops.h b/include/net/dst_ops.h
index d1ff9b7..1fa5306 100644
--- a/include/net/dst_ops.h
+++ b/include/net/dst_ops.h
@@ -1,6 +1,7 @@
 #ifndef _NET_DST_OPS_H
 #define _NET_DST_OPS_H
 #include <linux/types.h>
+#include <linux/percpu_counter.h>
 
 struct dst_entry;
 struct kmem_cachep;
@@ -22,7 +23,41 @@ struct dst_ops {
 	void			(*update_pmtu)(struct dst_entry *dst, u32 mtu);
 	int			(*local_out)(struct sk_buff *skb);
 
-	atomic_t		entries;
 	struct kmem_cache	*kmem_cachep;
+
+	struct percpu_counter	pcpuc_entries ____cacheline_aligned_in_smp;
 };
+
+static inline int dst_entries_get_fast(struct dst_ops *dst)
+{
+	return percpu_counter_read_positive(&dst->pcpuc_entries);
+}
+
+static inline int dst_entries_get_slow(struct dst_ops *dst)
+{
+	int res;
+
+	local_bh_disable();
+	res = percpu_counter_sum_positive(&dst->pcpuc_entries);
+	local_bh_enable();
+	return res;
+}
+
+static inline void dst_entries_add(struct dst_ops *dst, int val)
+{
+	local_bh_disable();
+	percpu_counter_add(&dst->pcpuc_entries, val);
+	local_bh_enable();
+}
+
+static inline int dst_entries_init(struct dst_ops *dst)
+{
+	return percpu_counter_init(&dst->pcpuc_entries, 0);
+}
+
+static inline void dst_entries_destroy(struct dst_ops *dst)
+{
+	percpu_counter_destroy(&dst->pcpuc_entries);
+}
+
 #endif
diff --git a/net/bridge/br_netfilter.c b/net/bridge/br_netfilter.c
index 77f7b5f..7f9ce96 100644
--- a/net/bridge/br_netfilter.c
+++ b/net/bridge/br_netfilter.c
@@ -106,7 +106,6 @@ static struct dst_ops fake_dst_ops = {
 	.family =		AF_INET,
 	.protocol =		cpu_to_be16(ETH_P_IP),
 	.update_pmtu =		fake_update_pmtu,
-	.entries =		ATOMIC_INIT(0),
 };
 
 /*
@@ -1003,15 +1002,22 @@ int __init br_netfilter_init(void)
 {
 	int ret;
 
-	ret = nf_register_hooks(br_nf_ops, ARRAY_SIZE(br_nf_ops));
+	ret = dst_entries_init(&fake_dst_ops);
 	if (ret < 0)
 		return ret;
+
+	ret = nf_register_hooks(br_nf_ops, ARRAY_SIZE(br_nf_ops));
+	if (ret < 0) {
+		dst_entries_destroy(&fake_dst_ops);
+		return ret;
+	}
 #ifdef CONFIG_SYSCTL
 	brnf_sysctl_header = register_sysctl_paths(brnf_path, brnf_table);
 	if (brnf_sysctl_header == NULL) {
 		printk(KERN_WARNING
 		       "br_netfilter: can't register to sysctl.\n");
 		nf_unregister_hooks(br_nf_ops, ARRAY_SIZE(br_nf_ops));
+		dst_entries_destroy(&fake_dst_ops);
 		return -ENOMEM;
 	}
 #endif
@@ -1025,4 +1031,5 @@ void br_netfilter_fini(void)
 #ifdef CONFIG_SYSCTL
 	unregister_sysctl_table(brnf_sysctl_header);
 #endif
+	dst_entries_destroy(&fake_dst_ops);
 }
diff --git a/net/core/dst.c b/net/core/dst.c
index 6c41b1f..eebcf4f 100644
--- a/net/core/dst.c
+++ b/net/core/dst.c
@@ -168,7 +168,7 @@ void *dst_alloc(struct dst_ops *ops)
 {
 	struct dst_entry *dst;
 
-	if (ops->gc && atomic_read(&ops->entries) > ops->gc_thresh) {
+	if (ops->gc && dst_entries_get_fast(ops) > ops->gc_thresh) {
 		if (ops->gc(ops))
 			return NULL;
 	}
@@ -183,7 +183,7 @@ void *dst_alloc(struct dst_ops *ops)
 #if RT_CACHE_DEBUG >= 2
 	atomic_inc(&dst_total);
 #endif
-	atomic_inc(&ops->entries);
+	dst_entries_add(ops, 1);
 	return dst;
 }
 EXPORT_SYMBOL(dst_alloc);
@@ -236,7 +236,7 @@ again:
 		neigh_release(neigh);
 	}
 
-	atomic_dec(&dst->ops->entries);
+	dst_entries_add(dst->ops, -1);
 
 	if (dst->ops->destroy)
 		dst->ops->destroy(dst);
diff --git a/net/decnet/dn_route.c b/net/decnet/dn_route.c
index 6585ea6..df0f3e5 100644
--- a/net/decnet/dn_route.c
+++ b/net/decnet/dn_route.c
@@ -132,7 +132,6 @@ static struct dst_ops dn_dst_ops = {
 	.negative_advice =	dn_dst_negative_advice,
 	.link_failure =		dn_dst_link_failure,
 	.update_pmtu =		dn_dst_update_pmtu,
-	.entries =		ATOMIC_INIT(0),
 };
 
 static __inline__ unsigned dn_hash(__le16 src, __le16 dst)
@@ -1758,6 +1757,7 @@ void __init dn_route_init(void)
 	dn_dst_ops.kmem_cachep =
 		kmem_cache_create("dn_dst_cache", sizeof(struct dn_route), 0,
 				  SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL);
+	dst_entries_init(&dn_dst_ops);
 	setup_timer(&dn_route_timer, dn_dst_check_expire, 0);
 	dn_route_timer.expires = jiffies + decnet_dst_gc_interval * HZ;
 	add_timer(&dn_route_timer);
@@ -1816,5 +1816,6 @@ void __exit dn_route_cleanup(void)
 	dn_run_flush(0);
 
 	proc_net_remove(&init_net, "decnet_cache");
+	dst_entries_destroy(&dn_dst_ops);
 }
 
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 7864d0c..831c6d4 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -159,7 +159,6 @@ static struct dst_ops ipv4_dst_ops = {
 	.link_failure =		ipv4_link_failure,
 	.update_pmtu =		ip_rt_update_pmtu,
 	.local_out =		__ip_local_out,
-	.entries =		ATOMIC_INIT(0),
 };
 
 #define ECN_OR_COST(class)	TC_PRIO_##class
@@ -466,7 +465,7 @@ static int rt_cpu_seq_show(struct seq_file *seq, void *v)
 
 	seq_printf(seq,"%08x  %08x %08x %08x %08x %08x %08x %08x "
 		   " %08x %08x %08x %08x %08x %08x %08x %08x %08x \n",
-		   atomic_read(&ipv4_dst_ops.entries),
+		   dst_entries_get_slow(&ipv4_dst_ops),
 		   st->in_hit,
 		   st->in_slow_tot,
 		   st->in_slow_mc,
@@ -945,6 +944,7 @@ static int rt_garbage_collect(struct dst_ops *ops)
 	struct rtable *rth, **rthp;
 	unsigned long now = jiffies;
 	int goal;
+	int entries = dst_entries_get_fast(&ipv4_dst_ops);
 
 	/*
 	 * Garbage collection is pretty expensive,
@@ -954,28 +954,28 @@ static int rt_garbage_collect(struct dst_ops *ops)
 	RT_CACHE_STAT_INC(gc_total);
 
 	if (now - last_gc < ip_rt_gc_min_interval &&
-	    atomic_read(&ipv4_dst_ops.entries) < ip_rt_max_size) {
+	    entries < ip_rt_max_size) {
 		RT_CACHE_STAT_INC(gc_ignored);
 		goto out;
 	}
 
+	entries = dst_entries_get_slow(&ipv4_dst_ops);
 	/* Calculate number of entries, which we want to expire now. */
-	goal = atomic_read(&ipv4_dst_ops.entries) -
-		(ip_rt_gc_elasticity << rt_hash_log);
+	goal = entries - (ip_rt_gc_elasticity << rt_hash_log);
 	if (goal <= 0) {
 		if (equilibrium < ipv4_dst_ops.gc_thresh)
 			equilibrium = ipv4_dst_ops.gc_thresh;
-		goal = atomic_read(&ipv4_dst_ops.entries) - equilibrium;
+		goal = entries - equilibrium;
 		if (goal > 0) {
 			equilibrium += min_t(unsigned int, goal >> 1, rt_hash_mask + 1);
-			goal = atomic_read(&ipv4_dst_ops.entries) - equilibrium;
+			goal = entries - equilibrium;
 		}
 	} else {
 		/* We are in dangerous area. Try to reduce cache really
 		 * aggressively.
 		 */
 		goal = max_t(unsigned int, goal >> 1, rt_hash_mask + 1);
-		equilibrium = atomic_read(&ipv4_dst_ops.entries) - goal;
+		equilibrium = entries - goal;
 	}
 
 	if (now - last_gc >= ip_rt_gc_min_interval)
@@ -1032,14 +1032,16 @@ static int rt_garbage_collect(struct dst_ops *ops)
 		expire >>= 1;
 #if RT_CACHE_DEBUG >= 2
 		printk(KERN_DEBUG "expire>> %u %d %d %d\n", expire,
-				atomic_read(&ipv4_dst_ops.entries), goal, i);
+				dst_entries_get_fast(&ipv4_dst_ops), goal, i);
 #endif
 
-		if (atomic_read(&ipv4_dst_ops.entries) < ip_rt_max_size)
+		if (dst_entries_get_fast(&ipv4_dst_ops) < ip_rt_max_size)
 			goto out;
 	} while (!in_softirq() && time_before_eq(jiffies, now));
 
-	if (atomic_read(&ipv4_dst_ops.entries) < ip_rt_max_size)
+	if (dst_entries_get_fast(&ipv4_dst_ops) < ip_rt_max_size)
+		goto out;
+	if (dst_entries_get_slow(&ipv4_dst_ops) < ip_rt_max_size)
 		goto out;
 	if (net_ratelimit())
 		printk(KERN_WARNING "dst cache overflow\n");
@@ -1049,11 +1051,12 @@ static int rt_garbage_collect(struct dst_ops *ops)
 work_done:
 	expire += ip_rt_gc_min_interval;
 	if (expire > ip_rt_gc_timeout ||
-	    atomic_read(&ipv4_dst_ops.entries) < ipv4_dst_ops.gc_thresh)
+	    dst_entries_get_fast(&ipv4_dst_ops) < ipv4_dst_ops.gc_thresh ||
+	    dst_entries_get_slow(&ipv4_dst_ops) < ipv4_dst_ops.gc_thresh)
 		expire = ip_rt_gc_timeout;
 #if RT_CACHE_DEBUG >= 2
 	printk(KERN_DEBUG "expire++ %u %d %d %d\n", expire,
-			atomic_read(&ipv4_dst_ops.entries), goal, rover);
+			dst_entries_get_fast(&ipv4_dst_ops), goal, rover);
 #endif
 out:	return 0;
 }
@@ -2719,7 +2722,6 @@ static struct dst_ops ipv4_dst_blackhole_ops = {
 	.destroy		=	ipv4_dst_destroy,
 	.check			=	ipv4_blackhole_dst_check,
 	.update_pmtu		=	ipv4_rt_blackhole_update_pmtu,
-	.entries		=	ATOMIC_INIT(0),
 };
 
 
@@ -3289,6 +3291,12 @@ int __init ip_rt_init(void)
 
 	ipv4_dst_blackhole_ops.kmem_cachep = ipv4_dst_ops.kmem_cachep;
 
+	if (dst_entries_init(&ipv4_dst_ops) < 0)
+		panic("IP: failed to allocate ipv4_dst_ops counter\n");
+
+	if (dst_entries_init(&ipv4_dst_blackhole_ops) < 0)
+		panic("IP: failed to allocate ipv4_dst_blackhole_ops counter\n");
+
 	rt_hash_table = (struct rt_hash_bucket *)
 		alloc_large_system_hash("IP route cache",
 					sizeof(struct rt_hash_bucket),
diff --git a/net/ipv4/xfrm4_policy.c b/net/ipv4/xfrm4_policy.c
index a580349..4464f3b 100644
--- a/net/ipv4/xfrm4_policy.c
+++ b/net/ipv4/xfrm4_policy.c
@@ -174,7 +174,7 @@ static inline int xfrm4_garbage_collect(struct dst_ops *ops)
 	struct net *net = container_of(ops, struct net, xfrm.xfrm4_dst_ops);
 
 	xfrm4_policy_afinfo.garbage_collect(net);
-	return (atomic_read(&ops->entries) > ops->gc_thresh * 2);
+	return (dst_entries_get_slow(ops) > ops->gc_thresh * 2);
 }
 
 static void xfrm4_update_pmtu(struct dst_entry *dst, u32 mtu)
@@ -232,7 +232,6 @@ static struct dst_ops xfrm4_dst_ops = {
 	.ifdown =		xfrm4_dst_ifdown,
 	.local_out =		__ip_local_out,
 	.gc_thresh =		1024,
-	.entries =		ATOMIC_INIT(0),
 };
 
 static struct xfrm_policy_afinfo xfrm4_policy_afinfo = {
@@ -288,6 +287,7 @@ void __init xfrm4_init(int rt_max_size)
 	 * and start cleaning when were 1/2 full
 	 */
 	xfrm4_dst_ops.gc_thresh = rt_max_size/2;
+	dst_entries_init(&xfrm4_dst_ops);
 
 	xfrm4_state_init();
 	xfrm4_policy_init();
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 17e2179..25661f9 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -109,7 +109,6 @@ static struct dst_ops ip6_dst_ops_template = {
 	.link_failure		=	ip6_link_failure,
 	.update_pmtu		=	ip6_rt_update_pmtu,
 	.local_out		=	__ip6_local_out,
-	.entries		=	ATOMIC_INIT(0),
 };
 
 static void ip6_rt_blackhole_update_pmtu(struct dst_entry *dst, u32 mtu)
@@ -122,7 +121,6 @@ static struct dst_ops ip6_dst_blackhole_ops = {
 	.destroy		=	ip6_dst_destroy,
 	.check			=	ip6_dst_check,
 	.update_pmtu		=	ip6_rt_blackhole_update_pmtu,
-	.entries		=	ATOMIC_INIT(0),
 };
 
 static struct rt6_info ip6_null_entry_template = {
@@ -1058,19 +1056,22 @@ static int ip6_dst_gc(struct dst_ops *ops)
 	int rt_elasticity = net->ipv6.sysctl.ip6_rt_gc_elasticity;
 	int rt_gc_timeout = net->ipv6.sysctl.ip6_rt_gc_timeout;
 	unsigned long rt_last_gc = net->ipv6.ip6_rt_last_gc;
+	int entries;
 
+	entries = dst_entries_get_fast(ops);
 	if (time_after(rt_last_gc + rt_min_interval, now) &&
-	    atomic_read(&ops->entries) <= rt_max_size)
+	    entries <= rt_max_size)
 		goto out;
 
 	net->ipv6.ip6_rt_gc_expire++;
 	fib6_run_gc(net->ipv6.ip6_rt_gc_expire, net);
 	net->ipv6.ip6_rt_last_gc = now;
-	if (atomic_read(&ops->entries) < ops->gc_thresh)
+	entries = dst_entries_get_slow(ops);
+	if (entries < ops->gc_thresh)
 		net->ipv6.ip6_rt_gc_expire = rt_gc_timeout>>1;
 out:
 	net->ipv6.ip6_rt_gc_expire -= net->ipv6.ip6_rt_gc_expire>>rt_elasticity;
-	return atomic_read(&ops->entries) > rt_max_size;
+	return entries > rt_max_size;
 }
 
 /* Clean host part of a prefix. Not necessary in radix tree,
@@ -2524,7 +2525,7 @@ static int rt6_stats_seq_show(struct seq_file *seq, void *v)
 		   net->ipv6.rt6_stats->fib_rt_alloc,
 		   net->ipv6.rt6_stats->fib_rt_entries,
 		   net->ipv6.rt6_stats->fib_rt_cache,
-		   atomic_read(&net->ipv6.ip6_dst_ops.entries),
+		   dst_entries_get_slow(&net->ipv6.ip6_dst_ops),
 		   net->ipv6.rt6_stats->fib_discarded_routes);
 
 	return 0;
@@ -2666,11 +2667,14 @@ static int __net_init ip6_route_net_init(struct net *net)
 	memcpy(&net->ipv6.ip6_dst_ops, &ip6_dst_ops_template,
 	       sizeof(net->ipv6.ip6_dst_ops));
 
+	if (dst_entries_init(&net->ipv6.ip6_dst_ops) < 0)
+		goto out_ip6_dst_ops;
+
 	net->ipv6.ip6_null_entry = kmemdup(&ip6_null_entry_template,
 					   sizeof(*net->ipv6.ip6_null_entry),
 					   GFP_KERNEL);
 	if (!net->ipv6.ip6_null_entry)
-		goto out_ip6_dst_ops;
+		goto out_ip6_dst_entries;
 	net->ipv6.ip6_null_entry->dst.path =
 		(struct dst_entry *)net->ipv6.ip6_null_entry;
 	net->ipv6.ip6_null_entry->dst.ops = &net->ipv6.ip6_dst_ops;
@@ -2720,6 +2724,8 @@ out_ip6_prohibit_entry:
 out_ip6_null_entry:
 	kfree(net->ipv6.ip6_null_entry);
 #endif
+out_ip6_dst_entries:
+	dst_entries_destroy(&net->ipv6.ip6_dst_ops);
 out_ip6_dst_ops:
 	goto out;
 }
@@ -2758,10 +2764,14 @@ int __init ip6_route_init(void)
 	if (!ip6_dst_ops_template.kmem_cachep)
 		goto out;
 
-	ret = register_pernet_subsys(&ip6_route_net_ops);
+	ret = dst_entries_init(&ip6_dst_blackhole_ops);
 	if (ret)
 		goto out_kmem_cache;
 
+	ret = register_pernet_subsys(&ip6_route_net_ops);
+	if (ret)
+		goto out_dst_entries;
+
 	ip6_dst_blackhole_ops.kmem_cachep = ip6_dst_ops_template.kmem_cachep;
 
 	/* Registering of the loopback is done before this portion of code,
@@ -2808,6 +2818,8 @@ out_fib6_init:
 	fib6_gc_cleanup();
 out_register_subsys:
 	unregister_pernet_subsys(&ip6_route_net_ops);
+out_dst_entries:
+	dst_entries_destroy(&ip6_dst_blackhole_ops);
 out_kmem_cache:
 	kmem_cache_destroy(ip6_dst_ops_template.kmem_cachep);
 	goto out;
diff --git a/net/ipv6/xfrm6_policy.c b/net/ipv6/xfrm6_policy.c
index 39676ea..7e74023 100644
--- a/net/ipv6/xfrm6_policy.c
+++ b/net/ipv6/xfrm6_policy.c
@@ -199,7 +199,7 @@ static inline int xfrm6_garbage_collect(struct dst_ops *ops)
 	struct net *net = container_of(ops, struct net, xfrm.xfrm6_dst_ops);
 
 	xfrm6_policy_afinfo.garbage_collect(net);
-	return atomic_read(&ops->entries) > ops->gc_thresh * 2;
+	return dst_entries_get_fast(ops) > ops->gc_thresh * 2;
 }
 
 static void xfrm6_update_pmtu(struct dst_entry *dst, u32 mtu)
@@ -255,7 +255,6 @@ static struct dst_ops xfrm6_dst_ops = {
 	.ifdown =		xfrm6_dst_ifdown,
 	.local_out =		__ip6_local_out,
 	.gc_thresh =		1024,
-	.entries =		ATOMIC_INIT(0),
 };
 
 static struct xfrm_policy_afinfo xfrm6_policy_afinfo = {
@@ -312,11 +311,13 @@ int __init xfrm6_init(void)
 	 */
 	gc_thresh = FIB6_TABLE_HASHSZ * 8;
 	xfrm6_dst_ops.gc_thresh = (gc_thresh < 1024) ? 1024 : gc_thresh;
+	dst_entries_init(&xfrm6_dst_ops);
 
 	ret = xfrm6_policy_init();
-	if (ret)
+	if (ret) {
+		dst_entries_destroy(&xfrm6_dst_ops);
 		goto out;
-
+	}
 	ret = xfrm6_state_init();
 	if (ret)
 		goto out_policy;
@@ -341,4 +342,5 @@ void xfrm6_fini(void)
 	//xfrm6_input_fini();
 	xfrm6_policy_fini();
 	xfrm6_state_fini();
+	dst_entries_destroy(&xfrm6_dst_ops);
 }



^ permalink raw reply related

* Re: IPv4: sysctl table check failed [was: mmotm 2010-10-07-14-08 uploaded]
From: Américo Wang @ 2010-10-08 16:30 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Eric Dumazet, Jiri Slaby, linux-kernel, mm-commits, ML netdev,
	David S. Miller, Eric W. Biederman
In-Reply-To: <20101007152806.119d1522.akpm@linux-foundation.org>

On Thu, Oct 07, 2010 at 03:28:06PM -0700, Andrew Morton wrote:
>On Fri, 08 Oct 2010 00:22:15 +0200
>Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
>> On Friday, 08 October 2010 at 00:06 +0200, Jiri Slaby wrote:
>> > On 10/07/2010 11:08 PM, akpm@linux-foundation.org wrote:
>> > > The mm-of-the-moment snapshot 2010-10-07-14-08 has been uploaded to
>> > 
>> > Hi, I got bunch of "sysctl table check failed" below. All seem to be
>> > related to ipv4:
>> 
>> I would say, sysctl check is buggy :(
>> 
>> min/max are optional
>> 
>> [PATCH] sysctl: min/max bounds are optional
>> 
>> sysctl check complains when proc_doulongvec_minmax or
>> proc_doulongvec_ms_jiffies_minmax are used by a vector of longs (with
>> more than one element), with no min or max value specified.
>> 
>> This is unexpected, given we had a bug on this min/max handling :)
>> 
>> Reported-by: Jiri Slaby <jirislaby@gmail.com>
>> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
>> ---
>>  kernel/sysctl_check.c |    9 ---------
>>  1 file changed, 9 deletions(-)
>> 
>> diff --git a/kernel/sysctl_check.c b/kernel/sysctl_check.c
>> index 04cdcf7..10b90d8 100644
>> --- a/kernel/sysctl_check.c
>> +++ b/kernel/sysctl_check.c
>> @@ -143,15 +143,6 @@ int sysctl_check_table(struct nsproxy *namespaces, struct ctl_table *table)
>>  				if (!table->maxlen)
>>  					set_fail(&fail, table, "No maxlen");
>>  			}
>> -			if ((table->proc_handler == proc_doulongvec_minmax) ||
>> -			    (table->proc_handler == proc_doulongvec_ms_jiffies_minmax)) {
>> -				if (table->maxlen > sizeof (unsigned long)) {
>> -					if (!table->extra1)
>> -						set_fail(&fail, table, "No min");
>> -					if (!table->extra2)
>> -						set_fail(&fail, table, "No max");
>> -				}
>> -			}
>>  #ifdef CONFIG_PROC_SYSCTL
>>  			if (table->procname && !table->proc_handler)
>>  				set_fail(&fail, table, "No proc_handler");
>
>That will probably fix it ;)


Yeah, it looks good to me too,

Acked-by: WANG Cong <xiyou.wangcong@gmail.com>

>
>net-avoid-limits-overflow.patch is dependent on this patch.  Unless
>Eric B squeaks I'll plan on sending this patch in for 2.6.37.
>

Eric B reminded me that we should check the code in sysctl_check.c,
but I forgot. The patch from Eric D is exactly what we need here.

Thanks.


^ permalink raw reply

* Re: [PATCH 1/2] r8169: allocate with GFP_KERNEL flag when able to sleep
From: Eric Dumazet @ 2010-10-08 16:27 UTC (permalink / raw)
  To: Stanislaw Gruszka; +Cc: Francois Romieu, netdev
In-Reply-To: <20101008160341.GC10393@redhat.com>

On Friday, 08 October 2010 at 18:03 +0200, Stanislaw Gruszka wrote:
> On Fri, Oct 08, 2010 at 05:04:07PM +0200, Eric Dumazet wrote:
> > On Friday, 08 October 2010 at 16:52 +0200, Stanislaw Gruszka wrote:
> > > On Fri, Oct 08, 2010 at 04:25:00PM +0200, Stanislaw Gruszka wrote:
> > > > We have a Fedora bug report where the driver fails to initialize after
> > > > suspend/resume because of memory allocation errors:
> > > > https://bugzilla.redhat.com/show_bug.cgi?id=629158
> > > 
> > > There is one more thing to do regarding the above. Call traces from the
> > > bug reports show that an order-3 allocation fails. On an arch with 4kB
> > > pages, order 3 means a 32kB allocation. We want to allocate 16kB, but
> > > there is also internal sk_buff data, which makes us exceed the boundary
> > > and take 32kB from the allocator, giving almost 50% wastage.
> > > 
> > 
> > Or it's only a 1460+overhead allocation, and SLUB uses order-3 pages to
> > satisfy 2048-byte allocations.
> 
> Rather not, the trace shows the failure in rtl8169_rx_fill, where we allocate
> rx buffers, and these are 16kB big by default.
> 

Only when gfp_t is GFP_KERNEL, to fill rx buffers (after your patch is
applied, of course). This should succeed. If not, the driver cannot load and
function, since this NIC really needs 16KB buffers in order to avoid a
hardware bug.

Once allocated for the RX rings, we never free them (never give this skb to
the upper stack): when we receive a frame, we copybreak it (using
GFP_ATOMIC), so it depends on the MTU.

With MTU=1500, I am pretty sure we allocate 2048-byte chunks, not more.

> I think, only on these drivers which do alloc_skb(n*PAGE_SIZE).
> As alternative we can be smarter in alloc_skb.

Only if MTU is non standard, then.

I repeat: with the standard MTU=1500, we don't allocate huge skbs in the rx
path, only small (<2048 bytes) ones.

For bigger frames, you might allocate fragments, using pages, and
not care if PAGE_SIZE is 64 Kbytes.
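
The order arithmetic quoted above can be checked with a small user-space
sketch. The helper names order_for_size and wasted_bytes are made up for this
illustration, and the per-buffer overhead used in the comments (256 bytes) is
an assumed figure, not a measured sk_buff overhead. The sketch only models
rounding a request up to a power-of-two number of pages.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest order such that (page_size << order) >= size. */
static unsigned int order_for_size(size_t size, size_t page_size)
{
	unsigned int order = 0;
	size_t span = page_size;

	assert(page_size > 0);
	while (span < size) {
		span <<= 1;
		order++;
	}
	return order;
}

/* Bytes of the rounded-up block that the request does not use. */
static size_t wasted_bytes(size_t size, size_t page_size)
{
	return (page_size << order_for_size(size, page_size)) - size;
}

/*
 * With 4kB pages:
 *   exactly 16kB fits in order 2 (16kB);
 *   16kB plus an assumed 256 bytes of overhead needs order 3 (32kB),
 *   leaving a little under half of the block unused.
 */
```

A 2048-byte request, by contrast, fits within a single page in this model.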

^ permalink raw reply

* Re: [PATCH] sysctl: fix min/max handling in __do_proc_doulongvec_minmax()
From: Américo Wang @ 2010-10-08 16:22 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andrew Morton, Eric Dumazet, Américo Wang, Robin Holt,
	linux-kernel, Willy Tarreau, David S. Miller, netdev,
	James Morris, Pekka Savola (ipv6), Patrick McHardy,
	Alexey Kuznetsov
In-Reply-To: <m1vd5d3ia9.fsf@fess.ebiederm.org>

On Thu, Oct 07, 2010 at 12:38:22PM -0700, Eric W. Biederman wrote:
>Andrew Morton <akpm@linux-foundation.org> writes:
>
>> On Thu, 07 Oct 2010 18:59:03 +0200
>> Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
>>> That's fine by me, thanks Eric.
>>> 
>>> Andrew, please remove previous patch from your tree and replace it by
>>> following one :
>>> 
>>> [PATCH v2] sysctl: fix min/max handling in __do_proc_doulongvec_minmax()
>>> 
>>> When proc_doulongvec_minmax() is used with an array of longs,
>>> and no min/max check requested (.extra1 or .extra2 being NULL), we
>>> dereference a NULL pointer for the second element of the array.
>>> 
>>> Noticed while doing some changes in network stack for the "16TB problem"
>>> 
>>> Fix is to not change min & max pointers in
>>> __do_proc_doulongvec_minmax(), so that all elements of the vector share
>>> a single min/max limit, like proc_dointvec_minmax().
>>> 
>>> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
>>> ---
>>>  kernel/sysctl.c |    2 +-
>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>> 
>>> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
>>> index f88552c..8e45451 100644
>>> --- a/kernel/sysctl.c
>>> +++ b/kernel/sysctl.c
>>> @@ -2485,7 +2485,7 @@ static int __do_proc_doulongvec_minmax(void *data, struct ctl_table *table, int
>>>  		kbuf[left] = 0;
>>>  	}
>>>  
>>> -	for (; left && vleft--; i++, min++, max++, first=0) {
>>> +	for (; left && vleft--; i++, first=0) {
>>>  		unsigned long val;
>>>  
>>>  		if (write) {
>>
>> Did we check to see whether any present callers are passing in pointers
>> to arrays of min/max values?
>
>In 2.6.36 there are not any callers that pass in a vector of anything, I
>don't know about linux-next.  It looks to me like incrementing min and
>max was simply a bug.
>

Agreed, I checked them too.

>> I wonder if there's any documentation for this interface which just
>> became wrong.
>
>Or it just became right.  Clearly no one has been expecting min
>and max to be vectors.
>

I think we need to document this before we rewrite the code.

-- 
Live like a child, think like the god.
 

^ permalink raw reply

* Re: BUG ? ipip unregister_netdevice_many()
From: Daniel Lezcano @ 2010-10-08 16:17 UTC (permalink / raw)
  Cc: Hans Schillstrom, Eric W. Biederman, netdev@vger.kernel.org
In-Reply-To: <4CAF3E78.8030202@free.fr>

On 10/08/2010 05:53 PM, Daniel Lezcano wrote:
> On 10/08/2010 02:28 PM, Hans Schillstrom wrote:
>> Hi Eric,
>> Any advice how to trace this down ?
>> This rollback_registered_many() seems to have been on the lists before...
>> All IPv4 and IPv6 tunnels causes this crash, all you have to do is
>> load the tunnel module(s)
>> enter a new ns and exit from it.
>>
>> Have not tested any more devices than tunnels,
>> I did an "ip link delete" on my macvlans before exiting the ns.
>
> Ah! I succeeded in reproducing it.
> In fact, it does not appear immediately.
>
> I am trying to simplify the configuration, but I keep running into the bug I
> talked about in the previous email.

Ok, so after investigating, we just need a macvlan and specify an ipv6 
address for it (inside a new netns of course), and the loopback is not 
released. I compiled out the tunnels, so they are not related to this 
problem I think.

That reduces the scope of investigation :)

Looking forward ...

   -- Daniel

^ permalink raw reply

* Re: [PATCH] sysctl: fix min/max handling in __do_proc_doulongvec_minmax()
From: Américo Wang @ 2010-10-08 16:13 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Américo Wang, Robin Holt, Andrew Morton, linux-kernel,
	Willy Tarreau, David S. Miller, netdev, James Morris,
	Pekka Savola (ipv6), Patrick McHardy, Alexey Kuznetsov, ebiederm
In-Reply-To: <1286445081.2912.15.camel@edumazet-laptop>

On Thu, Oct 07, 2010 at 11:51:21AM +0200, Eric Dumazet wrote:
>On Thursday, 07 October 2010 at 17:25 +0800, Américo Wang wrote:
>> >>
>> >
>> >Here is the final one.
>> 
>> Oops, that one is not correct. Hopefully this one
>> is correct.
>> 
>> --------------->
>> 
>> Eric D. noticed that we may trigger an OOPS if we leave ->extra{1,2}
>> to NULL when we use proc_doulongvec_minmax().
>> 
>> Actually, we don't need to store min/max values in a vector,
>> because all the elements in the vector should share the same min/max
>> value, like what proc_dointvec_minmax() does.
>> 
>
>If we assert same min/max limits are to be applied to all elements,
>a much simpler fix than yours would be :
>
>diff --git a/kernel/sysctl.c b/kernel/sysctl.c
>index f88552c..8e45451 100644
>--- a/kernel/sysctl.c
>+++ b/kernel/sysctl.c
>@@ -2485,7 +2485,7 @@ static int __do_proc_doulongvec_minmax(void *data, struct ctl_table *table, int
> 		kbuf[left] = 0;
> 	}
> 
>-	for (; left && vleft--; i++, min++, max++, first=0) {
>+	for (; left && vleft--; i++, first=0) {
> 		unsigned long val;
> 
> 		if (write) {
>
>
>Please don't send huge patches like this to 'fix' a bug,
>especially on a slow path.

Well, my patch makes that horrible code a little better. :)

>
>First we fix the bug, _then_ we can try to make code more 
>efficient or more pretty or shorter.
>
>So the _real_ question is :
>
>Should the min/max limits be a single pair,
>shared by all elements, or a vector of limits?
>

Yes, actually I talked with Eric W. about this before
sending the patch.

I also checked the users of proc_doulongvec_minmax(),
none of them are using more than one limit, so it is
safe to remove that.
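
To make the "single shared pair" reading concrete, here is a minimal
userspace sketch (not the kernel code). The function name
store_ulongs_shared_limits() is hypothetical; the point is only that
min and max are never advanced per element, and that a NULL limit means
"unbounded on that side", so nothing is dereferenced when ->extra1 or
->extra2 is left NULL.

```c
#include <stddef.h>

/*
 * Userspace sketch, not kernel code: copy n values from src to dst,
 * skipping any value outside one shared [min, max] range.
 * A NULL min or max means there is no limit on that side.
 * Skipped elements keep their old value in dst.
 * Returns the number of values actually stored.
 */
size_t store_ulongs_shared_limits(unsigned long *dst,
                                  const unsigned long *src, size_t n,
                                  const unsigned long *min,
                                  const unsigned long *max)
{
    size_t stored = 0;

    for (size_t i = 0; i < n; i++) {
        unsigned long val = src[i];

        /* min and max are NOT incremented here: one pair for all. */
        if ((min && val < *min) || (max && val > *max))
            continue;
        dst[i] = val;
        stored++;
    }
    return stored;
}
```

With a per-element vector of limits, the loop would instead step min
and max alongside i, which is exactly what goes wrong when they start
out NULL.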


-- 
Live like a child, think like the god.
 

^ permalink raw reply

* Re: BUG ? ipip unregister_netdevice_many()
From: Eric W. Biederman @ 2010-10-08 16:06 UTC (permalink / raw)
  To: Hans Schillstrom; +Cc: netdev@vger.kernel.org, Daniel Lezcano
In-Reply-To: <201010071048.12817.hans.schillstrom@ericsson.com>

Hans Schillstrom <hans.schillstrom@ericsson.com> writes:

> Hello
> I'm trying to exit a network name space and it doesn't work (or am I doing something wrong?)
> The only netdevices left are lo and the tunnels ip6tnl0, sit0 and tunl0 when exiting netns.
>
> A netns is created by lxc-execute with two interfaces eth0 eth1 (macvlan)
> (see conf file at the end)
>
> Kernel: net-next-2.6 top from 4 october 2010
>
> I added some printk's in ipip.c, in ipip_exit_net()
> ...
>         rtnl_lock();
>         printk(KERN_ERR "ipip_exit_net(enter)\n");
>         ipip_destroy_tunnels(ipn, &list);
>         printk(KERN_ERR "ipip_exit_net(1)\n");
>         unregister_netdevice_queue(ipn->fb_tunnel_dev, &list);
>         printk(KERN_ERR "ipip_exit_net(2)\n");
>         unregister_netdevice_many(&list);
>         printk(KERN_ERR "ipip_exit_net(3)\n");
>         rtnl_unlock();
>         printk(KERN_ERR "ipip_exit_net(exit)\n");
>
>
> Exit steps:
> ===== Screen dump =====
>
>  # ifconfig eth0  0.0.0.0  down
>  # ifconfig eth1  0.0.0.0  down
>  # ifconfig lo  0.0.0.0  down
>  # ip li de eth0
>  # ip li de eth1
>  # ifconfig -a
> ip6tnl0   Link encap:UNSPEC  HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00  
>           NOARP  MTU:1460  Metric:1
>           RX packets:0 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:0 
>           RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)
>
> lo        Link encap:Local Loopback  
>           inet addr:127.0.0.1  Mask:255.0.0.0
>           LOOPBACK  MTU:16436  Metric:1
>           RX packets:0 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:0 
>           RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)
>
> sit0      Link encap:IPv6-in-IPv4  
>           NOARP  MTU:1480  Metric:1
>           RX packets:0 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:0 
>           RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)
>
> tunl0     Link encap:UNSPEC  HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00  
>           NOARP  MTU:1480  Metric:1
>           RX packets:0 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:0 
>           RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)
>
>  # ps
>   PID USER       VSZ STAT COMMAND
>     1 root     12412 S    /usr/lib64/lxc/lxc-init -- /var/bin/init
>     2 root      4540 S    /bin/ash /var/bin/init
>     7 root      6640 S    inetd
>     8 root      4544 S    /bin/ash
>    26 root      4544 R    ps
>  # lsmod 
> Module                  Size  Used by    Not tainted
> macvlan                 8709  0 
> pcnet32                29549  0 
> tg3                   112093  0 
> libphy                 21043  1 tg3
>  # kill 7 2
>  # ps
>   PID USER       VSZ STAT COMMAND
>     1 root     12412 S    /usr/lib64/lxc/lxc-init -- /var/bin/init
>     8 root      4544 S    /bin/ash
>    28 root      4544 R    ps
>  # exit  ( here is the exit from netns  )
>  # ipip_exit_net(enter)
> ipip_exit_net(1)
> ipip_exit_net(2)
> ------------[ cut here ]------------
> WARNING: at /home/hans/evip/kvm/net-next-2.6/kernel/sysctl.c:1953
>   unregister_sysctl_table+0xc7/0xf9()

This warning is caused by removing the parent directory
before the child in the sysctl tables.  Not strictly fatal but
it is a problem.  It may be worth looking at which sysctl
tables ipip registers to see if we can rectify this.

> Hardware name: Bochs
> Modules linked in: macvlan pcnet32 tg3 libphy
> Pid: 5, comm: kworker/u:0 Not tainted 2.6.36-rc3+ #7
> Call Trace:
>  [<ffffffff8103e281>] warn_slowpath_common+0x85/0x9d
>  [<ffffffff8103e2b3>] warn_slowpath_null+0x1a/0x1c
>  [<ffffffff81045e64>] unregister_sysctl_table+0xc7/0xf9
>  [<ffffffff812c86a5>] neigh_sysctl_unregister+0x27/0x3f
>  [<ffffffff81342108>] addrconf_ifdown+0x415/0x45e
>  [<ffffffff81342b98>] addrconf_notify+0x756/0x7fe
>  [<ffffffff812cacfb>] ? neigh_ifdown+0xc3/0xd4
>  [<ffffffff813622b3>] ? ip6mr_device_event+0x8d/0x9e
>  [<ffffffff8105eddb>] notifier_call_chain+0x37/0x63
>  [<ffffffff8105ee8b>] raw_notifier_call_chain+0x14/0x16
>  [<ffffffff812c15c7>] call_netdevice_notifiers+0x4a/0x4f
>  [<ffffffff812c1c1b>] rollback_registered_many+0x121/0x208
>  [<ffffffff812c1d1d>] unregister_netdevice_many+0x1b/0x71
>  [<ffffffff81324209>] ipip_exit_net+0xea/0x11a
>  [<ffffffff812bc941>] ? cleanup_net+0x0/0x198
>  [<ffffffff812bc2cf>] ops_exit_list+0x2a/0x5b
>  [<ffffffff812bca39>] cleanup_net+0xf8/0x198
>  [<ffffffff810568c7>] process_one_work+0x2a2/0x44d
>  [<ffffffff81056e35>] worker_thread+0x1db/0x34e
>  [<ffffffff81056c5a>] ? worker_thread+0x0/0x34e
>  [<ffffffff8105a030>] kthread+0x82/0x8a
>  [<ffffffff81003954>] kernel_thread_helper+0x4/0x10
>  [<ffffffff81059fae>] ? kthread+0x0/0x8a
>  [<ffffffff81003950>] ? kernel_thread_helper+0x0/0x10
> ---[ end trace 939b5185219f32e7 ]---
> ipip_exit_net(3)
> ipip_exit_net(exit)
> unregister_netdevice: waiting for lo to become free. Usage count = 4
> unregister_netdevice: waiting for lo to become free. Usage count = 4
> unregister_netdevice: waiting for lo to become free. Usage count = 4

Nasty. Someone has left a reference lying around to one of the network
devices.  It is a reference that we can transfer to the loopback device
at device exit time, but we never drop the reference and so the loopback
interface never frees up.

Ouch!

There is the painful method of instrumenting dev_hold and dev_put
that may give you a clue.  It may also be worth seeing which kinds of
device reference we transfer to the loopback device when a device
exits.
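
As a rough illustration of that kind of instrumentation, here is a
userspace sketch only. struct hold_log, tracked_hold(), tracked_put()
and first_leaker() are all hypothetical names, not kernel APIs; in the
kernel you would hang something similar off dev_hold/dev_put and tag
each call site. The idea is to count holds and puts per tag, so that an
unbalanced tag points at whoever kept the reference.

```c
#include <stddef.h>
#include <string.h>

#define MAX_TAGS 16

/* Hypothetical bookkeeping structure: one counter per call-site tag. */
struct hold_log {
    const char *tag[MAX_TAGS];
    int count[MAX_TAGS];
    int ntags;
    int total;              /* overall usage count, like the one printed */
};

/* Find the counter for a tag, creating it on first use. */
static int *slot_for(struct hold_log *log, const char *tag)
{
    for (int i = 0; i < log->ntags; i++)
        if (strcmp(log->tag[i], tag) == 0)
            return &log->count[i];
    if (log->ntags == MAX_TAGS)
        return NULL;        /* table full: only the total is tracked */
    log->tag[log->ntags] = tag;
    log->count[log->ntags] = 0;
    return &log->count[log->ntags++];
}

void tracked_hold(struct hold_log *log, const char *tag)
{
    int *c = slot_for(log, tag);

    if (c)
        (*c)++;
    log->total++;
}

void tracked_put(struct hold_log *log, const char *tag)
{
    int *c = slot_for(log, tag);

    if (c)
        (*c)--;
    log->total--;
}

/* Return the first tag with more holds than puts, or NULL if balanced. */
const char *first_leaker(const struct hold_log *log)
{
    for (int i = 0; i < log->ntags; i++)
        if (log->count[i] > 0)
            return log->tag[i];
    return NULL;
}
```

Dumping the per-tag counts when the "waiting for lo to become free"
message fires would then show which subsystem is still holding on.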

Eric

^ permalink raw reply

* Re: [PATCH 1/2] r8169: allocate with GFP_KERNEL flag when able to sleep
From: Stanislaw Gruszka @ 2010-10-08 16:03 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Francois Romieu, netdev
In-Reply-To: <1286550247.2959.444.camel@edumazet-laptop>

On Fri, Oct 08, 2010 at 05:04:07PM +0200, Eric Dumazet wrote:
> Le vendredi 08 octobre 2010 à 16:52 +0200, Stanislaw Gruszka a écrit :
> > On Fri, Oct 08, 2010 at 04:25:00PM +0200, Stanislaw Gruszka wrote:
> > > We have fedora bug report where driver fail to initialize after
> > > suspend/resume because of memory allocation errors:
> > > https://bugzilla.redhat.com/show_bug.cgi?id=629158
> > 
> > There is also one more thing to do regarding the above. Call traces from the
> > bug reports show that an order-3 allocation fails. On an arch with 4kB pages,
> > order 3 means a 32kB allocation. We want to allocate 16kB, but there is also
> > internal sk_buff data, which makes us exceed the boundary and take
> > 32kB from the allocator, wasting almost 50%.
> > 
> 
> Or it's only a 1460+overhead allocation, and SLUB uses order-3 pages to
> satisfy 2048-byte allocations.

Rather not: the trace shows the failure in rtl8169_rx_fill, where we
allocate rx buffers, and these are 16kB by default.
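
For the arithmetic, a small userspace sketch (order_for_size() is a
hypothetical helper, and 4kB pages are assumed): 16kB alone fits in an
order-2 block, but 16kB plus any extra byte of overhead already needs
order 3, i.e. 32kB.

```c
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096UL     /* assumption: 4kB pages */

/* Smallest order such that (page size << order) >= size. */
unsigned int order_for_size(size_t size)
{
    unsigned int order = 0;

    while ((SKETCH_PAGE_SIZE << order) < size)
        order++;
    return order;
}
```

So order_for_size(16384) is 2, while order_for_size(16384 + 1) is 3.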

> Switch to SLAB -> no more problem ;)

Yeah, I wish I could, but Fedora uses SLUB because of some debugging
capabilities.

> > To fix this we can use a method similar to the niu or iwlwifi drivers: allocate
> > pages directly from the buddy allocator and attach them to the skb (by
> > skb_add_rx_frag for example). I'm going to prepare such a patch, but
> > I have one doubt: what happens if the page size in the system is bigger
> > than 16kB? Should I care about such a case?
> 
> Seems tricky. Should we patch all drivers to do something like that ?

I think only those drivers which do alloc_skb(n*PAGE_SIZE).
As an alternative, we could be smarter in alloc_skb.

Stanislaw

^ permalink raw reply

* Re: BUG ? ipip unregister_netdevice_many()
From: Daniel Lezcano @ 2010-10-08 15:53 UTC (permalink / raw)
  To: Hans Schillstrom; +Cc: Eric W. Biederman, netdev@vger.kernel.org
In-Reply-To: <201010081428.37639.hans.schillstrom@ericsson.com>

On 10/08/2010 02:28 PM, Hans Schillstrom wrote:
> Hi Eric,
> Any advice how to trace this down ?
> This rollback_registered_many() seems to have on the lists before...
> All IPv4 and IPv6 tunnels causes this crash, all you have to do is load the tunnel module(s)
> enter a new ns and exit from it.
>
> Have not tested any more devices than tunnels,
> I did an "ip link delete" on my macvlans before exiting the ns.
>    

Ah! I managed to reproduce it.
It does not appear immediately, in fact.

I am trying to simplify the configuration but I keep running into the bug I
mentioned in the previous email.

> snip
>    
>>   # ------------[ cut here ]------------
>> WARNING: at /home/hans/evip/kvm/net-next-2.6/kernel/sysctl.c:1953 unregister_sysctl_table+0xc7/0xf9()
>> Hardware name: Bochs
>> Modules linked in: macvlan ip6_tunnel tunnel6 pcnet32 tg3 libphy
>> Pid: 5, comm: kworker/u:0 Not tainted 2.6.36-rc3 #2
>> Call Trace:
>>   [<ffffffff8103e281>] warn_slowpath_common+0x85/0x9d
>>   [<ffffffff8103e2b3>] warn_slowpath_null+0x1a/0x1c
>>   [<ffffffff81045e64>] unregister_sysctl_table+0xc7/0xf9
>>   [<ffffffff812c86a5>] neigh_sysctl_unregister+0x27/0x3f
>>   [<ffffffff81340c75>] addrconf_ifdown+0x415/0x45e
>>   [<ffffffff81341705>] addrconf_notify+0x756/0x7fe
>>   [<ffffffff812cacfb>] ? neigh_ifdown+0xc3/0xd4
>>   [<ffffffff81360eb3>] ? ip6mr_device_event+0x8d/0x9e
>>   [<ffffffff8105eddb>] notifier_call_chain+0x37/0x63
>>   [<ffffffff8105ee8b>] raw_notifier_call_chain+0x14/0x16
>>   [<ffffffff812c15c7>] call_netdevice_notifiers+0x4a/0x4f
>>   [<ffffffff812c1c1b>] rollback_registered_many+0x121/0x208
>>   [<ffffffff812c1d1d>] unregister_netdevice_many+0x1b/0x71
>>   [<ffffffffa0047244>] ip6_tnl_exit_net+0xa4/0xb8 [ip6_tunnel]
>>   [<ffffffff812bc941>] ? cleanup_net+0x0/0x198
>>   [<ffffffff812bc2cf>] ops_exit_list+0x2a/0x5b
>>   [<ffffffff812bca39>] cleanup_net+0xf8/0x198
>>   [<ffffffff810568c7>] process_one_work+0x2a2/0x44d
>>   [<ffffffff81056e35>] worker_thread+0x1db/0x34e
>>   [<ffffffff81056c5a>] ? worker_thread+0x0/0x34e
>>   [<ffffffff8105a030>] kthread+0x82/0x8a
>>   [<ffffffff81003954>] kernel_thread_helper+0x4/0x10
>>   [<ffffffff81059fae>] ? kthread+0x0/0x8a
>>   [<ffffffff81003950>] ? kernel_thread_helper+0x0/0x10
>> ---[ end trace eb3bc950cf9a8748 ]---
>> unregister_netdevice: waiting for lo to become free. Usage count = 4
>> unregister_netdevice: waiting for lo to become free. Usage count = 4
>> unregister_netdevice: waiting for lo to become free. Usage count = 4
>>      
>
>    


^ permalink raw reply

* Re: Linux 2.6.36-rc7
From: James Bottomley @ 2010-10-08 15:05 UTC (permalink / raw)
  To: Stephen Rothwell
  Cc: Linus Torvalds, Linux Kernel Mailing List, Russell King,
	David Miller, netdev, John W. Linville, Michal Marek,
	Dmitry Torokhov
In-Reply-To: <20101007114938.ad3d2c76.sfr@canb.auug.org.au>

On Thu, 2010-10-07 at 11:49 +1100, Stephen Rothwell wrote:
> Hi Linus,
> 
> On Wed, 6 Oct 2010 14:45:13 -0700 Linus Torvalds <torvalds@linux-foundation.org> wrote:
> >
> > This should be the last -rc, I'm not seeing any reason to keep
> > delaying a real release. There was still more changes to
> > drivers/gpu/drm than I really would have hoped for, but they all look
> > harmless and good. Famous last words.
> 
> I have no idea how critical any of this stuff is, but linux-next contains
> the following in its "current" trees, i.e. stuff that is supposed to go
> into 2.6.36.  These are from the arm-current, scsi-rc-fixes, net-current,
> wireless-current, kbuild-current, input-current and ide-current trees
> (contacts cc'd).

The SCSI rc-fixes stuff is critical if you run into the bugs, but the
bugs are fairly rare cases for most people.  I'd still like to get them
in, though (and I have another 3 rc fixes candidates going through the
test pipeline).

James

^ permalink raw reply

* Re: [PATCH 1/2] r8169: allocate with GFP_KERNEL flag when able to sleep
From: Eric Dumazet @ 2010-10-08 15:04 UTC (permalink / raw)
  To: Stanislaw Gruszka; +Cc: Francois Romieu, netdev
In-Reply-To: <20101008145256.GB10393@redhat.com>

Le vendredi 08 octobre 2010 à 16:52 +0200, Stanislaw Gruszka a écrit :
> On Fri, Oct 08, 2010 at 04:25:00PM +0200, Stanislaw Gruszka wrote:
> > We have fedora bug report where driver fail to initialize after
> > suspend/resume because of memory allocation errors:
> > https://bugzilla.redhat.com/show_bug.cgi?id=629158
> 
> There is also one more thing to do regarding the above. Call traces from the
> bug reports show that an order-3 allocation fails. On an arch with 4kB pages,
> order 3 means a 32kB allocation. We want to allocate 16kB, but there is also
> internal sk_buff data, which makes us exceed the boundary and take
> 32kB from the allocator, wasting almost 50%.
> 

Or it's only a 1460+overhead allocation, and SLUB uses order-3 pages to
satisfy 2048-byte allocations.

# grep 2048 /proc/slabinfo 
kmalloc-2048        8664   8752   2048   16    8 : tunables    0    0    0 : slabdata    547    547      0


The 8 in the <pagesperslab> column says just that: order-3 pages, even
for small allocations.
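
To spell out how to read those two columns, here is a userspace sketch
(both helper names are hypothetical, and 4kB pages are assumed): 8
pages per slab is order 3, and 8 * 4096 / 2048 gives the 16 objects per
slab shown in the <objperslab> column.

```c
/* Page order of a slab made of pages_per_slab pages (a power of two). */
unsigned int slab_order(unsigned int pages_per_slab)
{
    unsigned int order = 0;

    while ((1U << order) < pages_per_slab)
        order++;
    return order;
}

/* How many objects of obj_size bytes fit in one slab. */
unsigned long slab_objects(unsigned int pages_per_slab,
                           unsigned long page_size,
                           unsigned long obj_size)
{
    return pages_per_slab * page_size / obj_size;
}
```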

Switch to SLAB -> no more problem ;)


> To fix this we can use a method similar to the niu or iwlwifi drivers: allocate
> pages directly from the buddy allocator and attach them to the skb (by
> skb_add_rx_frag for example). I'm going to prepare such a patch, but
> I have one doubt: what happens if the page size in the system is bigger
> than 16kB? Should I care about such a case?

Seems tricky. Should we patch all drivers to do something like that ?




^ permalink raw reply
