Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH linux-next v2 1/2] irq: Add CPU mask affinity hint
From: Peter P Waskiewicz Jr @ 2010-04-30 18:02 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: davem@davemloft.net, arjan@linux.jf.intel.com,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <alpine.LFD.2.00.1004301249540.2951@localhost.localdomain>

On Fri, 30 Apr 2010, Thomas Gleixner wrote:

> On Fri, 30 Apr 2010, Peter P Waskiewicz Jr wrote:
>
>> This patch adds a cpumask affinity hint to the irq_desc
>> structure, along with a registration function and a read-only
>> proc entry for each interrupt.
>>
>> This affinity_hint handle for each interrupt can be used by
>> underlying drivers that need a better mechanism to control
>> interrupt affinity.  The underlying driver can register a
>> cpumask for the interrupt, which will allow the driver to
>> provide the CPU mask for the interrupt to anything that
>> requests it.  The intent is to extend the userspace daemon,
>> irqbalance, to help hint to it a preferred CPU mask to balance
>> the interrupt into.
>>
>> Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
>> ---
>>
>>  include/linux/interrupt.h |   13 +++++++++++++
>>  include/linux/irq.h       |    1 +
>>  kernel/irq/manage.c       |   28 ++++++++++++++++++++++++++++
>>  kernel/irq/proc.c         |   33 +++++++++++++++++++++++++++++++++
>>  4 files changed, 75 insertions(+), 0 deletions(-)
>>
>> diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
>> index 75f3f00..9c9ea2a 100644
>> --- a/include/linux/interrupt.h
>> +++ b/include/linux/interrupt.h
>> @@ -209,6 +209,9 @@ extern int irq_set_affinity(unsigned int irq, const struct cpumask *cpumask);
>>  extern int irq_can_set_affinity(unsigned int irq);
>>  extern int irq_select_affinity(unsigned int irq);
>>
>> +extern int irq_register_affinity_hint(unsigned int irq,
>> +                                      const struct cpumask *m);
>
> I think we can do with a single funtion irq_set_affinity_hint() and
> let the caller set the pointer to NULL.

Ok, I've been running into some issues.  If CONFIG_CPUMASK_OFFSTACK is not 
set, then cpumask_var_t structs are single-element arrays that cannot be 
NULL'd out.  I'm pretty sure I need to keep the unregister part of the 
API.  Thoughts?

>> +	raw_spin_lock_irqsave(&desc->lock, flags);
>> +	if (desc->affinity_hint) {
>> +		seq_cpumask(m, desc->affinity_hint);
>
>  Please make a local copy under desc->mask and do the seq_cpumask()
>  stuff on the local copy outside of desc->lock

I just looked at the original show_affinity function, and it does not grab 
desc->lock before copying mask out of desc.  Should I follow that model, 
or should I fix that function to honor desc->lock?

-PJ

^ permalink raw reply

* Re: ixgbe and mac-vlans problem
From: Arnd Bergmann @ 2010-04-30 18:00 UTC (permalink / raw)
  To: Ben Greear; +Cc: NetDev, Patrick McHardy
In-Reply-To: <4BDA07DB.8020206@candelatech.com>

On Friday 30 April 2010 00:27:39 Ben Greear wrote:
> Basically, we create 50 mac-vlans, with sequential MAC addresses and sequential
> IP addresses, and set up ip rules properly.
> 
> The issue is that only 10 or so of the mac-vlans receive other than
> broadcast packets.  The ixgbe NIC doesn't show PROMISC mode.

I just took a brief look at the driver and noticed that 82599 should
be able to handle 128 entries before going into promisc mode, while
82598 (the same driver) does 16.

Maybe the logic for >16 entries is wrong, so you could try forcing
hw->mac.num_rar_entries to 16 for 82599 as well.

	Arnd

^ permalink raw reply

* Re: [PATCH 1/3] ptp: Added a brand new class driver for ptp clocks.
From: Wolfgang Grandegger @ 2010-04-30 17:58 UTC (permalink / raw)
  To: Richard Cochran; +Cc: netdev
In-Reply-To: <20100429091936.GA6703@riccoc20.at.omicron.at>

Hi Richard,

Richard Cochran wrote:
> This patch adds an infrastructure for hardware clocks that implement
> IEEE 1588, the Precision Time Protocol (PTP). A class driver offers a
> registration method to particular hardware clock drivers. Each clock is
> exposed to user space as a character device with ioctls that allow tuning
> of the PTP clock.
> 
> Signed-off-by: Richard Cochran <richard.cochran@omicron.at>
> ---
>  Documentation/ptp/ptp.txt        |   78 ++++++++++
>  Documentation/ptp/testptp.c      |  130 ++++++++++++++++
>  Documentation/ptp/testptp.mk     |   33 ++++
>  drivers/Kconfig                  |    2 +
>  drivers/Makefile                 |    1 +
>  drivers/ptp/Kconfig              |   26 ++++
>  drivers/ptp/Makefile             |    5 +
>  drivers/ptp/ptp_clock.c          |  302 ++++++++++++++++++++++++++++++++++++++
>  include/linux/Kbuild             |    1 +
>  include/linux/ptp_clock.h        |   37 +++++

ptp_clock.h should probably be added to "include/linux/Kbuild".

Wolfgang.

^ permalink raw reply

* Re: [PATCH] tcp: SO_TIMESTAMP implementation for TCP
From: David Miller @ 2010-04-30  6:39 UTC (permalink / raw)
  To: therbert; +Cc: netdev
In-Reply-To: <alpine.DEB.1.00.1004292246440.12776@pokey.mtv.corp.google.com>

From: Tom Herbert <therbert@google.com>
Date: Thu, 29 Apr 2010 23:07:54 -0700 (PDT)

> Implement SO_TIMESTAMP{NS} for TCP.  When this socket option is enabled
> on a TCP socket, a timestamp for received data can be returned in the
> ancillary data of a recvmsg with control message type SCM_TIMESTAMP{NS}.
> The timestamp chosen is that of the skb most recently received from
> which data was copied.  This is useful in debugging and timing
> network operations.
> 
> Signed-off-by: Tom Herbert <therbert@google.com>

That's not what you're implementing here.

You're only updating the socket timestamp if the SKB passed into
the update function has a more recent timestamp.

There is nothing that says the timestamps have to be increasing and
with retransmits and such if it were me I would want to see the real
timestamp even if it was earlier than the most recently reported
timestamp.

I don't know, I really don't like this feature at all.  SO_TIMESTAMP
is really meant for datagram oriented sockets, where there is a
clearly defined "packet" whose timestamp you get.  A TCP receive can
involve hundreds of tiny packets so the timestamp can lack any
meaning.

All these new checks and branches for a feature of questionable value.
If you can modify you apps to grab this information you can also probe
for the information using external probing tools.

Sorry, I don't think I'll be applying this.

^ permalink raw reply

* Re: [PATCH net-next-2.6] net: speedup udp receive path
From: Eric Dumazet @ 2010-04-29 13:21 UTC (permalink / raw)
  To: hadi
  Cc: Changli Gao, David Miller, therbert, shemminger, netdev,
	Eilon Greenstein, Brian Bloniarz
In-Reply-To: <1272547061.4258.174.camel@bigi>

Le jeudi 29 avril 2010 à 09:17 -0400, jamal a écrit :

> Could we have some stat in there that shows IPIs being produced? I think
> it would help to at least observe any changes over variety of tests.
> I did try to patch my system during the first few tests to record IPIs
> but it seems to make more sense to have it as a perf stat.
> 
> > Even with 200.000 IPI per second, 'perf top -C CPU_IPI_sender' shows
> > that sending IPI is very cheap (maybe ~1% of cpu cycles)
> > 
> > # Samples: 32033467127
> > #
> 
> One thing i observed is our profiles seem different. Could you send me
> your .config for a single nehalem and i will try to go as close as
> possible to it? I have a sky2 instead of bnx - but i suspect everything
> else will be very similar...
> I apologize i dont have much time to look into details - but what i can
> do is test at least.

I'am going to redo some test on my 'old machine', with tg3 driver.

You could try following program :


#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>

struct softnet_stat_vals {
	int flip;
	unsigned int tab[2][10];
};

int read_file(struct softnet_stat_vals *v)
{
	char buffer[1024];
	FILE *F = fopen("/proc/net/softnet_stat", "r");

	v->flip ^= 1;
	if (!F)
		return -1;

	memset(v->tab[v->flip], 0, 10 * sizeof(unsigned int));
	while (fgets(buffer, sizeof(buffer), F)) {
		int i, pos = 0;
		unsigned int val;
	
		for (i = 0; ;) {
			if (sscanf(buffer + pos, "%08x", &val) != 1) break;
			v->tab[v->flip][i] += val;
			pos += 9;
			if (++i == 10)
				break;
			}
		}
	fclose(F);

}


int main(int argc, char *argv[])
{
	struct softnet_stat_vals *v = calloc(sizeof(struct softnet_stat_vals), 1);
	
	read_file(v);
	for (;;) {
		sleep(1);
		read_file(v);
		printf("%u rps\n", v->tab[v->flip][9] - v->tab[v->flip^1][9]);
	}
}



^ permalink raw reply

* [PATCH 0/3] [RFC] [v2] ptp: IEEE 1588 clock support
From: Richard Cochran @ 2010-04-29  9:19 UTC (permalink / raw)
  To: netdev

* Patch ChangeLog
** v2
   - Changed clock list from a static array into a dynamic list. Also,
     use a bitmap to manage the clock's minor numbers.
   - Replaced character device semaphore with a mutex.
   - Drop .ko from module names in Kbuild help.
   - Replace deprecated unifdef-y with header-y for user space header file.
   - Gianfar driver now gets parameters from device tree.
   - Added API documentation to Documentation/ptp/ptp.txt, with links
     to both of the ptpd patches on sourceforge.

* Preface

Now and again there has been some talk on this list of adding PTP
support into Linux. One part of the picture is already in place, the
SO_TIMESTAMPING API for hardware time stamping. It has been pointed
out that this API is not perfect, however, it is good enough for many
real world uses of IEEE 1588. The second needed part has not, AFAICT,
ever been addressed.

Here I offer an early draft of an idea how to bring the missing
functionality into Linux. I don't yet have all of the features
implemented, as described below. Still I would like to get your
feedback concerning this idea before getting too far into it. I do
have all of the hardware mentioned at hand, so I have a good idea that
the proposed API covers the features of those clocks.

Thanks in advance for your comments,

Richard


Richard Cochran (3):
  ptp: Added a brand new class driver for ptp clocks.
  ptp: Added a clock that uses the Linux system time.
  ptp: Added a clock that uses the eTSEC found on the MPC85xx.

 Documentation/ptp/ptp.txt             |   78 +++++++++
 Documentation/ptp/testptp.c           |  130 ++++++++++++++
 Documentation/ptp/testptp.mk          |   33 ++++
 arch/powerpc/boot/dts/mpc8313erdb.dts |   14 ++
 arch/powerpc/boot/dts/p2020ds.dts     |   13 ++
 arch/powerpc/boot/dts/p2020rdb.dts    |   14 ++
 drivers/Kconfig                       |    2 +
 drivers/Makefile                      |    1 +
 drivers/net/Makefile                  |    1 +
 drivers/net/gianfar_ptp.c             |  308 +++++++++++++++++++++++++++++++++
 drivers/net/gianfar_ptp_reg.h         |  107 ++++++++++++
 drivers/ptp/Kconfig                   |   51 ++++++
 drivers/ptp/Makefile                  |    6 +
 drivers/ptp/ptp_clock.c               |  302 ++++++++++++++++++++++++++++++++
 drivers/ptp/ptp_linux.c               |  122 +++++++++++++
 include/linux/Kbuild                  |    1 +
 include/linux/ptp_clock.h             |   37 ++++
 include/linux/ptp_clock_kernel.h      |  134 ++++++++++++++
 kernel/time/ntp.c                     |    2 +
 19 files changed, 1356 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/ptp/ptp.txt
 create mode 100644 Documentation/ptp/testptp.c
 create mode 100644 Documentation/ptp/testptp.mk
 create mode 100644 drivers/net/gianfar_ptp.c
 create mode 100644 drivers/net/gianfar_ptp_reg.h
 create mode 100644 drivers/ptp/Kconfig
 create mode 100644 drivers/ptp/Makefile
 create mode 100644 drivers/ptp/ptp_clock.c
 create mode 100644 drivers/ptp/ptp_linux.c
 create mode 100644 include/linux/ptp_clock.h
 create mode 100644 include/linux/ptp_clock_kernel.h


^ permalink raw reply

* Re: ixgbe and mac-vlans problem
From: Patrick McHardy @ 2010-04-30 12:24 UTC (permalink / raw)
  To: Ben Greear; +Cc: NetDev
In-Reply-To: <4BDA07DB.8020206@candelatech.com>

Ben Greear wrote:
> Hello!
> 
> We are seeing a strange problem with mac-vlans on top of an 82599 NIC
> (ixgbe)
> on a 2.6.31.y (plus patches) kernel.
> 
> Basically, we create 50 mac-vlans, with sequential MAC addresses and
> sequential
> IP addresses, and set up ip rules properly.
> 
> The issue is that only 10 or so of the mac-vlans receive other than
> broadcast packets.  The ixgbe NIC doesn't show PROMISC mode.

Not showing PROMISC is normal since this is going into promisc
when too many unicast addresses are configured is handled internally
by the drivers.

> If we manually turn on PROMISC mode on the physical ixgbe interface,
> then everything works properly.  I'm guessing the ixgbe NIC must
> have some filters that mac-vlan is trying to enable w/out turning on
> full PROMISC mode.  Is there any easy way to debug these filters
> to see if they are being properly configured?

Adding "#define DEBUG" to ixgbe_common.c should print some information
when unicast addresses are added.

^ permalink raw reply

* Re: [net-2.6 PATCH] e1000e: enable/disable ASPM L0s and L1 and ERT according to hardware errata
From: Anton Blanchard @ 2010-04-29  7:46 UTC (permalink / raw)
  To: Jeff Kirsher; +Cc: davem, netdev, gospo, Matthew Garret, Bruce Allan
In-Reply-To: <20100427133232.25490.92973.stgit@localhost.localdomain>


Hi,

> From: Bruce Allan <bruce.w.allan@intel.com>
> 
> Prompted by a previous patch submitted by Matthew Garret <mjg@redhat.com>,
> further digging into errata documentation reveals the current enabling or
> disabling of ASPM L0s and L1 states for certain parts supported by this
> driver are incorrect.  82571 and 82572 should always disable L1.  For
> standard frames, 82573/82574/82583 can enable L1 but L0s must be disabled,
> and for jumbo frames 82573/82574 must disable L1.  This allows for some
> parts to enable L1 in certain configurations leading to better power
> savings.
> 
> Also according to the same errata, Early Receive (ERT) should be disabled
> on 82573 when using jumbo frames.

This oopses on one of my ppc64 boxes with a NULL pointer (0x4a):

Unable to handle kernel paging request for data at address 0x0000004a
Faulting instruction address: 0xc0000000004d2f1c
cpu 0xe: Vector: 300 (Data Access) at [c000000bec1833a0]
    pc: c0000000004d2f1c: .e1000e_disable_aspm+0xe0/0x150
    lr: c0000000004d2f0c: .e1000e_disable_aspm+0xd0/0x150
   dar: 4a

[c000000bec1836d0] c00000000069b9d8 .e1000_probe+0x84/0xe8c
[c000000bec1837b0] c000000000386d90 .local_pci_probe+0x4c/0x68
[c000000bec183840] c0000000003872ac .pci_device_probe+0xfc/0x148
[c000000bec183900] c000000000409e8c .driver_probe_device+0xe4/0x1d0
[c000000bec1839a0] c00000000040a024 .__driver_attach+0xac/0xf4
[c000000bec183a40] c000000000409124 .bus_for_each_dev+0x9c/0x10c
[c000000bec183b00] c000000000409c1c .driver_attach+0x40/0x60
[c000000bec183b90] c0000000004085dc .bus_add_driver+0x150/0x328
[c000000bec183c40] c00000000040a58c .driver_register+0x100/0x1c4
[c000000bec183cf0] c00000000038764c .__pci_register_driver+0x78/0x128

Seems like pdev->bus->self == NULL. I haven't touched pci in a long time
so I'm trying to remember what this means (no pcie bridge perhaps?)

The patch below fixes the oops for me.

Signed-off-by: Anton Blanchard <anton@samba.org>
---

Index: linux-2.6.34-rc5/drivers/net/e1000e/netdev.c
===================================================================
--- linux-2.6.34-rc5.orig/drivers/net/e1000e/netdev.c	2010-04-29 00:10:58.000000000 -0500
+++ linux-2.6.34-rc5/drivers/net/e1000e/netdev.c	2010-04-29 02:20:50.000000000 -0500
@@ -4633,6 +4633,9 @@
 	reg16 &= ~state;
 	pci_write_config_word(pdev, pos + PCI_EXP_LNKCTL, reg16);
 
+	if (!pdev->bus->self)
+		return;
+
 	pos = pci_pcie_cap(pdev->bus->self);
 	pci_read_config_word(pdev->bus->self, pos + PCI_EXP_LNKCTL, &reg16);
 	reg16 &= ~state;

^ permalink raw reply

* Re: [patch 03/13] eeprom_93cx6: Add data direction control.
From: Jean Delvare @ 2010-04-30 11:12 UTC (permalink / raw)
  To: Ben Dooks; +Cc: netdev, tristram.ha, support, Wolfram Sang, Linux Kernel
In-Reply-To: <20100429231739.749602911@fluff.org.uk>

Hi Ben,

On Fri, 30 Apr 2010 00:16:24 +0100, Ben Dooks wrote:
> Some devices need to know if the data is to be output or read, so add a
> data direction into the eeprom structure to tell the driver whether the
> data line should be driven.
> 
> The user in this case is the Micrel KS8851 which has a direction
> control for the EEPROM data line and thus needs to know whether
> to drive it (writing) or to tristate it for receiving.
> 
> Signed-off-by: Ben Dooks <ben@simtec.co.uk>
> Cc: Wolfram Sang <w.sang@pengutronix.de>
> Cc: Jean Delvare <khali@linux-fr.org>
> Cc: Linux Kernel <linux-kernel@vger.kernel.org>

I don't know anything about these driver and device, so don't expect
any input from me.

-- 
Jean Delvare

^ permalink raw reply

* Re: [PATCH v6] net: batch skb dequeueing from softnet input_pkt_queue
From: Eric Dumazet @ 2010-04-29 19:12 UTC (permalink / raw)
  To: Andi Kleen, Andi Kleen
  Cc: hadi, Changli Gao, David S. Miller, Tom Herbert,
	Stephen Hemminger, netdev, Andi Kleen, lenb, arjan
In-Reply-To: <20100429182347.GA8512@gargoyle.fritz.box>

Le jeudi 29 avril 2010 à 20:23 +0200, Andi Kleen a écrit :
> On Thu, Apr 29, 2010 at 07:56:12PM +0200, Eric Dumazet wrote:
> > Le jeudi 29 avril 2010 à 19:42 +0200, Andi Kleen a écrit :
> > > > Andi, what do you think of this one ?
> > > > Dont we have a function to send an IPI to an individual cpu instead ?
> > > 
> > > That's what this function already does. You only set a single CPU 
> > > in the target mask, right?
> > > 
> > > IPIs are unfortunately always a bit slow. Nehalem-EX systems have X2APIC
> > > which is a bit faster for this, but that's not available in the lower
> > > end Nehalems. But even then it's not exactly fast.
> > > 
> > > I don't think the IPI primitive can be optimized much. It's not a cheap 
> > > operation.
> > > 
> > > If it's a problem do it less often and batch IPIs.
> > > 
> > > It's essentially the same problem as interrupt mitigation or NAPI 
> > > are solving for NICs. I guess just need a suitable mitigation mechanism.
> > > 
> > > Of course that would move more work to the sending CPU again, but 
> > > perhaps there's no alternative. I guess you could make it cheaper it by
> > > minimizing access to packet data.
> > > 
> > > -Andi
> > 
> > Well, IPI are already batched, and rate is auto adaptative.
> > 
> > After various changes, it seems things are going better, maybe there is
> > something related to cache line trashing.
> > 
> > I 'solved' it by using idle=poll, but you might take a look at
> > clockevents_notify (acpi_idle_enter_bm) abuse of a shared and higly
> > contended spinlock...
> 
> acpi_idle_enter_bm should not be executed on a Nehalem, it's obsolete.
> If it does on your system something is wrong.
> 
> Ahh, that triggers a bell. There's one issue that if the remote CPU is in a very
> deep idle state it could take a long time to wake it up. Nehalem has deeper
> sleep states than earlier CPUs. When this happens the IPI sender will be slow
> too I believe.
> 
> Are the target CPUs idle? 
> 

Yes, mostly, but about 200.000 wakeups per second I would say...

If a cpu in deep state receives an IPI, process a softirq, should it
come back to deep state immediately, or should it wait for some
milliseconds ?

> Perhaps need to feed some information to cpuidle's governour to prevent this problem.
> 
> idle=poll is very drastic, better to limit to C1 
> 

How can I do this ?

Thanks !



^ permalink raw reply

* [PATCH 2/3] ptp: Added a clock that uses the Linux system time.
From: Richard Cochran @ 2010-04-29  9:19 UTC (permalink / raw)
  To: netdev

This PTP clock simply uses the NTP time adjustment system calls. The
driver can be used in systems that lack a special hardware PTP clock.

Signed-off-by: Richard Cochran <richard.cochran@omicron.at>
---
 drivers/ptp/Kconfig     |   12 +++++
 drivers/ptp/Makefile    |    1 +
 drivers/ptp/ptp_linux.c |  122 +++++++++++++++++++++++++++++++++++++++++++++++
 kernel/time/ntp.c       |    2 +
 4 files changed, 137 insertions(+), 0 deletions(-)
 create mode 100644 drivers/ptp/ptp_linux.c

diff --git a/drivers/ptp/Kconfig b/drivers/ptp/Kconfig
index c80a25b..9390d44 100644
--- a/drivers/ptp/Kconfig
+++ b/drivers/ptp/Kconfig
@@ -23,4 +23,16 @@ config PTP_1588_CLOCK
 	  To compile this driver as a module, choose M here: the module
 	  will be called ptp_clock.
 
+config PTP_1588_CLOCK_LINUX
+	tristate "Linux system timer as PTP clock"
+	depends on PTP_1588_CLOCK
+	help
+	  This driver adds support for using the standard Linux time
+	  source as a PTP clock. This clock is only useful if your PTP
+	  programs are using software time stamps for the PTP Ethernet
+	  packets.
+
+	  To compile this driver as a module, choose M here: the module
+	  will be called ptp_linux.
+
 endmenu
diff --git a/drivers/ptp/Makefile b/drivers/ptp/Makefile
index b86695c..1651d52 100644
--- a/drivers/ptp/Makefile
+++ b/drivers/ptp/Makefile
@@ -3,3 +3,4 @@
 #
 
 obj-$(CONFIG_PTP_1588_CLOCK)		+= ptp_clock.o
+obj-$(CONFIG_PTP_1588_CLOCK_LINUX)	+= ptp_linux.o
diff --git a/drivers/ptp/ptp_linux.c b/drivers/ptp/ptp_linux.c
new file mode 100644
index 0000000..ac0ae6f
--- /dev/null
+++ b/drivers/ptp/ptp_linux.c
@@ -0,0 +1,122 @@
+/*
+ * PTP 1588 clock using the Linux system clock
+ *
+ * Copyright (C) 2010 OMICRON electronics GmbH
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or
+ *  (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with this program; if not, write to the Free Software
+ *  Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+ */
+#include <linux/device.h>
+#include <linux/err.h>
+#include <linux/hrtimer.h>
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/timex.h>
+
+#include <linux/ptp_clock_kernel.h>
+
+static struct ptp_clock *linux_clock;
+
+DEFINE_SPINLOCK(adjtime_lock);
+
+static int ptp_linux_adjfreq(void *priv, s32 delta)
+{
+	struct timex txc;
+	s64 tmp = delta;
+	int err;
+	txc.freq = div_s64(tmp<<16, 1000);
+	txc.modes = ADJ_FREQUENCY;
+	err = do_adjtimex(&txc);
+	return err < 0 ? err : 0;
+}
+
+static int ptp_linux_adjtime(void *priv, struct timespec *ts)
+{
+	s64 delta;
+	ktime_t now;
+	struct timespec t2;
+	unsigned long flags;
+	int err;
+
+	delta = 1000000000LL * ts->tv_sec + ts->tv_nsec;
+
+	spin_lock_irqsave(&adjtime_lock, flags);
+
+	now = ktime_get_real();
+
+	now = delta < 0 ? ktime_sub_ns(now,-delta) : ktime_add_ns(now,delta);
+
+	t2 = ktime_to_timespec(now);
+
+	err = do_settimeofday(&t2);
+
+	spin_unlock_irqrestore(&adjtime_lock, flags);
+
+	return err;
+}
+
+static int ptp_linux_gettime(void *priv, struct timespec *ts)
+{
+	getnstimeofday(ts);
+	return 0;
+}
+
+static int ptp_linux_settime(void *priv, struct timespec *ts)
+{
+	return do_settimeofday(ts);
+}
+
+static int ptp_linux_enable(void *priv, struct ptp_clock_request rq, int on)
+{
+	/* We do not offer any ancillary features at all. */
+	return -EINVAL;
+}
+
+struct ptp_clock_info ptp_linux_caps = {
+	.owner		= THIS_MODULE,
+	.name		= "Linux timer",
+	.max_adj	= 512000,
+	.n_alarm	= 0,
+	.n_ext_ts	= 0,
+	.n_per_out	= 0,
+	.pps		= 0,
+	.priv		= NULL,
+	.adjfreq	= ptp_linux_adjfreq,
+	.adjtime	= ptp_linux_adjtime,
+	.gettime	= ptp_linux_gettime,
+	.settime	= ptp_linux_settime,
+	.enable		= ptp_linux_enable,
+};
+
+/* module operations */
+
+static void __exit ptp_linux_exit(void)
+{
+	ptp_clock_unregister(linux_clock);
+}
+
+static int __init ptp_linux_init(void)
+{
+	linux_clock = ptp_clock_register(&ptp_linux_caps);
+
+	return IS_ERR(linux_clock) ? PTR_ERR(linux_clock) : 0;
+}
+
+subsys_initcall(ptp_linux_init);
+module_exit(ptp_linux_exit);
+
+MODULE_AUTHOR("Richard Cochran <richard.cochran@omicron.at>");
+MODULE_DESCRIPTION("PTP clock using the Linux system timer");
+MODULE_LICENSE("GPL");
diff --git a/kernel/time/ntp.c b/kernel/time/ntp.c
index 7c0f180..efd1ff8 100644
--- a/kernel/time/ntp.c
+++ b/kernel/time/ntp.c
@@ -14,6 +14,7 @@
 #include <linux/timex.h>
 #include <linux/time.h>
 #include <linux/mm.h>
+#include <linux/module.h>
 
 /*
  * NTP timekeeping variables:
@@ -535,6 +536,7 @@ int do_adjtimex(struct timex *txc)
 
 	return result;
 }
+EXPORT_SYMBOL(do_adjtimex);
 
 static int __init ntp_tick_adj_setup(char *str)
 {
-- 
1.6.0.4


^ permalink raw reply related

* Re: [PATCH RFC: linux-next 1/2] irq: Add CPU mask affinity hint callback framework
From: Peter P Waskiewicz Jr @ 2010-04-29 21:29 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: davem@davemloft.net, arjan@linux.jf.intel.com,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <alpine.LFD.2.00.1004292237280.2951@localhost.localdomain>

On Thu, 2010-04-29 at 13:39 -0700, Thomas Gleixner wrote:
> On Thu, 29 Apr 2010, Peter P Waskiewicz Jr wrote:
> > On Thu, 2010-04-29 at 12:48 -0700, Thomas Gleixner wrote:
> > > Thinking more about it, I wonder whether you have a cpu_mask in your
> > > driver/device private data anyway. I bet you have :)
> > 
> > Well, at this point we don't, but nothing says we can't.
> 
> Somewhere you need to store that information in your driver, right ?

Yes.  But right now, storing a cpu_mask for an interrupt wouldn't buy us
anything since we have no mechanism to make use of it today.  :-)

I'll be putting the cpu_mask entry in our q_vector structure, which is
our abstraction of the MSI-X vector (it's where I have the hint struct
right now in patch 2/2 for the ixgbe driver).  It's a simple place to
stick it.

> > > So it should be sufficient to set a pointer to that cpu_mask in
> > > irq_desc and get rid of the callback completely.
> > 
> > So "register" would just assign the pointer, and "unregister" would make
> > sure to NULL the mask pointer out.  I like it.  It'll sure clean things
> > up too.
> 
> Yep, that'd be like the set_irq_chip() function. Just assign the
> pointer under desc->lock.
> 
> Thanks,
> 
> 	tglx



^ permalink raw reply

* Re: [PATCH 0/3] [RFC] ptp: IEEE 1588 clock support
From: Wolfgang Grandegger @ 2010-04-29 20:30 UTC (permalink / raw)
  To: Richard Cochran; +Cc: netdev
In-Reply-To: <20100429153401.GA26741@riccoc20.at.omicron.at>

Richard Cochran wrote:
> On Thu, Apr 29, 2010 at 02:02:59PM +0200, Wolfgang Grandegger wrote:
>> I realized two other netdev drivers already supporting PTP timestamping:
>> igb and bfin_mac. From the PTP developer point of view, the interface
>> looks rather complete to me and it works fine on my MPC8313 setup.
> 
> Do you know whether these two also have PTP clocks? If so, is the API
> that I suggested going to work for controlling those clocks, too?

Well, I don't really know that hardware in detail. I just browsed the
code, e.g:

http://git.kernel.org/?p=linux/kernel/git/davem/net-next-2.6.git;a=commit;h=38c845c76
http://marc.info/?l=linux-netdev&m=126389931509102&w=2

But I believe that your interface is generic enough to support that
hardware as well. Would make sense to put the relevant people on CC and
also "ptpd@lists.infradead.org."

Wolfgang.

^ permalink raw reply

* Re: [PATCH 0/3] [RFC] ptp: IEEE 1588 clock support
From: Wolfgang Grandegger @ 2010-04-29 11:31 UTC (permalink / raw)
  To: Richard Cochran; +Cc: netdev
In-Reply-To: <20100429094208.GA6889@riccoc20.at.omicron.at>

Richard Cochran wrote:
> On Thu, Apr 29, 2010 at 11:24:24AM +0200, Wolfgang Grandegger wrote:
>> I used these interrupt number fixes as well but it was not necessary for
>> the actual net-next-2.6 tree. Need to check why? I remember some version
>> dependent re-mapping code.
> 
> I argued on the ppc list with Scott Wood about adding dts files, one
> for each of mpc8313 rev A, B, and C, but he advocated fixing this
> problem in uboot instead. Is the fix in uboot, or in the kernel?

It seems to be fixed in u-boot:

  commit 7120c888101952b7e61b9e54bb42370904aa0e68
  Author: Kim Phillips <kim.phillips@freescale.com>
  Date:   Mon Oct 12 11:06:19 2009 -0500

    mpc83xx: mpc8313 - handle erratum IPIC1 (TSEC IRQ number swappage)
    
    mpc8313e erratum IPIC1 swapped TSEC interrupt ID numbers on rev. 1
    h/w (see AN3545).  The base device tree in use has rev. 1 ID numbers,
    so if on Rev. 2 (and higher) h/w, we fix them up here.
    
    Signed-off-by: Kim Phillips <kim.phillips@freescale.com>
    Reviewed-by: Roland Lezuo <roland.lezuo@chello.at>

>> That's missing to get the PPS signal output. But it should probably go
>> to gianfar_ptp.c.
> 
> Well, this fix is specific to the mpc8313, but the gianfar_ptp driver
> is for all eTECs. For example, I have the ptp code running on the
> p2020rdb and p2020ds, too.
> 
> I don't think board fixups really belong in the PTP clock driver.
> 
> Just my 2 cents,

I see, fine for me if setting those bits does not harm.

Wolfgang. 

^ permalink raw reply

* Re: [PATCH] bonding: fix arp_validate on bonds inside a bridge
From: Jiri Bohac @ 2010-04-30 15:45 UTC (permalink / raw)
  To: Jay Vosburgh; +Cc: Jiri Bohac, bonding-devel, netdev
In-Reply-To: <26811.1272567443@death.nxdomain.ibm.com>

On Thu, Apr 29, 2010 at 11:57:23AM -0700, Jay Vosburgh wrote:
> 	This doesn't apply to the current net-next-2.6 (because
> skb_bond_should_drop was pulled out of line a few weeks ago); can you
> update the patch?

sure, here it goes:

----

bonding with arp_validate does not currently work when the
bonding master is part of a bridge. This is because
bond_arp_rcv() is registered as a packet type handler for ARP,
but before netif_receive_skb() processes the ptype_base hash
table, handle_bridge() is called and changes the skb->dev to
point to the bridge device.

This patch makes bonding_should_drop() call the bonding ARP
handler directly if a IFF_MASTER_NEEDARP flag is set on the
bonding master.  bond_register_arp() now only needs to set the
IFF_MASTER_NEEDARP flag.

We no longer need special ARP handling for inactive slaves, hence
IFF_SLAVE_NEEDARP is not needed.

skb_reset_network_header() and skb_reset_transport_header() need
to be called before the call to bonding_should_drop() because
bond_handle_arp() needs the offsets initialized.

As a side-effect, skb_bond_should_drop is #defined as 0
when CONFIG_BONDING is not set.

Signed-off-by: Jiri Bohac <jbohac@suse.cz>

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 85e813c..b149bfa 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -1879,8 +1879,7 @@ int bond_release(struct net_device *bond_dev, struct net_device *slave_dev)
 	}
 
 	slave_dev->priv_flags &= ~(IFF_MASTER_8023AD | IFF_MASTER_ALB |
-				   IFF_SLAVE_INACTIVE | IFF_BONDING |
-				   IFF_SLAVE_NEEDARP);
+				   IFF_SLAVE_INACTIVE | IFF_BONDING);
 
 	kfree(slave);
 
@@ -2551,11 +2550,12 @@ static void bond_validate_arp(struct bonding *bond, struct slave *slave, __be32
 	}
 }
 
-static int bond_arp_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt, struct net_device *orig_dev)
+static void bond_handle_arp(struct sk_buff *skb)
 {
 	struct arphdr *arp;
 	struct slave *slave;
 	struct bonding *bond;
+	struct net_device *dev = skb->dev->master, *orig_dev = skb->dev;
 	unsigned char *arp_ptr;
 	__be32 sip, tip;
 
@@ -2576,9 +2576,8 @@ static int bond_arp_rcv(struct sk_buff *skb, struct net_device *dev, struct pack
 	bond = netdev_priv(dev);
 	read_lock(&bond->lock);
 
-	pr_debug("bond_arp_rcv: bond %s skb->dev %s orig_dev %s\n",
-		 bond->dev->name, skb->dev ? skb->dev->name : "NULL",
-		 orig_dev ? orig_dev->name : "NULL");
+	pr_debug("bond_handle_arp: bond: %s, master: %s, slave: %s\n",
+		bond->dev->name, dev->name, orig_dev->name);
 
 	slave = bond_get_slave_by_dev(bond, orig_dev);
 	if (!slave || !slave_do_arp_validate(bond, slave))
@@ -2623,8 +2622,7 @@ static int bond_arp_rcv(struct sk_buff *skb, struct net_device *dev, struct pack
 out_unlock:
 	read_unlock(&bond->lock);
 out:
-	dev_kfree_skb(skb);
-	return NET_RX_SUCCESS;
+	return;
 }
 
 /*
@@ -3506,23 +3504,12 @@ static void bond_unregister_lacpdu(struct bonding *bond)
 
 void bond_register_arp(struct bonding *bond)
 {
-	struct packet_type *pt = &bond->arp_mon_pt;
-
-	if (pt->type)
-		return;
-
-	pt->type = htons(ETH_P_ARP);
-	pt->dev = bond->dev;
-	pt->func = bond_arp_rcv;
-	dev_add_pack(pt);
+	bond->dev->priv_flags |= IFF_MASTER_NEEDARP;
 }
 
 void bond_unregister_arp(struct bonding *bond)
 {
-	struct packet_type *pt = &bond->arp_mon_pt;
-
-	dev_remove_pack(pt);
-	pt->type = 0;
+	bond->dev->priv_flags &= ~IFF_MASTER_NEEDARP;
 }
 
 /*---------------------------- Hashing Policies -----------------------------*/
@@ -4967,6 +4954,7 @@ static struct pernet_operations bond_net_ops = {
 	.size = sizeof(struct bond_net),
 };
 
+extern void (*bond_handle_arp_hook)(struct sk_buff *skb);
 static int __init bonding_init(void)
 {
 	int i;
@@ -4999,6 +4987,7 @@ static int __init bonding_init(void)
 	register_netdevice_notifier(&bond_netdev_notifier);
 	register_inetaddr_notifier(&bond_inetaddr_notifier);
 	bond_register_ipv6_notifier();
+	bond_handle_arp_hook = bond_handle_arp;
 out:
 	return res;
 err:
@@ -5019,6 +5008,7 @@ static void __exit bonding_exit(void)
 
 	rtnl_link_unregister(&bond_link_ops);
 	unregister_pernet_subsys(&bond_net_ops);
+	bond_handle_arp_hook = NULL;
 }
 
 module_init(bonding_init);
diff --git a/drivers/net/bonding/bonding.h b/drivers/net/bonding/bonding.h
index 2aa3367..64e0108 100644
--- a/drivers/net/bonding/bonding.h
+++ b/drivers/net/bonding/bonding.h
@@ -212,7 +212,6 @@ struct bonding {
 	struct   bond_params params;
 	struct   list_head vlan_list;
 	struct   vlan_group *vlgrp;
-	struct   packet_type arp_mon_pt;
 	struct   workqueue_struct *wq;
 	struct   delayed_work mii_work;
 	struct   delayed_work arp_work;
@@ -292,14 +291,12 @@ static inline void bond_set_slave_inactive_flags(struct slave *slave)
 	if (!bond_is_lb(bond))
 		slave->state = BOND_STATE_BACKUP;
 	slave->dev->priv_flags |= IFF_SLAVE_INACTIVE;
-	if (slave_do_arp_validate(bond, slave))
-		slave->dev->priv_flags |= IFF_SLAVE_NEEDARP;
 }
 
 static inline void bond_set_slave_active_flags(struct slave *slave)
 {
 	slave->state = BOND_STATE_ACTIVE;
-	slave->dev->priv_flags &= ~(IFF_SLAVE_INACTIVE | IFF_SLAVE_NEEDARP);
+	slave->dev->priv_flags &= ~IFF_SLAVE_INACTIVE;
 }
 
 static inline void bond_set_master_3ad_flags(struct bonding *bond)
diff --git a/include/linux/if.h b/include/linux/if.h
index 3a9f410..84ab2c8 100644
--- a/include/linux/if.h
+++ b/include/linux/if.h
@@ -63,7 +63,7 @@
 #define IFF_MASTER_8023AD	0x8	/* bonding master, 802.3ad. 	*/
 #define IFF_MASTER_ALB	0x10		/* bonding master, balance-alb.	*/
 #define IFF_BONDING	0x20		/* bonding master or slave	*/
-#define IFF_SLAVE_NEEDARP 0x40		/* need ARPs for validation	*/
+#define IFF_MASTER_NEEDARP 0x40		/* need ARPs for validation	*/
 #define IFF_ISATAP	0x80		/* ISATAP interface (RFC4214)	*/
 #define IFF_MASTER_ARPMON 0x100		/* bonding master, ARP mon in use */
 #define IFF_WAN_HDLC	0x200		/* WAN HDLC device		*/
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 40d4c20..fa27d16 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2162,6 +2162,7 @@ static inline void netif_set_gso_max_size(struct net_device *dev,
 	dev->gso_max_size = size;
 }
 
+#if defined(CONFIG_BONDING) || defined(CONFIG_BONDING_MODULE)
 extern int __skb_bond_should_drop(struct sk_buff *skb,
 				  struct net_device *master);
 
@@ -2172,6 +2173,9 @@ static inline int skb_bond_should_drop(struct sk_buff *skb,
 		return __skb_bond_should_drop(skb, master);
 	return 0;
 }
+#else
+#define skb_bond_should_drop(a, b) 0
+#endif
 
 extern struct pernet_operations __net_initdata loopback_net_ops;
 
diff --git a/net/core/dev.c b/net/core/dev.c
index 100dcbd..2689ff0 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2734,6 +2734,10 @@ static inline void skb_bond_set_mac_by_master(struct sk_buff *skb,
 	}
 }
 
+#if defined(CONFIG_BONDING) || defined(CONFIG_BONDING_MODULE)
+void (*bond_handle_arp_hook)(struct sk_buff *skb);
+EXPORT_SYMBOL_GPL(bond_handle_arp_hook);
+
 /* On bonding slaves other than the currently active slave, suppress
  * duplicates except for 802.3ad ETH_P_SLOW, alb non-mcast/bcast, and
  * ARP on active-backup slaves with arp_validate enabled.
@@ -2753,11 +2757,13 @@ int __skb_bond_should_drop(struct sk_buff *skb, struct net_device *master)
 		skb_bond_set_mac_by_master(skb, master);
 	}
 
-	if (dev->priv_flags & IFF_SLAVE_INACTIVE) {
-		if ((dev->priv_flags & IFF_SLAVE_NEEDARP) &&
-		    skb->protocol == __cpu_to_be16(ETH_P_ARP))
-			return 0;
+	/* pass ARP frames directly to bonding
+	   before bridging or other hooks change them */
+	if ((master->priv_flags & IFF_MASTER_NEEDARP) &&
+	    skb->protocol == __cpu_to_be16(ETH_P_ARP))
+		bond_handle_arp_hook(skb);
 
+	if (dev->priv_flags & IFF_SLAVE_INACTIVE) {
 		if (master->priv_flags & IFF_MASTER_ALB) {
 			if (skb->pkt_type != PACKET_BROADCAST &&
 			    skb->pkt_type != PACKET_MULTICAST)
@@ -2772,6 +2778,7 @@ int __skb_bond_should_drop(struct sk_buff *skb, struct net_device *master)
 	return 0;
 }
 EXPORT_SYMBOL(__skb_bond_should_drop);
+#endif
 
 static int __netif_receive_skb(struct sk_buff *skb)
 {
@@ -2796,6 +2803,10 @@ static int __netif_receive_skb(struct sk_buff *skb)
 	if (!skb->skb_iif)
 		skb->skb_iif = skb->dev->ifindex;
 
+	skb_reset_network_header(skb);
+	skb_reset_transport_header(skb);
+	skb->mac_len = skb->network_header - skb->mac_header;
+
 	null_or_orig = NULL;
 	orig_dev = skb->dev;
 	master = ACCESS_ONCE(orig_dev->master);
@@ -2808,10 +2819,6 @@ static int __netif_receive_skb(struct sk_buff *skb)
 
 	__get_cpu_var(netdev_rx_stat).total++;
 
-	skb_reset_network_header(skb);
-	skb_reset_transport_header(skb);
-	skb->mac_len = skb->network_header - skb->mac_header;
-
 	pt_prev = NULL;
 
 	rcu_read_lock();
-- 
Jiri Bohac <jbohac@suse.cz>
SUSE Labs, SUSE CZ


^ permalink raw reply related

* ixgbe and mac-vlans problem
From: Ben Greear @ 2010-04-29 22:27 UTC (permalink / raw)
  To: NetDev; +Cc: Patrick McHardy

Hello!

We are seeing a strange problem with mac-vlans on top of an 82599 NIC (ixgbe)
on a 2.6.31.y (plus patches) kernel.

Basically, we create 50 mac-vlans, with sequential MAC addresses and sequential
IP addresses, and set up ip rules properly.

The issue is that only 10 or so of the mac-vlans receive other than
broadcast packets.  The ixgbe NIC doesn't show PROMISC mode.

If we manually turn on PROMISC mode on the physical ixgbe interface,
then everything works properly.  I'm guessing the ixgbe NIC must
have some filters that mac-vlan is trying to enable w/out turning on
full PROMISC mode.  Is there any easy way to debug these filters
to see if they are being properly configured?

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com

^ permalink raw reply

* Re: patch sysfs-implement-sysfs-tagged-directory-support.patch added to gregkh-2.6 tree
From: Serge E. Hallyn @ 2010-04-30 14:29 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Eric W. Biederman, Greg KH, bcrl, benjamin.thery, cornelia.huck,
	eric.dumazet, kay.sievers, netdev
In-Reply-To: <4BDA6C90.9010303@kernel.org>

Quoting Tejun Heo (tj@kernel.org):
> Hello,
> 
> On 04/30/2010 07:24 AM, Eric W. Biederman wrote:
> >>> I wish at least more comments are added before it goes mainline.  I
> >>> don't really understand the current form.
> >>
> >> Ok, that's fine with me, I'll pull it back out.
> > 
> > ?????
> > 
> > Tejun you have offered nothing constructive to the review, except looking
> > and saying you don't understand what is going on.
> 
> Eric, no need to get too touchy and you're right in part in saying all
> I'm saying is basically "I don't understand it" which is the same
> reason why I'm not nacking it and explicitly stated that I would be
> okay with the series going in if Greg/Kay would be okay with it.
> Again, about the same thing with the above comment, I was *wishing*
> for more comments *before it goes mainline*.

I'm not sure if you mean "more in-line comments" or more discussion.  If
you mean the latter, then I think the patch intro was deceptive as it has
gotten more acks than that - I acked the whole set, and I think it didn't
get more discussion than it did because it got much discussion in previous
versions.

This subset looks a bit mysterious because it offers the support for
tagged /sys/class/net, but that implementation, which clarifies why some
of this is done, comes in the later patches in Eric's set.  Can you please
jump to his tree, take a look at 
http://git.kernel.org/gitweb.cgi?p=linux/kernel/git/ebiederm/linux-2.6.32-rc5-sysfs-enhancements.git;a=commit;h=e7468796a9756b28e0ab38eb021025bbd3712823
and let us know if that does not clarify?

Hmm, but looking back over the previous thread (Mar 31) I guess you
mean more in-line comments around the callbacks, presumably things
like class_dir_child_ns_type() and struct kobj_ns_type_operations
members?

It sounds like what you'd really like is to have any explicit mention
to namespaces pulled out of drivers/base (layering as you keep saying)?
But will there be a use for this outside of namespaces?  Does trying to
anticipate that fall into the category of over-abstraction?

-serge

^ permalink raw reply

* Re: [PATCH net-next-2.6] net: speedup udp receive path
From: Changli Gao @ 2010-04-29 12:12 UTC (permalink / raw)
  To: hadi
  Cc: Eric Dumazet, David Miller, therbert, shemminger, netdev,
	Eilon Greenstein, Brian Bloniarz
In-Reply-To: <1272540952.4258.161.camel@bigi>

On Thu, Apr 29, 2010 at 7:35 PM, jamal <hadi@cyberus.ca> wrote:
>
> Same here - even in my worst case scenario 88.5% of 750Kpps > 600Kpps.
> Attached is history results to make more sense of what i am saying:
> we have net-next kernels from apr14, apr23, apr23 with changlis change,
> apr28, apr28 with your change. What you'll see is non-rps (blue) gets
> better and rps (Orange) gets better slowly then by apr28 it is worse.

Did the number of IPIs increase in the apr28 test? The finial patch
with Eric's change may introduce more IPIs. And I am wondering why
23rdcl-non-rps is better than before. Maybe it is the side effect of
my patch: enlarge the netdev_max_backlog.


-- 
Regards，
Changli Gao(xiaosuo@gmail.com)

^ permalink raw reply

* Re: [PATCH net-next-2.6] net: sock_def_readable() and friends RCU conversion
From: Brian Bloniarz @ 2010-04-30 13:55 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: hadi, Changli Gao, David Miller, therbert, shemminger, netdev,
	Eilon Greenstein
In-Reply-To: <1272574909.2209.150.camel@edumazet-laptop>

Eric Dumazet wrote:
> Here is last 'patch of the day' for me ;)
> Next one will be able to coalesce wakeup calls (they'll be delayed at
> the end of net_rx_action(), like a patch I did last year to help
> multicast reception)
>
> vger seems to be down, I suspect I'll have to resend it later.
>
> [PATCH net-next-2.6] net: sock_def_readable() and friends RCU conversion
>
> sk_callback_lock rwlock actually protects sk->sk_sleep pointer, so we
> need two atomic operations (and associated dirtying) per incoming
> packet.
>   

This patch boots for me, I haven't noticed any strangeness yet.

I ran a few benchmarks (the multicast fan-out mcasttest.c
from last year, a few other things we have lying around).
I think I see a modest improvement from this and your other
2 packets. Presumably the big wins are where multiple cores
perform bh for the same socket, that's not the case in
these benchmarks. If it's appropriate:

Tested-by: Brian Bloniarz <bmb@athenacr.com>

> Next one will be able to coalesce wakeup calls (they'll be delayed at
> the end of net_rx_action(), like a patch I did last year to help
> multicast reception)

Keep em coming :)

^ permalink raw reply

* Re: [PATCH net-next-2.6] net: sock_def_readable() and friends RCU conversion
From: Eric Dumazet @ 2010-04-30 17:26 UTC (permalink / raw)
  To: Brian Bloniarz
  Cc: hadi, Changli Gao, David Miller, therbert, shemminger, netdev,
	Eilon Greenstein
In-Reply-To: <4BDAE156.8070800@athenacr.com>

Le vendredi 30 avril 2010 à 09:55 -0400, Brian Bloniarz a écrit :

> 
> This patch boots for me, I haven't noticed any strangeness yet.
> 
> I ran a few benchmarks (the multicast fan-out mcasttest.c
> from last year, a few other things we have lying around).
> I think I see a modest improvement from this and your other
> 2 packets. Presumably the big wins are where multiple cores
> perform bh for the same socket, that's not the case in
> these benchmarks. If it's appropriate:
> 
> Tested-by: Brian Bloniarz <bmb@athenacr.com>
> 
> > Next one will be able to coalesce wakeup calls (they'll be delayed at
> > the end of net_rx_action(), like a patch I did last year to help
> > multicast reception)
> 
> Keep em coming :)

Thanks for testing !

Here is a respin of "net: relax dst refcnt in input path"
patch for net-next-2.6

Not ready for inclusion, but seems to work quite well on multicast
load : I get about 20% more packets on mcasttest

(Avoid atomic ops on dst entries on input path, and partly on forwading
path). On mccasttest, all sockets share same dst, so producer/consumers
all fight on a single cache line.

Old ref (for informations) :
http://kerneltrap.org/mailarchive/linux-netdev/2009/7/22/6248753

Not-Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

 include/linux/skbuff.h    |   45 +++++++++++++++++++++++++++++++++-
 include/net/dst.h         |   47 +++++++++++++++++++++++++++++++++---
 include/net/route.h       |    2 -
 include/net/sock.h        |    2 +
 net/bridge/br_netfilter.c |    2 -
 net/core/dev.c            |    3 ++
 net/core/skbuff.c         |    3 +-
 net/core/sock.c           |    6 ++++
 net/ipv4/arp.c            |    2 -
 net/ipv4/icmp.c           |    8 +++---
 net/ipv4/ip_forward.c     |    1 
 net/ipv4/ip_fragment.c    |    2 -
 net/ipv4/ip_input.c       |    2 -
 net/ipv4/ip_options.c     |   11 ++++----
 net/ipv4/netfilter.c      |    8 +++---
 net/ipv4/route.c          |   15 +++++++----
 net/ipv4/xfrm4_input.c    |    2 -
 net/ipv6/ip6_tunnel.c     |    2 -
 net/netfilter/nf_queue.c  |    2 +
 net/sched/sch_generic.c   |    2 -
 20 files changed, 136 insertions(+), 31 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 82f5116..6195bcf 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -414,16 +414,59 @@ struct sk_buff {
 
 #include <asm/system.h>
 
+/*
+ * skb might have a dst pointer attached, refcounted or not
+ * _skb_dst low order bit is set if refcount was taken
+ */
+#define SKB_DST_NOREF	1UL
+#define SKB_DST_PTRMASK	~(SKB_DST_NOREF)
+
+/**
+ * skb_dst - returns skb dst_entry
+ * @skb: buffer
+ *
+ * Returns skb dst_entry, regardless of reference taken or not.
+ */
 static inline struct dst_entry *skb_dst(const struct sk_buff *skb)
 {
-	return (struct dst_entry *)skb->_skb_dst;
+	return (struct dst_entry *)(skb->_skb_dst & SKB_DST_PTRMASK);
 }
 
+/**
+ * skb_dst_set - sets skb dst
+ * @skb: buffer
+ * @dst: dst entry
+ *
+ * Sets skb dst, assuming a reference was taken on dst and should
+ * be released by skb_dst_drop()
+ */
 static inline void skb_dst_set(struct sk_buff *skb, struct dst_entry *dst)
 {
 	skb->_skb_dst = (unsigned long)dst;
 }
 
+/**
+ * skb_dst_set_noref - sets skb dst, without a reference
+ * @skb: buffer
+ * @dst: dst entry
+ *
+ * Sets skb dst, assuming a reference was _not_ taken on dst
+ * skb_dst_drop() should not dst_release() this dst
+ */
+static inline void skb_dst_set_noref(struct sk_buff *skb, struct dst_entry *dst)
+{
+	skb->_skb_dst = (unsigned long)dst | SKB_DST_NOREF;
+}
+
+/**
+ * skb_dst_is_noref - Test is skb dst isnt refcounted
+ * @skb: buffer
+ */
+static inline bool skb_dst_is_noref(const struct sk_buff *skb)
+{
+	return (skb->_skb_dst & SKB_DST_NOREF) && skb_dst(skb);
+}
+
 static inline struct rtable *skb_rtable(const struct sk_buff *skb)
 {
 	return (struct rtable *)skb_dst(skb);
diff --git a/include/net/dst.h b/include/net/dst.h
index aac5a5f..ad6ea9e 100644
--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -168,6 +168,12 @@ static inline void dst_use(struct dst_entry *dst, unsigned long time)
 	dst->lastuse = time;
 }
 
+static inline void dst_use_noref(struct dst_entry *dst, unsigned long time)
+{
+	dst->__use++;
+	dst->lastuse = time;
+}
+
 static inline
 struct dst_entry * dst_clone(struct dst_entry * dst)
 {
@@ -177,11 +183,46 @@ struct dst_entry * dst_clone(struct dst_entry * dst)
 }
 
 extern void dst_release(struct dst_entry *dst);
+
+static inline void __skb_dst_drop(unsigned long _skb_dst)
+{
+	if (!(_skb_dst & SKB_DST_NOREF))
+		dst_release((struct dst_entry *)(_skb_dst & SKB_DST_PTRMASK));
+}
+
+/**
+ * skb_dst_drop - drops skb dst
+ * @skb: buffer
+ *
+ * Drops dst reference count if a reference was taken.
+ */
 static inline void skb_dst_drop(struct sk_buff *skb)
 {
-	if (skb->_skb_dst)
-		dst_release(skb_dst(skb));
-	skb->_skb_dst = 0UL;
+	if (skb->_skb_dst) {
+		__skb_dst_drop(skb->_skb_dst);
+		skb->_skb_dst = 0UL;
+	}
+}
+
+static inline void skb_dst_copy(struct sk_buff *nskb, const struct sk_buff *oskb)
+{
+	nskb->_skb_dst = oskb->_skb_dst;
+	if (!(nskb->_skb_dst & SKB_DST_NOREF))
+		dst_clone(skb_dst(nskb));
+}
+
+/**
+ * skb_dst_force - makes sure skb dst is refcounted
+ * @skb: buffer
+ *
+ * If dst is not yet refcounted, let's do it
+ */
+static inline void skb_dst_force(struct sk_buff *skb)
+{
+	if (skb->_skb_dst & SKB_DST_NOREF) {
+		skb->_skb_dst &= ~SKB_DST_NOREF;
+		dst_clone(skb_dst(skb));
+	}
 }
 
 /* Children define the path of the packet through the
diff --git a/include/net/route.h b/include/net/route.h
index 2c9fba7..443f6d4 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -112,7 +112,7 @@ extern void		rt_cache_flush_batch(void);
 extern int		__ip_route_output_key(struct net *, struct rtable **, const struct flowi *flp);
 extern int		ip_route_output_key(struct net *, struct rtable **, struct flowi *flp);
 extern int		ip_route_output_flow(struct net *, struct rtable **rp, struct flowi *flp, struct sock *sk, int flags);
-extern int		ip_route_input(struct sk_buff*, __be32 dst, __be32 src, u8 tos, struct net_device *devin);
+extern int		ip_route_input(struct sk_buff*, __be32 dst, __be32 src, u8 tos, struct net_device *devin, bool noref);
 extern unsigned short	ip_rt_frag_needed(struct net *net, struct iphdr *iph, unsigned short new_mtu, struct net_device *dev);
 extern void		ip_rt_send_redirect(struct sk_buff *skb);
 
diff --git a/include/net/sock.h b/include/net/sock.h
index d361c77..0a0f14d 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -598,6 +598,8 @@ static inline int sk_stream_memory_free(struct sock *sk)
 /* OOB backlog add */
 static inline void __sk_add_backlog(struct sock *sk, struct sk_buff *skb)
 {
+	/* dont let skb dst not referenced, we are going to leave rcu lock */
+	skb_dst_force(skb);
 	if (!sk->sk_backlog.tail) {
 		sk->sk_backlog.head = sk->sk_backlog.tail = skb;
 	} else {
diff --git a/net/bridge/br_netfilter.c b/net/bridge/br_netfilter.c
index 4c4977d..c943ad4 100644
--- a/net/bridge/br_netfilter.c
+++ b/net/bridge/br_netfilter.c
@@ -350,7 +350,7 @@ static int br_nf_pre_routing_finish(struct sk_buff *skb)
 	}
 	nf_bridge->mask ^= BRNF_NF_BRIDGE_PREROUTING;
 	if (dnat_took_place(skb)) {
-		if ((err = ip_route_input(skb, iph->daddr, iph->saddr, iph->tos, dev))) {
+		if ((err = ip_route_input(skb, iph->daddr, iph->saddr, iph->tos, dev, false))) {
 			struct flowi fl = {
 				.nl_u = {
 					.ip4_u = {
diff --git a/net/core/dev.c b/net/core/dev.c
index 100dcbd..c331b0e 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2047,6 +2047,8 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 		 * waiting to be sent out; and the qdisc is not running -
 		 * xmit the skb directly.
 		 */
+		if (!(dev->priv_flags & IFF_XMIT_DST_RELEASE))
+			skb_dst_force(skb);
 		__qdisc_update_bstats(q, skb->len);
 		if (sch_direct_xmit(skb, q, dev, txq, root_lock))
 			__qdisc_run(q);
@@ -2055,6 +2057,7 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 
 		rc = NET_XMIT_SUCCESS;
 	} else {
+		skb_dst_force(skb);
 		rc = qdisc_enqueue_root(skb, q);
 		qdisc_run(q);
 	}
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 4218ff4..f400196 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -531,7 +531,8 @@ static void __copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
 	new->transport_header	= old->transport_header;
 	new->network_header	= old->network_header;
 	new->mac_header		= old->mac_header;
-	skb_dst_set(new, dst_clone(skb_dst(old)));
+
+	skb_dst_copy(new, old);
 	new->rxhash		= old->rxhash;
 #ifdef CONFIG_XFRM
 	new->sp			= secpath_get(old->sp);
diff --git a/net/core/sock.c b/net/core/sock.c
index 5104175..894bed6 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -307,6 +307,11 @@ int sock_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
 	 */
 	skb_len = skb->len;
 
+	/* we escape from rcu protected region, make sure we dont leak
+	 * a norefcounted dst
+	 */
+	skb_dst_force(skb);
+
 	spin_lock_irqsave(&list->lock, flags);
 	skb->dropcount = atomic_read(&sk->sk_drops);
 	__skb_queue_tail(list, skb);
@@ -1535,6 +1540,7 @@ static void __release_sock(struct sock *sk)
 		do {
 			struct sk_buff *next = skb->next;
 
+			WARN_ON_ONCE(skb_dst_is_noref(skb));
 			skb->next = NULL;
 			sk_backlog_rcv(sk, skb);
 
diff --git a/net/ipv4/arp.c b/net/ipv4/arp.c
index 6e74706..502ac9f 100644
--- a/net/ipv4/arp.c
+++ b/net/ipv4/arp.c
@@ -854,7 +854,7 @@ static int arp_process(struct sk_buff *skb)
 	}
 
 	if (arp->ar_op == htons(ARPOP_REQUEST) &&
-	    ip_route_input(skb, tip, sip, 0, dev) == 0) {
+	    ip_route_input(skb, tip, sip, 0, dev, true) == 0) {
 
 		rt = skb_rtable(skb);
 		addr_type = rt->rt_type;
diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index f3d339f..a113c08 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -587,20 +587,20 @@ void icmp_send(struct sk_buff *skb_in, int type, int code, __be32 info)
 			err = __ip_route_output_key(net, &rt2, &fl);
 		else {
 			struct flowi fl2 = {};
-			struct dst_entry *odst;
+			unsigned long odst;
 
 			fl2.fl4_dst = fl.fl4_src;
 			if (ip_route_output_key(net, &rt2, &fl2))
 				goto relookup_failed;
 
 			/* Ugh! */
-			odst = skb_dst(skb_in);
+			odst = skb_in->_skb_dst; /* save old dst */
 			err = ip_route_input(skb_in, fl.fl4_dst, fl.fl4_src,
-					     RT_TOS(tos), rt2->u.dst.dev);
+					     RT_TOS(tos), rt2->u.dst.dev, false);
 
 			dst_release(&rt2->u.dst);
 			rt2 = skb_rtable(skb_in);
-			skb_dst_set(skb_in, odst);
+			skb_in->_skb_dst = odst; /* restore old dst */
 		}
 
 		if (err)
diff --git a/net/ipv4/ip_forward.c b/net/ipv4/ip_forward.c
index af10942..0f58609 100644
--- a/net/ipv4/ip_forward.c
+++ b/net/ipv4/ip_forward.c
@@ -57,6 +57,7 @@ int ip_forward(struct sk_buff *skb)
 	struct rtable *rt;	/* Route we use */
 	struct ip_options * opt	= &(IPCB(skb)->opt);
 
+/*	pr_err("ip_forward() skb->dst=%lx\n", skb->_skb_dst);*/
 	if (skb_warn_if_lro(skb))
 		goto drop;
 
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 75347ea..cbcde7a 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -220,7 +220,7 @@ static void ip_expire(unsigned long arg)
 		if (qp->user == IP_DEFRAG_CONNTRACK_IN && !skb_dst(head)) {
 			const struct iphdr *iph = ip_hdr(head);
 			int err = ip_route_input(head, iph->daddr, iph->saddr,
-						 iph->tos, head->dev);
+						 iph->tos, head->dev, false);
 			if (unlikely(err))
 				goto out_rcu_unlock;
 
diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
index f8ab7a3..5d365e8 100644
--- a/net/ipv4/ip_input.c
+++ b/net/ipv4/ip_input.c
@@ -332,7 +332,7 @@ static int ip_rcv_finish(struct sk_buff *skb)
 	 */
 	if (skb_dst(skb) == NULL) {
 		int err = ip_route_input(skb, iph->daddr, iph->saddr, iph->tos,
-					 skb->dev);
+					 skb->dev, true);
 		if (unlikely(err)) {
 			if (err == -EHOSTUNREACH)
 				IP_INC_STATS_BH(dev_net(skb->dev),
diff --git a/net/ipv4/ip_options.c b/net/ipv4/ip_options.c
index 4c09a31..1b65d68 100644
--- a/net/ipv4/ip_options.c
+++ b/net/ipv4/ip_options.c
@@ -601,6 +601,7 @@ int ip_options_rcv_srr(struct sk_buff *skb)
 	unsigned char *optptr = skb_network_header(skb) + opt->srr;
 	struct rtable *rt = skb_rtable(skb);
 	struct rtable *rt2;
+	unsigned long odst;
 	int err;
 
 	if (!opt->srr)
@@ -624,16 +625,16 @@ int ip_options_rcv_srr(struct sk_buff *skb)
 		}
 		memcpy(&nexthop, &optptr[srrptr-1], 4);
 
-		rt = skb_rtable(skb);
+		odst = skb->_skb_dst;
 		skb_dst_set(skb, NULL);
-		err = ip_route_input(skb, nexthop, iph->saddr, iph->tos, skb->dev);
+		err = ip_route_input(skb, nexthop, iph->saddr, iph->tos, skb->dev, false);
 		rt2 = skb_rtable(skb);
 		if (err || (rt2->rt_type != RTN_UNICAST && rt2->rt_type != RTN_LOCAL)) {
-			ip_rt_put(rt2);
-			skb_dst_set(skb, &rt->u.dst);
+			skb_dst_drop(skb);
+			skb->_skb_dst = odst;
 			return -EINVAL;
 		}
-		ip_rt_put(rt);
+		__skb_dst_drop(odst);
 		if (rt2->rt_type != RTN_LOCAL)
 			break;
 		/* Superfast 8) loopback forward */
diff --git a/net/ipv4/netfilter.c b/net/ipv4/netfilter.c
index 82fb43c..e505007 100644
--- a/net/ipv4/netfilter.c
+++ b/net/ipv4/netfilter.c
@@ -17,7 +17,7 @@ int ip_route_me_harder(struct sk_buff *skb, unsigned addr_type)
 	const struct iphdr *iph = ip_hdr(skb);
 	struct rtable *rt;
 	struct flowi fl = {};
-	struct dst_entry *odst;
+	unsigned long odst;
 	unsigned int hh_len;
 	unsigned int type;
 
@@ -51,14 +51,14 @@ int ip_route_me_harder(struct sk_buff *skb, unsigned addr_type)
 		if (ip_route_output_key(net, &rt, &fl) != 0)
 			return -1;
 
-		odst = skb_dst(skb);
+		odst = skb->_skb_dst;
 		if (ip_route_input(skb, iph->daddr, iph->saddr,
-				   RT_TOS(iph->tos), rt->u.dst.dev) != 0) {
+				   RT_TOS(iph->tos), rt->u.dst.dev, false) != 0) {
 			dst_release(&rt->u.dst);
 			return -1;
 		}
 		dst_release(&rt->u.dst);
-		dst_release(odst);
+		__skb_dst_drop(odst);
 	}
 
 	if (skb_dst(skb)->error)
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index a947428..4f169ce 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -2300,7 +2300,7 @@ martian_source:
 }
 
 int ip_route_input(struct sk_buff *skb, __be32 daddr, __be32 saddr,
-		   u8 tos, struct net_device *dev)
+		   u8 tos, struct net_device *dev, bool noref)
 {
 	struct rtable * rth;
 	unsigned	hash;
@@ -2326,10 +2326,15 @@ int ip_route_input(struct sk_buff *skb, __be32 daddr, __be32 saddr,
 		    rth->fl.mark == skb->mark &&
 		    net_eq(dev_net(rth->u.dst.dev), net) &&
 		    !rt_is_expired(rth)) {
-			dst_use(&rth->u.dst, jiffies);
+			if (noref) {
+				dst_use_noref(&rth->u.dst, jiffies);
+				skb_dst_set_noref(skb, &rth->u.dst);
+			} else {
+				dst_use(&rth->u.dst, jiffies);
+				skb_dst_set(skb, &rth->u.dst);
+			}
 			RT_CACHE_STAT_INC(in_hit);
 			rcu_read_unlock();
-			skb_dst_set(skb, &rth->u.dst);
 			return 0;
 		}
 		RT_CACHE_STAT_INC(in_hlist_search);
@@ -2991,7 +2996,7 @@ static int inet_rtm_getroute(struct sk_buff *in_skb, struct nlmsghdr* nlh, void
 		skb->protocol	= htons(ETH_P_IP);
 		skb->dev	= dev;
 		local_bh_disable();
-		err = ip_route_input(skb, dst, src, rtm->rtm_tos, dev);
+		err = ip_route_input(skb, dst, src, rtm->rtm_tos, dev, false);
 		local_bh_enable();
 
 		rt = skb_rtable(skb);
@@ -3055,7 +3060,7 @@ int ip_rt_dump(struct sk_buff *skb,  struct netlink_callback *cb)
 				continue;
 			if (rt_is_expired(rt))
 				continue;
-			skb_dst_set(skb, dst_clone(&rt->u.dst));
+			skb_dst_set_noref(skb, dst_clone(&rt->u.dst));
 			if (rt_fill_info(net, skb, NETLINK_CB(cb->skb).pid,
 					 cb->nlh->nlmsg_seq, RTM_NEWROUTE,
 					 1, NLM_F_MULTI) <= 0) {
diff --git a/net/ipv4/xfrm4_input.c b/net/ipv4/xfrm4_input.c
index c791bb6..0366cbc 100644
--- a/net/ipv4/xfrm4_input.c
+++ b/net/ipv4/xfrm4_input.c
@@ -28,7 +28,7 @@ static inline int xfrm4_rcv_encap_finish(struct sk_buff *skb)
 		const struct iphdr *iph = ip_hdr(skb);
 
 		if (ip_route_input(skb, iph->daddr, iph->saddr, iph->tos,
-				   skb->dev))
+				   skb->dev, true))
 			goto drop;
 	}
 	return dst_input(skb);
diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index 2599870..7ae0fa5 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -570,7 +570,7 @@ ip4ip6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
 	} else {
 		ip_rt_put(rt);
 		if (ip_route_input(skb2, eiph->daddr, eiph->saddr, eiph->tos,
-				   skb2->dev) ||
+				   skb2->dev, false) ||
 		    skb_dst(skb2)->dev->type != ARPHRD_TUNNEL)
 			goto out;
 	}
diff --git a/net/netfilter/nf_queue.c b/net/netfilter/nf_queue.c
index c49ef21..cb3cde4 100644
--- a/net/netfilter/nf_queue.c
+++ b/net/netfilter/nf_queue.c
@@ -9,6 +9,7 @@
 #include <linux/rcupdate.h>
 #include <net/protocol.h>
 #include <net/netfilter/nf_queue.h>
+#include <net/dst.h>
 
 #include "nf_internals.h"
 
@@ -170,6 +171,7 @@ static int __nf_queue(struct sk_buff *skb,
 			dev_hold(physoutdev);
 	}
 #endif
+	skb_dst_force(skb);
 	afinfo->saveroute(skb, entry);
 	status = qh->outfn(entry, queuenum);
 
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index aeddabf..21e3976 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -179,7 +179,7 @@ static inline int qdisc_restart(struct Qdisc *q)
 	skb = dequeue_skb(q);
 	if (unlikely(!skb))
 		return 0;
-
+	WARN_ON_ONCE(skb_dst_is_noref(skb));
 	root_lock = qdisc_lock(q);
 	dev = qdisc_dev(q);
 	txq = netdev_get_tx_queue(dev, skb_get_queue_mapping(skb));



^ permalink raw reply related

* Re: OFT - reserving CPU's for networking
From: Eric Dumazet @ 2010-04-29 20:02 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Stephen Hemminger, Andi Kleen, netdev, Andi Kleen, Peter Zijlstra
In-Reply-To: <alpine.LFD.2.00.1004292055471.2951@localhost.localdomain>

Le jeudi 29 avril 2010 à 21:19 +0200, Thomas Gleixner a écrit :

> Say thanks to Intel/AMD for providing us timers which stop in lower
> c-states.
> 
> Not much we can do about the broadcast lock when several cores are
> going idle and we need to setup a global timer to work around the
> lapic timer stops in C2/C3 issue.
> 
> Simply the C-state timer broadcasting does not scale. And it was never
> meant to scale. It's a workaround for laptops to have functional NOHZ.
> 
> There are several ways to work around that on larger machines:
> 
>  - Restrict c-states
>  - Disable NOHZ and highres timers
>  - idle=poll is definitely the worst of all possible solutions
> 
> > I keep getting asked about taking some core's away from clock and scheduler
> > to be reserved just for network processing. Seeing this kind of stuff
> > makes me wonder if maybe that isn't a half bad idea.
> 
> This comes up every few month and we pointed out several times what
> needs to be done to make this work w/o these weird hacks which put a
> core offline and then start some magic undebugable binary blob on it.
> We have not seen anyone working on this, but the "set cores aside and
> let them do X" idea seems to stick in peoples heads.
> 
> Seriously, that's not a solution. It's going to be some hacked up
> nightmare which is completely unmaintainable.
> 
> Aside of that I seriously doubt that you can do networking w/o time
> and timers.
> 

Thanks a lot !

booting with processor.max_cstate=1 solves the problem

(I already had a CONFIG_NO_HZ=no conf, but highres timer enabled)

Even with _carefuly_ chosen crazy configuration (receiving a packet on a
cpu, then transfert it to another cpu, with a full 16x16 matrix
involved), generating 700.000 IPI per second on the machine seems fine
now.




^ permalink raw reply

* FW: Regarding VLAN TAG in wireshark trace
From: Padmalochan Moharana @ 2010-04-29  5:03 UTC (permalink / raw)
  To: netdev

Hello,
The wireshark captured the message without any
VLAN tag because the driver stripped the VLAN tag of the received message.So
wireshark does not see any VLAN tag in the message. I think any other system
setting or driver is required to capture the VLAN tag. So please let me know
which driver deliver the message without stripping the VLAN tag of the
message.

Please post to netdev@vger.kernel.org

Br,
Padmalochan Moharana

^ permalink raw reply

* Re: [net-next-2.6 PATCH 2/2] add ndo_set_port_profile op support for enic dynamic vnics
From: Arnd Bergmann @ 2010-04-29 12:27 UTC (permalink / raw)
  To: Scott Feldman; +Cc: davem, netdev, chrisw, Jens Osterkamp
In-Reply-To: <C7FE070B.2CA5C%scofeldm@cisco.com>

On Thursday 29 April 2010, Scott Feldman wrote:
> On 4/28/10 12:16 PM, "Arnd Bergmann" <arnd@arndb.de> wrote:
> > On Wednesday 28 April 2010, Scott Feldman wrote:
> > 
> > But if the device is already passed to the guest using pass-thru or
> > containers, you would no longer to query or change the port profile,
> > because it is no longer visible in the host, right?
> 
> Drats, I made a mistake here.  You're right, in the pass-thru case the host
> lost control of the device, so we need another device to proxy the
> port-profile for the pass-thru device.  I had this in the second patch
> submission where I was trying to extend the SR-IOV if_link cmds to included
> port-profile, but that got mired down in the VF discussions.

ok
 
> > I assumed that this was a specific PF in your NIC,  but it now sounds like it
> > could be an internal device that is only visible in your firmware and not
> > exposed
> > as a network interface in Linux, right?
> 
> Yes, that's right.  Without going into implementation details, assume any
> enic has firmware with private mgmt channel to switch to do the equivalent
> of your base device->LLDP->switch.

Ok, now it all makes a *lot* more sense, thanks for the clarification!

For my curiosity, can you point to any documentation about what's actually
going on on the wire? Is it possible or planned to implement the same
protocol in Linux so you can do it with Cisco switches and cheap non-IOV
NICs?

> > Your firmware can obviously find out the right communication channel for
> > a associating a dynamic interface with the switch, but when this is implemnted
> > in software, we cannot generally know that and rely on getting access to the
> > interface that lets us talk to the switch. The information which interface
> > is getting associated however is completely useless to an implementation like
> > this.
> 
> So we're kind of back to where we were with iovnl.  We need to specify both
> devices, the base device that has access to the switch and the target device
> to associate the port-profile with.  Something like:
> 
>    ip port_profile set DEVICE [ base DEVICE ] [ { pre_associate |
>                                                   pre_associate_rr } ]
>                               { name PORT-PROFILE | vsi MGR:VTID:VER }
>                               mac LLADDR
>                               [ vlan VID ]
>                               [ host_uuid HOST_UUID ]
>                               [ client_uuid CLIENT_UUID ]
>                               [ client_name CLIENT_NAME ]
>    ip port_profile del DEVICE [ base DEVICE ] [ mac LLADDR [ vlan VID ] ]
>    ip port_profile show DEVICE [ base DEVICE ]
> 
> The netdev ops are (when netlink msg handled in kernel):
> 
>     ndo_set_port_profile(netdev *target, ...)
>     ndo_get_port_profile(netdev *target, ...)
>     ndo_del_port_profile(netdev *target, ...)
> 
> Base device is optional.  If base device is not given, then target device
> gets netdev ops.  If base device is given, then base device gets netdev ops
> and *target refers to target device.  This covers the following cases:

A bit more complicated than I had hoped for, but it sounds like the only
option that covers all corner cases so far.

> 1. Current enic where base == target since target can communicate directly
> with switch to associate port-profile.  This will not work for the enic
> pass-thru case as noted earlier.  We get:
> 
>     ip port_profile set eth0 name joes-garage ...
>
> And
> 
>     eth0:ndo_set_port_profile(NULL, ...)

Yes.
 
> 2. Future enic for pass-thru case where base != target.  We get:
> 
>     ip port_profile set eth1 base eth0 name joes-garage ...
> 
> And
> 
>     eth0:ndi_set_port_profile(eth1, ...)

Is eth1 the static device and eth0 the dynamic device in this scenario
or the other way round?

Wouldn't you still require access to both devices from the host root
network namespace here or do you just ignore the identifier for the
dynamic device here?
 
> 3. Future VEPA, we get:
> 
>     ip port_profile set eth11 base eth10 vsi 1:23456:7
> 
> And (here netlink msg handled in user-space):
> 
>     VDP msg sent on eth10 to set port-profile on eth11 using vis tuple
     
Yes. I'd prefer still requiring to pass the mac and vlan addresses in this
case, but seems workable.

> Does this work?  I want to get agreement before coding up patch attempt #4.

Seems ok for all I can see at this point, other than the complexity
that results from doing two network protocols through a single netlink
protocol. Maybe Jens and Chris can comment some more on this.

	Arnd

^ permalink raw reply

* [PATCH] macvtap: add ioctl to modify vnet header size
From: Michael S. Tsirkin @ 2010-04-29 13:51 UTC (permalink / raw)
  To: David S. Miller, Arnd Bergmann, Sridhar Samudrala, Eric Dumazet,
	Michael S. Tsirkin

This adds TUNSETVNETHDRSZ/TUNGETVNETHDRSZ support
to macvtap.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---

This just mirrors the support in tun,
http://lkml.org/lkml/2010/3/17/221
so that we can make vhost mergeable buffers work with macvtap as well.

I plan to merge both patches through vhost tree together
with mergeable buffer support. Comments?

 drivers/net/macvtap.c |   31 +++++++++++++++++++++++++++----
 1 files changed, 27 insertions(+), 4 deletions(-)

diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
index d97e1fd..6451c4b 100644
--- a/drivers/net/macvtap.c
+++ b/drivers/net/macvtap.c
@@ -37,6 +37,7 @@
 struct macvtap_queue {
 	struct sock sk;
 	struct socket sock;
+	int vnet_hdr_sz;
 	struct macvlan_dev *vlan;
 	struct file *file;
 	unsigned int flags;
@@ -280,6 +281,7 @@ static int macvtap_open(struct inode *inode, struct file *file)
 	sock_init_data(&q->sock, &q->sk);
 	q->sk.sk_write_space = macvtap_sock_write_space;
 	q->flags = IFF_VNET_HDR | IFF_NO_PI | IFF_TAP;
+	q->vnet_hdr_sz = sizeof(struct virtio_net_hdr);
 
 	err = macvtap_set_queue(dev, file, q);
 	if (err)
@@ -440,14 +442,14 @@ static ssize_t macvtap_get_user(struct macvtap_queue *q,
 	int vnet_hdr_len = 0;
 
 	if (q->flags & IFF_VNET_HDR) {
-		vnet_hdr_len = sizeof(vnet_hdr);
+		vnet_hdr_len = q->vnet_hdr_sz;
 
 		err = -EINVAL;
 		if ((len -= vnet_hdr_len) < 0)
 			goto err;
 
 		err = memcpy_fromiovecend((void *)&vnet_hdr, iv, 0,
-					   vnet_hdr_len);
+					   sizeof(vnet_hdr));
 		if (err < 0)
 			goto err;
 		if ((vnet_hdr.flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) &&
@@ -529,7 +531,7 @@ static ssize_t macvtap_put_user(struct macvtap_queue *q,
 
 	if (q->flags & IFF_VNET_HDR) {
 		struct virtio_net_hdr vnet_hdr;
-		vnet_hdr_len = sizeof (vnet_hdr);
+		vnet_hdr_len = q->vnet_hdr_sz;
 		if ((len -= vnet_hdr_len) < 0)
 			return -EINVAL;
 
@@ -537,7 +539,7 @@ static ssize_t macvtap_put_user(struct macvtap_queue *q,
 		if (ret)
 			return ret;
 
-		if (memcpy_toiovecend(iv, (void *)&vnet_hdr, 0, vnet_hdr_len))
+		if (memcpy_toiovecend(iv, (void *)&vnet_hdr, 0, sizeof(vnet_hdr)))
 			return -EFAULT;
 	}
 
@@ -622,6 +624,8 @@ static long macvtap_ioctl(struct file *file, unsigned int cmd,
 	struct ifreq __user *ifr = argp;
 	unsigned int __user *up = argp;
 	unsigned int u;
+	int __user *sp = argp;
+	int s;
 	int ret;
 
 	switch (cmd) {
@@ -667,6 +671,25 @@ static long macvtap_ioctl(struct file *file, unsigned int cmd,
 		q->sk.sk_sndbuf = u;
 		return 0;
 
+	case TUNGETVNETHDRSZ:
+		s = q->vnet_hdr_sz;
+		if (put_user(s, sp))
+			ret = -EFAULT;
+		break;
+
+	case TUNSETVNETHDRSZ:
+		if (get_user(s, sp)) {
+			ret = -EFAULT;
+			break;
+		}
+		if (s < (int)sizeof(struct virtio_net_hdr)) {
+			ret = -EINVAL;
+			break;
+		}
+
+		q->vnet_hdr_sz = s;
+		break;
+
 	case TUNSETOFFLOAD:
 		/* let the user check for future flags */
 		if (arg & ~(TUN_F_CSUM | TUN_F_TSO4 | TUN_F_TSO6 |
-- 
1.7.1.rc1.22.g3163

^ permalink raw reply related

* [Patch v9 0/3] net: reserve ports for applications using fixed port numbers
From: Amerigo Wang @ 2010-04-30  8:25 UTC (permalink / raw)
  To: linux-kernel
  Cc: Octavian Purdila, ebiederm, Eric Dumazet, penguin-kernel, netdev,
	Neil Horman, Amerigo Wang, adobriyan, David Miller

Changes from the previous version:
- Dropped the infiniband part, because Tetsuo modified the related code,
  I will send a separate patch for it once this is accepted.
- Fixed some '\0' issues introduced by the previous version.
- Use copy_from_user(), instead of get_user().
- Use memchr().

------------------>

This patch introduces /proc/sys/net/ipv4/ip_local_reserved_ports which
allows users to reserve ports for third-party applications.

The reserved ports will not be used by automatic port assignments
(e.g. when calling connect() or bind() with port number 0). Explicit
port allocation behavior is unchanged.

There are still some miss behaviors with regard to proc parsing in odd
invalid cases (for "40000\0-40001" all is acknowledged but only 40000
is accepted) but they are not easy to fix without changing the current
"acknowledge how much we accepted" behavior.

Because of that and because the same issues are present in the
existing proc_dointvec code as well I don't think its worth holding
the actual feature (port reservation) after such petty error recovery
issues.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox