Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH net-2.6 2/2] be2net: remove netif_stop_queue being called before register_netdev.
From: David Miller @ 2011-02-01 23:42 UTC (permalink / raw)
  To: ajit.khaparde; +Cc: netdev
In-Reply-To: <20110131232755.GA4691@akhaparde-VBox>

From: Ajit Khaparde <ajit.khaparde@emulex.com>
Date: Mon, 31 Jan 2011 17:27:55 -0600

> It is illegal to call netif_stop_queue before register_netdev.
> 
> Signed-off-by: Ajit Khaparde <ajit.khaparde@emulex.com>

Applied, thanks.

^ permalink raw reply

* Re: [PATCH] ipv4: Remove fib_hash.
From: Eric Dumazet @ 2011-02-02  0:48 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20110201.151941.48509941.davem@davemloft.net>

Le mardi 01 février 2011 à 15:19 -0800, David Miller a écrit :
> The time has finally come to remove the hash based routing table
> implementation in ipv4.
> 
> FIB Trie is mature, well tested, and I've done an audit of it's code
> to confirm that it implements insert, delete, and lookup with the same
> identical semantics as fib_hash did.
> 
> If there are any semantic differences found in fib_trie, we should
> simply fix them.
> 
> I've placed the trie statistic config option under advanced router
> configuration.
> 
> Signed-off-by: David S. Miller <davem@davemloft.net>
> ---

Hmm... I know having to maintain two implementations is time consuming,
but I know fib_trie is bigger :

# size net/ipv4/fib_*.o
   text	   data	    bss	    dec	    hex	filename
   7252	    120	      0	   7372	   1ccc	net/ipv4/fib_frontend.o
   7279	     16	      4	   7299	   1c83	net/ipv4/fib_hash.o
   1479	      0	      0	   1479	    5c7	net/ipv4/fib_rules.o
   7885	      0	   2080	   9965	   26ed	net/ipv4/fib_semantics.o
  16222	     16	     16	  16254	   3f7e	net/ipv4/fib_trie.o

In my tests, I know that fib_trie is more expensive for typical routing
tables for hosts (no more than a dozen or entries), in latencies
results, mostly because of icache misses, but also dcache ones.

Thanks



^ permalink raw reply

* Re: [PATCH V10 12/15] ptp: Added a brand new class driver for ptp clocks.
From: John Stultz @ 2011-02-02  2:00 UTC (permalink / raw)
  To: Richard Cochran
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
	Alan Cox, Arnd Bergmann, Christoph Lameter, David Miller,
	Krzysztof Halasa, Peter Zijlstra, Rodolfo Giometti,
	Thomas Gleixner, Benjamin Herrenschmidt, H. Peter Anvin,
	Ingo Molnar, Mike Frysinger, Paul Mackerras, Russell King
In-Reply-To: <ada07b1d09b1e8eb717c5203e64856d2eaad7456.1296124770.git.richard.cochran-3mrvs1K0uXizZXS1Dc/lvw@public.gmane.org>

On Thu, 2011-01-27 at 11:59 +0100, Richard Cochran wrote:
> This patch adds an infrastructure for hardware clocks that implement
> IEEE 1588, the Precision Time Protocol (PTP). A class driver offers a
> registration method to particular hardware clock drivers. Each clock is
> presented as a standard POSIX clock.
> 
> The ancillary clock features are exposed in two different ways, via
> the sysfs and by a character device.
> 
> Signed-off-by: Richard Cochran <richard.cochran-3mrvs1K0uXizZXS1Dc/lvw@public.gmane.org>

Few issues below.


> diff --git a/drivers/ptp/Kconfig b/drivers/ptp/Kconfig
> new file mode 100644
> index 0000000..17be208
> --- /dev/null
> +++ b/drivers/ptp/Kconfig
> @@ -0,0 +1,27 @@
> +#
> +# PTP clock support configuration
> +#
> +
> +menu "PTP clock support"
> +
> +config PTP_1588_CLOCK
> +	tristate "PTP clock support"
> +	depends on EXPERIMENTAL
> +	depends on PPS
> +	help
> +	  The IEEE 1588 standard defines a method to precisely
> +	  synchronize distributed clocks over Ethernet networks. The
> +	  standard defines a Precision Time Protocol (PTP), which can
> +	  be used to achieve synchronization within a few dozen
> +	  microseconds. In addition, with the help of special hardware
> +	  time stamping units, it can be possible to achieve
> +	  synchronization to within a few hundred nanoseconds.
> +
> +	  This driver adds support for PTP clocks as character
> +	  devices. If you want to use a PTP clock, then you should
> +	  also enable at least one clock driver as well.
> +
> +	  To compile this driver as a module, choose M here: the module
> +	  will be called ptp.
> +
> +endmenu

You may want to tweak the kconfig a bit here. If I don't have pps
enabled, if I go into the "PTP clock support" page, I get an empty
screen.

Similarly, its not very discoverable to figure out what you need to
enable to get the driver options to show up, as they depend the drivers
enabled in the networking device section.




> diff --git a/drivers/ptp/ptp_clock.c b/drivers/ptp/ptp_clock.c
> new file mode 100644
> index 0000000..cd2b5e5
> --- /dev/null
> +++ b/drivers/ptp/ptp_clock.c
> @@ -0,0 +1,315 @@
> +/*
> + * PTP 1588 clock support
> + *
> + * Copyright (C) 2010 OMICRON electronics GmbH
> + *
> + *  This program is free software; you can redistribute it and/or modify
> + *  it under the terms of the GNU General Public License as published by
> + *  the Free Software Foundation; either version 2 of the License, or
> + *  (at your option) any later version.
> + *
> + *  This program is distributed in the hope that it will be useful,
> + *  but WITHOUT ANY WARRANTY; without even the implied warranty of
> + *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + *  GNU General Public License for more details.
> + *
> + *  You should have received a copy of the GNU General Public License
> + *  along with this program; if not, write to the Free Software
> + *  Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
> + */
> +#include <linux/bitops.h>
> +#include <linux/device.h>
> +#include <linux/err.h>
> +#include <linux/init.h>
> +#include <linux/kernel.h>
> +#include <linux/module.h>
> +#include <linux/posix-clock.h>
> +#include <linux/pps_kernel.h>
> +#include <linux/slab.h>
> +#include <linux/syscalls.h>
> +#include <linux/uaccess.h>
> +
> +#include "ptp_private.h"
> +
> +#define PTP_MAX_ALARMS 4
> +#define PTP_MAX_CLOCKS (MAX_CLOCKS/2)

Why MAX_CLOCKS/2 ? Should this scale as MAX_CLOCKS is increased?
Or do you really just mean 8?

> +#define PTP_PPS_DEFAULTS (PPS_CAPTUREASSERT | PPS_OFFSETASSERT)
> +#define PTP_PPS_EVENT PPS_CAPTUREASSERT
> +#define PTP_PPS_MODE (PTP_PPS_DEFAULTS | PPS_CANWAIT | PPS_TSFMT_TSPEC)
> +
> +/* private globals */
> +
> +static dev_t ptp_devt;
> +static struct class *ptp_class;
> +
> +static DECLARE_BITMAP(ptp_clocks_map, PTP_MAX_CLOCKS);
> +static DEFINE_MUTEX(ptp_clocks_mutex); /* protects 'ptp_clocks_map' */
> +
> +/* time stamp event queue operations */
> +
> +static inline int queue_free(struct timestamp_event_queue *q)
> +{
> +	return PTP_MAX_TIMESTAMPS - queue_cnt(q) - 1;
> +}
> +
> +static void enqueue_external_timestamp(struct timestamp_event_queue *queue,
> +				       struct ptp_clock_event *src)
> +{
> +	struct ptp_extts_event *dst;
> +	u32 remainder;
> +
> +	dst = &queue->buf[queue->tail];
> +
> +	dst->index = src->index;
> +	dst->t.sec = div_u64_rem(src->timestamp, 1000000000, &remainder);
> +	dst->t.nsec = remainder;
> +
> +	if (!queue_free(queue))
> +		queue->overflow++;
> +
> +	queue->tail = (queue->tail + 1) % PTP_MAX_TIMESTAMPS;
> +}

So what is serializing access to the timestamp_event_queue here? I don't
see any usage of tsevq_mux by the callers. Am I missing it? It looks
like its called from interrupt context, so do you really need a spinlock
and not a mutex here?


> diff --git a/drivers/ptp/ptp_private.h b/drivers/ptp/ptp_private.h
> new file mode 100644
> index 0000000..941de61
> --- /dev/null
> +++ b/drivers/ptp/ptp_private.h
> @@ -0,0 +1,85 @@
> +/*
> + * PTP 1588 clock support - private declarations for the core module.
> + *
> + * Copyright (C) 2010 OMICRON electronics GmbH
> + *
> + *  This program is free software; you can redistribute it and/or modify
> + *  it under the terms of the GNU General Public License as published by
> + *  the Free Software Foundation; either version 2 of the License, or
> + *  (at your option) any later version.
> + *
> + *  This program is distributed in the hope that it will be useful,
> + *  but WITHOUT ANY WARRANTY; without even the implied warranty of
> + *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + *  GNU General Public License for more details.
> + *
> + *  You should have received a copy of the GNU General Public License
> + *  along with this program; if not, write to the Free Software
> + *  Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
> + */
> +#ifndef _PTP_PRIVATE_H_
> +#define _PTP_PRIVATE_H_
> +
> +#include <linux/cdev.h>
> +#include <linux/device.h>
> +#include <linux/mutex.h>
> +#include <linux/posix-clock.h>
> +#include <linux/ptp_clock.h>
> +#include <linux/ptp_clock_kernel.h>
> +#include <linux/time.h>
> +
> +#define PTP_MAX_TIMESTAMPS 128
> +
> +struct timestamp_event_queue {
> +	struct ptp_extts_event buf[PTP_MAX_TIMESTAMPS];
> +	int head;
> +	int tail;
> +	int overflow;
> +};
> +
> +struct ptp_clock {
> +	struct posix_clock clock;
> +	struct device *dev;
> +	struct ptp_clock_info *info;
> +	dev_t devid;
> +	int index; /* index into clocks.map */
> +	struct pps_device *pps_source;
> +	struct timestamp_event_queue tsevq; /* simple fifo for time stamps */
> +	struct mutex tsevq_mux; /* one process at a time reading the fifo */
> +	wait_queue_head_t tsev_wq;
> +};
> +
> +static inline int queue_cnt(struct timestamp_event_queue *q)
> +{
> +	int cnt = q->tail - q->head;
> +	return cnt < 0 ? PTP_MAX_TIMESTAMPS + cnt : cnt;
> +}

This probably needs a comment as to its locking rules. Something like
"Callers must hold tsevq_mux."



> diff --git a/include/linux/ptp_clock.h b/include/linux/ptp_clock.h
> new file mode 100644
> index 0000000..b4ef2cb
> --- /dev/null
> +++ b/include/linux/ptp_clock.h
> @@ -0,0 +1,79 @@
> +/*
> + * PTP 1588 clock support - user space interface
> + *
> + * Copyright (C) 2010 OMICRON electronics GmbH
> + *
> + *  This program is free software; you can redistribute it and/or modify
> + *  it under the terms of the GNU General Public License as published by
> + *  the Free Software Foundation; either version 2 of the License, or
> + *  (at your option) any later version.
> + *
> + *  This program is distributed in the hope that it will be useful,
> + *  but WITHOUT ANY WARRANTY; without even the implied warranty of
> + *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + *  GNU General Public License for more details.
> + *
> + *  You should have received a copy of the GNU General Public License
> + *  along with this program; if not, write to the Free Software
> + *  Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
> + */
> +
> +#ifndef _PTP_CLOCK_H_
> +#define _PTP_CLOCK_H_
> +
> +#include <linux/ioctl.h>
> +#include <linux/types.h>
> +
> +/* PTP_xxx bits, for the flags field within the request structures. */
> +#define PTP_ENABLE_FEATURE (1<<0)
> +#define PTP_RISING_EDGE    (1<<1)
> +#define PTP_FALLING_EDGE   (1<<2)
> +
> +/*
> + * struct ptp_clock_time - represents a time value
> + *
> + * The sign of the seconds field applies to the whole value. The
> + * nanoseconds field is always unsigned. The reserved field is
> + * included for sub-nanosecond resolution, should the demand for
> + * this ever appear.
> + *
> + */
> +struct ptp_clock_time {
> +	__s64 sec;  /* seconds */
> +	__u32 nsec; /* nanoseconds */
> +	__u32 reserved;
> +};
> +
> +struct ptp_clock_caps {
> +	int max_adj;   /* Maximum frequency adjustment in parts per billon. */
> +	int n_alarm;   /* Number of programmable alarms. */
> +	int n_ext_ts;  /* Number of external time stamp channels. */
> +	int n_per_out; /* Number of programmable periodic signals. */
> +	int pps;       /* Whether the clock supports a PPS callback. */
> +};
> +
> +struct ptp_extts_request {
> +	unsigned int index; /* Which channel to configure. */
> +	unsigned int flags; /* Bit field for PTP_xxx flags. */
> +};
> +
> +struct ptp_perout_request {
> +	struct ptp_clock_time start;  /* Absolute start time. */
> +	struct ptp_clock_time period; /* Desired period, zero means disable. */
> +	unsigned int index;           /* Which channel to configure. */
> +	unsigned int flags;           /* Reserved for future use. */
> +};

Since these are all new API/ABI structures, would it be wise to pad
these out a bit more? You've got a couple of reserved fields, which is
good, but if you think we're going to expand this at all, we may want to
have a bit more wiggle room. The timex structure had something like 12
unused ints (which came in handy when the tai field was added).

Not sure what the wider opinion is on this though.


> +#define PTP_CLK_MAGIC '='
> +
> +#define PTP_CLOCK_GETCAPS  _IOR(PTP_CLK_MAGIC, 1, struct ptp_clock_caps)
> +#define PTP_EXTTS_REQUEST  _IOW(PTP_CLK_MAGIC, 2, struct ptp_extts_request)
> +#define PTP_PEROUT_REQUEST _IOW(PTP_CLK_MAGIC, 3, struct ptp_perout_request)
> +#define PTP_ENABLE_PPS     _IOW(PTP_CLK_MAGIC, 4, int)
> +
> +struct ptp_extts_event {
> +	struct ptp_clock_time t; /* Time event occured. */
> +	unsigned int index;      /* Which channel produced the event. */
> +};

Same padding suggestion for this as well.


thanks
-john

^ permalink raw reply

* Re: [PATCH V10 13/15] ptp: Added a clock that uses the eTSEC found on the MPC85xx.
From: john stultz @ 2011-02-02  2:04 UTC (permalink / raw)
  To: Richard Cochran
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
	Alan Cox, Arnd Bergmann, Christoph Lameter, David Miller,
	Krzysztof Halasa, Peter Zijlstra, Rodolfo Giometti,
	Thomas Gleixner, Benjamin Herrenschmidt, H. Peter Anvin,
	Ingo Molnar, Mike Frysinger, Paul Mackerras, Russell King
In-Reply-To: <606a5342baa53c6aa30b3bc374d5ef2d50ddfc0e.1296124770.git.richard.cochran-3mrvs1K0uXizZXS1Dc/lvw@public.gmane.org>

On Thu, 2011-01-27 at 11:59 +0100, Richard Cochran wrote:
> The eTSEC includes a PTP clock with quite a few features. This patch adds
> support for the basic clock adjustment functions, plus two external time
> stamps, one alarm, and the PPS callback.
> 
> Signed-off-by: Richard Cochran <richard.cochran-3mrvs1K0uXizZXS1Dc/lvw@public.gmane.org>

This looks ok to me.

Acked-by: John Stultz <johnstul-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>

> ---
>  Documentation/powerpc/dts-bindings/fsl/tsec.txt |   57 +++
>  arch/powerpc/boot/dts/mpc8313erdb.dts           |   14 +
>  arch/powerpc/boot/dts/mpc8572ds.dts             |   14 +
>  arch/powerpc/boot/dts/p2020ds.dts               |   14 +
>  arch/powerpc/boot/dts/p2020rdb.dts              |   14 +
>  drivers/net/Makefile                            |    1 +
>  drivers/net/gianfar_ptp.c                       |  448 +++++++++++++++++++++++
>  drivers/net/gianfar_ptp_reg.h                   |  113 ++++++
>  drivers/ptp/Kconfig                             |   13 +
>  9 files changed, 688 insertions(+), 0 deletions(-)
>  create mode 100644 drivers/net/gianfar_ptp.c
>  create mode 100644 drivers/net/gianfar_ptp_reg.h
> 
> diff --git a/Documentation/powerpc/dts-bindings/fsl/tsec.txt b/Documentation/powerpc/dts-bindings/fsl/tsec.txt
> index edb7ae1..f6edbb8 100644
> --- a/Documentation/powerpc/dts-bindings/fsl/tsec.txt
> +++ b/Documentation/powerpc/dts-bindings/fsl/tsec.txt
> @@ -74,3 +74,60 @@ Example:
>  		interrupt-parent = <&mpic>;
>  		phy-handle = <&phy0>
>  	};
> +
> +* Gianfar PTP clock nodes
> +
> +General Properties:
> +
> +  - compatible   Should be "fsl,etsec-ptp"
> +  - reg          Offset and length of the register set for the device
> +  - interrupts   There should be at least two interrupts. Some devices
> +                 have as many as four PTP related interrupts.
> +
> +Clock Properties:
> +
> +  - tclk-period  Timer reference clock period in nanoseconds.
> +  - tmr-prsc     Prescaler, divides the output clock.
> +  - tmr-add      Frequency compensation value.
> +  - cksel        0= external clock, 1= eTSEC system clock, 3= RTC clock input.
> +                 Currently the driver only supports choice "1".
> +  - tmr-fiper1   Fixed interval period pulse generator.
> +  - tmr-fiper2   Fixed interval period pulse generator.
> +  - max-adj      Maximum frequency adjustment in parts per billion.
> +
> +  These properties set the operational parameters for the PTP
> +  clock. You must choose these carefully for the clock to work right.
> +  Here is how to figure good values:
> +
> +  TimerOsc     = system clock               MHz
> +  tclk_period  = desired clock period       nanoseconds
> +  NominalFreq  = 1000 / tclk_period         MHz
> +  FreqDivRatio = TimerOsc / NominalFreq     (must be greater that 1.0)
> +  tmr_add      = ceil(2^32 / FreqDivRatio)
> +  OutputClock  = NominalFreq / tmr_prsc     MHz
> +  PulseWidth   = 1 / OutputClock            microseconds
> +  FiperFreq1   = desired frequency in Hz
> +  FiperDiv1    = 1000000 * OutputClock / FiperFreq1
> +  tmr_fiper1   = tmr_prsc * tclk_period * FiperDiv1 - tclk_period
> +  max_adj      = 1000000000 * (FreqDivRatio - 1.0) - 1
> +
> +  The calculation for tmr_fiper2 is the same as for tmr_fiper1. The
> +  driver expects that tmr_fiper1 will be correctly set to produce a 1
> +  Pulse Per Second (PPS) signal, since this will be offered to the PPS
> +  subsystem to synchronize the Linux clock.
> +
> +Example:
> +
> +	ptp_clock@24E00 {
> +		compatible = "fsl,etsec-ptp";
> +		reg = <0x24E00 0xB0>;
> +		interrupts = <12 0x8 13 0x8>;
> +		interrupt-parent = < &ipic >;
> +		tclk-period = <10>;
> +		tmr-prsc    = <100>;
> +		tmr-add     = <0x999999A4>;
> +		cksel       = <0x1>;
> +		tmr-fiper1  = <0x3B9AC9F6>;
> +		tmr-fiper2  = <0x00018696>;
> +		max-adj     = <659999998>;
> +	};
> diff --git a/arch/powerpc/boot/dts/mpc8313erdb.dts b/arch/powerpc/boot/dts/mpc8313erdb.dts
> index 183f2aa..85a7eaa 100644
> --- a/arch/powerpc/boot/dts/mpc8313erdb.dts
> +++ b/arch/powerpc/boot/dts/mpc8313erdb.dts
> @@ -208,6 +208,20 @@
>  			sleep = <&pmc 0x00300000>;
>  		};
> 
> +		ptp_clock@24E00 {
> +			compatible = "fsl,etsec-ptp";
> +			reg = <0x24E00 0xB0>;
> +			interrupts = <12 0x8 13 0x8>;
> +			interrupt-parent = < &ipic >;
> +			tclk-period = <10>;
> +			tmr-prsc    = <100>;
> +			tmr-add     = <0x999999A4>;
> +			cksel       = <0x1>;
> +			tmr-fiper1  = <0x3B9AC9F6>;
> +			tmr-fiper2  = <0x00018696>;
> +			max-adj     = <659999998>;
> +		};
> +
>  		enet0: ethernet@24000 {
>  			#address-cells = <1>;
>  			#size-cells = <1>;
> diff --git a/arch/powerpc/boot/dts/mpc8572ds.dts b/arch/powerpc/boot/dts/mpc8572ds.dts
> index cafc128..74208cd 100644
> --- a/arch/powerpc/boot/dts/mpc8572ds.dts
> +++ b/arch/powerpc/boot/dts/mpc8572ds.dts
> @@ -324,6 +324,20 @@
>  			};
>  		};
> 
> +		ptp_clock@24E00 {
> +			compatible = "fsl,etsec-ptp";
> +			reg = <0x24E00 0xB0>;
> +			interrupts = <68 2 69 2 70 2 71 2>;
> +			interrupt-parent = < &mpic >;
> +			tclk-period = <5>;
> +			tmr-prsc = <200>;
> +			tmr-add = <0xAAAAAAAB>;
> +			cksel = <1>;
> +			tmr-fiper1 = <0x3B9AC9FB>;
> +			tmr-fiper2 = <0x3B9AC9FB>;
> +			max-adj = <499999999>;
> +		};
> +
>  		enet0: ethernet@24000 {
>  			#address-cells = <1>;
>  			#size-cells = <1>;
> diff --git a/arch/powerpc/boot/dts/p2020ds.dts b/arch/powerpc/boot/dts/p2020ds.dts
> index 1101914..39d73bb 100644
> --- a/arch/powerpc/boot/dts/p2020ds.dts
> +++ b/arch/powerpc/boot/dts/p2020ds.dts
> @@ -336,6 +336,20 @@
>  			phy_type = "ulpi";
>  		};
> 
> +		ptp_clock@24E00 {
> +			compatible = "fsl,etsec-ptp";
> +			reg = <0x24E00 0xB0>;
> +			interrupts = <68 2 69 2 70 2>;
> +			interrupt-parent = < &mpic >;
> +			tclk-period = <5>;
> +			tmr-prsc = <200>;
> +			tmr-add = <0xCCCCCCCD>;
> +			cksel = <1>;
> +			tmr-fiper1 = <0x3B9AC9FB>;
> +			tmr-fiper2 = <0x0001869B>;
> +			max-adj = <249999999>;
> +		};
> +
>  		enet0: ethernet@24000 {
>  			#address-cells = <1>;
>  			#size-cells = <1>;
> diff --git a/arch/powerpc/boot/dts/p2020rdb.dts b/arch/powerpc/boot/dts/p2020rdb.dts
> index da4cb0d..5498fb9 100644
> --- a/arch/powerpc/boot/dts/p2020rdb.dts
> +++ b/arch/powerpc/boot/dts/p2020rdb.dts
> @@ -396,6 +396,20 @@
>  			phy_type = "ulpi";
>  		};
> 
> +		ptp_clock@24E00 {
> +			compatible = "fsl,etsec-ptp";
> +			reg = <0x24E00 0xB0>;
> +			interrupts = <68 2 69 2 70 2>;
> +			interrupt-parent = < &mpic >;
> +			tclk-period = <5>;
> +			tmr-prsc = <200>;
> +			tmr-add = <0xCCCCCCCD>;
> +			cksel = <1>;
> +			tmr-fiper1 = <0x3B9AC9FB>;
> +			tmr-fiper2 = <0x0001869B>;
> +			max-adj = <249999999>;
> +		};
> +
>  		enet0: ethernet@24000 {
>  			#address-cells = <1>;
>  			#size-cells = <1>;
> diff --git a/drivers/net/Makefile b/drivers/net/Makefile
> index b90738d..c303f5f 100644
> --- a/drivers/net/Makefile
> +++ b/drivers/net/Makefile
> @@ -31,6 +31,7 @@ obj-$(CONFIG_ATL2) += atlx/
>  obj-$(CONFIG_ATL1E) += atl1e/
>  obj-$(CONFIG_ATL1C) += atl1c/
>  obj-$(CONFIG_GIANFAR) += gianfar_driver.o
> +obj-$(CONFIG_PTP_1588_CLOCK_GIANFAR) += gianfar_ptp.o
>  obj-$(CONFIG_TEHUTI) += tehuti.o
>  obj-$(CONFIG_ENIC) += enic/
>  obj-$(CONFIG_JME) += jme.o
> diff --git a/drivers/net/gianfar_ptp.c b/drivers/net/gianfar_ptp.c
> new file mode 100644
> index 0000000..84fff15
> --- /dev/null
> +++ b/drivers/net/gianfar_ptp.c
> @@ -0,0 +1,448 @@
> +/*
> + * PTP 1588 clock using the eTSEC
> + *
> + * Copyright (C) 2010 OMICRON electronics GmbH
> + *
> + *  This program is free software; you can redistribute it and/or modify
> + *  it under the terms of the GNU General Public License as published by
> + *  the Free Software Foundation; either version 2 of the License, or
> + *  (at your option) any later version.
> + *
> + *  This program is distributed in the hope that it will be useful,
> + *  but WITHOUT ANY WARRANTY; without even the implied warranty of
> + *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + *  GNU General Public License for more details.
> + *
> + *  You should have received a copy of the GNU General Public License
> + *  along with this program; if not, write to the Free Software
> + *  Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
> + */
> +#include <linux/device.h>
> +#include <linux/hrtimer.h>
> +#include <linux/init.h>
> +#include <linux/interrupt.h>
> +#include <linux/kernel.h>
> +#include <linux/module.h>
> +#include <linux/of.h>
> +#include <linux/of_platform.h>
> +#include <linux/timex.h>
> +#include <linux/io.h>
> +
> +#include <linux/ptp_clock_kernel.h>
> +
> +#include "gianfar_ptp_reg.h"
> +#include "gianfar.h"
> +
> +#define DRIVER		"gianfar_ptp"
> +#define N_ALARM		1 /* first alarm is used internally to reset fipers */
> +#define N_EXT_TS	2
> +#define REG_SIZE	sizeof(struct gianfar_ptp_registers)
> +
> +struct etsects {
> +	struct gianfar_ptp_registers *regs;
> +	struct ptp_clock *clock;
> +	struct ptp_clock_info caps;
> +	int irq;
> +	u64 alarm_interval; /* for periodic alarm */
> +	u64 alarm_value;
> +	u32 tclk_period;  /* nanoseconds */
> +	u32 tmr_prsc;
> +	u32 tmr_add;
> +	u32 cksel;
> +	u32 tmr_fiper1;
> +	u32 tmr_fiper2;
> +};
> +
> +/* Private globals */
> +static struct etsects the_clock;
> +DEFINE_SPINLOCK(register_lock);
> +
> +/*
> + * Register access functions
> + */
> +
> +static u64 tmr_cnt_read(struct etsects *etsects)
> +{
> +	u64 ns;
> +	u32 lo, hi;
> +
> +	lo = gfar_read(&etsects->regs->tmr_cnt_l);
> +	hi = gfar_read(&etsects->regs->tmr_cnt_h);
> +	ns = ((u64) hi) << 32;
> +	ns |= lo;
> +	return ns;
> +}
> +
> +static void tmr_cnt_write(struct etsects *etsects, u64 ns)
> +{
> +	u32 hi = ns >> 32;
> +	u32 lo = ns & 0xffffffff;
> +
> +	gfar_write(&etsects->regs->tmr_cnt_l, lo);
> +	gfar_write(&etsects->regs->tmr_cnt_h, hi);
> +}
> +
> +static void set_alarm(struct etsects *etsects)
> +{
> +	u64 ns;
> +	u32 lo, hi;
> +
> +	ns = tmr_cnt_read(etsects) + 1500000000ULL;
> +	ns = div_u64(ns, 1000000000UL) * 1000000000ULL;
> +	ns -= etsects->tclk_period;
> +	hi = ns >> 32;
> +	lo = ns & 0xffffffff;
> +	gfar_write(&etsects->regs->tmr_alarm1_l, lo);
> +	gfar_write(&etsects->regs->tmr_alarm1_h, hi);
> +}
> +
> +static void set_fipers(struct etsects *etsects)
> +{
> +	u32 tmr_ctrl = gfar_read(&etsects->regs->tmr_ctrl);
> +
> +	gfar_write(&etsects->regs->tmr_ctrl,   tmr_ctrl & (~TE));
> +	gfar_write(&etsects->regs->tmr_prsc,   etsects->tmr_prsc);
> +	gfar_write(&etsects->regs->tmr_fiper1, etsects->tmr_fiper1);
> +	gfar_write(&etsects->regs->tmr_fiper2, etsects->tmr_fiper2);
> +	set_alarm(etsects);
> +	gfar_write(&etsects->regs->tmr_ctrl,   tmr_ctrl|TE);
> +}
> +
> +/*
> + * Interrupt service routine
> + */
> +
> +static irqreturn_t isr(int irq, void *priv)
> +{
> +	struct etsects *etsects = priv;
> +	struct ptp_clock_event event;
> +	u64 ns;
> +	u32 ack = 0, lo, hi, mask, val;
> +
> +	val = gfar_read(&etsects->regs->tmr_tevent);
> +
> +	if (val & ETS1) {
> +		ack |= ETS1;
> +		hi = gfar_read(&etsects->regs->tmr_etts1_h);
> +		lo = gfar_read(&etsects->regs->tmr_etts1_l);
> +		event.type = PTP_CLOCK_EXTTS;
> +		event.index = 0;
> +		event.timestamp = ((u64) hi) << 32;
> +		event.timestamp |= lo;
> +		ptp_clock_event(etsects->clock, &event);
> +	}
> +
> +	if (val & ETS2) {
> +		ack |= ETS2;
> +		hi = gfar_read(&etsects->regs->tmr_etts2_h);
> +		lo = gfar_read(&etsects->regs->tmr_etts2_l);
> +		event.type = PTP_CLOCK_EXTTS;
> +		event.index = 1;
> +		event.timestamp = ((u64) hi) << 32;
> +		event.timestamp |= lo;
> +		ptp_clock_event(etsects->clock, &event);
> +	}
> +
> +	if (val & ALM2) {
> +		ack |= ALM2;
> +		if (etsects->alarm_value) {
> +			event.type = PTP_CLOCK_ALARM;
> +			event.index = 0;
> +			event.timestamp = etsects->alarm_value;
> +			ptp_clock_event(etsects->clock, &event);
> +		}
> +		if (etsects->alarm_interval) {
> +			ns = etsects->alarm_value + etsects->alarm_interval;
> +			hi = ns >> 32;
> +			lo = ns & 0xffffffff;
> +			spin_lock(&register_lock);
> +			gfar_write(&etsects->regs->tmr_alarm2_l, lo);
> +			gfar_write(&etsects->regs->tmr_alarm2_h, hi);
> +			spin_unlock(&register_lock);
> +			etsects->alarm_value = ns;
> +		} else {
> +			gfar_write(&etsects->regs->tmr_tevent, ALM2);
> +			spin_lock(&register_lock);
> +			mask = gfar_read(&etsects->regs->tmr_temask);
> +			mask &= ~ALM2EN;
> +			gfar_write(&etsects->regs->tmr_temask, mask);
> +			spin_unlock(&register_lock);
> +			etsects->alarm_value = 0;
> +			etsects->alarm_interval = 0;
> +		}
> +	}
> +
> +	if (val & PP1) {
> +		ack |= PP1;
> +		event.type = PTP_CLOCK_PPS;
> +		ptp_clock_event(etsects->clock, &event);
> +	}
> +
> +	if (ack) {
> +		gfar_write(&etsects->regs->tmr_tevent, ack);
> +		return IRQ_HANDLED;
> +	} else
> +		return IRQ_NONE;
> +}
> +
> +/*
> + * PTP clock operations
> + */
> +
> +static int ptp_gianfar_adjfreq(struct ptp_clock_info *ptp, s32 ppb)
> +{
> +	u64 adj;
> +	u32 diff, tmr_add;
> +	int neg_adj = 0;
> +	struct etsects *etsects = container_of(ptp, struct etsects, caps);
> +
> +	if (ppb < 0) {
> +		neg_adj = 1;
> +		ppb = -ppb;
> +	}
> +	tmr_add = etsects->tmr_add;
> +	adj = tmr_add;
> +	adj *= ppb;
> +	diff = div_u64(adj, 1000000000ULL);
> +
> +	tmr_add = neg_adj ? tmr_add - diff : tmr_add + diff;
> +
> +	gfar_write(&etsects->regs->tmr_add, tmr_add);
> +
> +	return 0;
> +}
> +
> +static int ptp_gianfar_adjtime(struct ptp_clock_info *ptp, s64 delta)
> +{
> +	s64 now;
> +	unsigned long flags;
> +	struct etsects *etsects = container_of(ptp, struct etsects, caps);
> +
> +	spin_lock_irqsave(&register_lock, flags);
> +
> +	now = tmr_cnt_read(etsects);
> +	now += delta;
> +	tmr_cnt_write(etsects, now);
> +
> +	spin_unlock_irqrestore(&register_lock, flags);
> +
> +	set_fipers(etsects);
> +
> +	return 0;
> +}
> +
> +static int ptp_gianfar_gettime(struct ptp_clock_info *ptp, struct timespec *ts)
> +{
> +	u64 ns;
> +	u32 remainder;
> +	unsigned long flags;
> +	struct etsects *etsects = container_of(ptp, struct etsects, caps);
> +
> +	spin_lock_irqsave(&register_lock, flags);
> +
> +	ns = tmr_cnt_read(etsects);
> +
> +	spin_unlock_irqrestore(&register_lock, flags);
> +
> +	ts->tv_sec = div_u64_rem(ns, 1000000000, &remainder);
> +	ts->tv_nsec = remainder;
> +	return 0;
> +}
> +
> +static int ptp_gianfar_settime(struct ptp_clock_info *ptp,
> +			       const struct timespec *ts)
> +{
> +	u64 ns;
> +	unsigned long flags;
> +	struct etsects *etsects = container_of(ptp, struct etsects, caps);
> +
> +	ns = ts->tv_sec * 1000000000ULL;
> +	ns += ts->tv_nsec;
> +
> +	spin_lock_irqsave(&register_lock, flags);
> +
> +	tmr_cnt_write(etsects, ns);
> +	set_fipers(etsects);
> +
> +	spin_unlock_irqrestore(&register_lock, flags);
> +
> +	return 0;
> +}
> +
> +static int ptp_gianfar_enable(struct ptp_clock_info *ptp,
> +			      struct ptp_clock_request *rq, int on)
> +{
> +	struct etsects *etsects = container_of(ptp, struct etsects, caps);
> +	unsigned long flags;
> +	u32 bit, mask;
> +
> +	switch (rq->type) {
> +	case PTP_CLK_REQ_EXTTS:
> +		switch (rq->extts.index) {
> +		case 0:
> +			bit = ETS1EN;
> +			break;
> +		case 1:
> +			bit = ETS2EN;
> +			break;
> +		default:
> +			return -EINVAL;
> +		}
> +		spin_lock_irqsave(&register_lock, flags);
> +		mask = gfar_read(&etsects->regs->tmr_temask);
> +		if (on)
> +			mask |= bit;
> +		else
> +			mask &= ~bit;
> +		gfar_write(&etsects->regs->tmr_temask, mask);
> +		spin_unlock_irqrestore(&register_lock, flags);
> +		return 0;
> +
> +	case PTP_CLK_REQ_PPS:
> +		spin_lock_irqsave(&register_lock, flags);
> +		mask = gfar_read(&etsects->regs->tmr_temask);
> +		if (on)
> +			mask |= PP1EN;
> +		else
> +			mask &= ~PP1EN;
> +		gfar_write(&etsects->regs->tmr_temask, mask);
> +		spin_unlock_irqrestore(&register_lock, flags);
> +		return 0;
> +
> +	default:
> +		break;
> +	}
> +
> +	return -EOPNOTSUPP;
> +}
> +
> +static struct ptp_clock_info ptp_gianfar_caps = {
> +	.owner		= THIS_MODULE,
> +	.name		= "gianfar clock",
> +	.max_adj	= 512000,
> +	.n_alarm	= N_ALARM,
> +	.n_ext_ts	= N_EXT_TS,
> +	.n_per_out	= 0,
> +	.pps		= 1,
> +	.adjfreq	= ptp_gianfar_adjfreq,
> +	.adjtime	= ptp_gianfar_adjtime,
> +	.gettime	= ptp_gianfar_gettime,
> +	.settime	= ptp_gianfar_settime,
> +	.enable		= ptp_gianfar_enable,
> +};
> +
> +/* OF device tree */
> +
> +static int get_of_u32(struct device_node *node, char *str, u32 *val)
> +{
> +	int plen;
> +	const u32 *prop = of_get_property(node, str, &plen);
> +
> +	if (!prop || plen != sizeof(*prop))
> +		return -1;
> +	*val = *prop;
> +	return 0;
> +}
> +
> +static int gianfar_ptp_probe(struct platform_device *dev,
> +			     const struct of_device_id *match)
> +{
> +	struct device_node *node = dev->dev.of_node;
> +	struct etsects *etsects = &the_clock;
> +	struct timespec now;
> +	u32 tmr_ctrl;
> +
> +	etsects->caps = ptp_gianfar_caps;
> +
> +	if (get_of_u32(node, "tclk-period", &etsects->tclk_period) ||
> +	    get_of_u32(node, "tmr-prsc", &etsects->tmr_prsc) ||
> +	    get_of_u32(node, "tmr-add", &etsects->tmr_add) ||
> +	    get_of_u32(node, "cksel", &etsects->cksel) ||
> +	    get_of_u32(node, "tmr-fiper1", &etsects->tmr_fiper1) ||
> +	    get_of_u32(node, "tmr-fiper2", &etsects->tmr_fiper2) ||
> +	    get_of_u32(node, "max-adj", &etsects->caps.max_adj)) {
> +		pr_err("device tree node missing required elements\n");
> +		return -ENODEV;
> +	}
> +
> +	etsects->irq = irq_of_parse_and_map(node, 0);
> +
> +	if (etsects->irq == NO_IRQ) {
> +		pr_err("irq not in device tree\n");
> +		return -ENODEV;
> +	}
> +	if (request_irq(etsects->irq, isr, 0, DRIVER, etsects)) {
> +		pr_err("request_irq failed\n");
> +		return -ENODEV;
> +	}
> +	etsects->regs = of_iomap(node, 0);
> +	if (!etsects->regs) {
> +		pr_err("of_iomap ptp registers failed\n");
> +		return -EINVAL;
> +	}
> +	getnstimeofday(&now);
> +	ptp_gianfar_settime(&etsects->caps, &now);
> +
> +	tmr_ctrl =
> +	  (etsects->tclk_period & TCLK_PERIOD_MASK) << TCLK_PERIOD_SHIFT |
> +	  (etsects->cksel & CKSEL_MASK) << CKSEL_SHIFT;
> +
> +	gfar_write(&etsects->regs->tmr_ctrl,   tmr_ctrl);
> +	gfar_write(&etsects->regs->tmr_add,    etsects->tmr_add);
> +	gfar_write(&etsects->regs->tmr_prsc,   etsects->tmr_prsc);
> +	gfar_write(&etsects->regs->tmr_fiper1, etsects->tmr_fiper1);
> +	gfar_write(&etsects->regs->tmr_fiper2, etsects->tmr_fiper2);
> +	set_alarm(etsects);
> +	gfar_write(&etsects->regs->tmr_ctrl,   tmr_ctrl|FS|RTPE|TE);
> +
> +	etsects->clock = ptp_clock_register(&etsects->caps);
> +
> +	return IS_ERR(etsects->clock) ? PTR_ERR(etsects->clock) : 0;
> +}
> +
> +static int gianfar_ptp_remove(struct platform_device *dev)
> +{
> +	gfar_write(&the_clock.regs->tmr_temask, 0);
> +	gfar_write(&the_clock.regs->tmr_ctrl,   0);
> +
> +	ptp_clock_unregister(the_clock.clock);
> +	free_irq(the_clock.irq, &the_clock);
> +	iounmap(the_clock.regs);
> +
> +	return 0;
> +}
> +
> +static struct of_device_id match_table[] = {
> +	{ .compatible = "fsl,etsec-ptp" },
> +	{},
> +};
> +
> +static struct of_platform_driver gianfar_ptp_driver = {
> +	.driver = {
> +		.name		= "gianfar_ptp",
> +		.of_match_table	= match_table,
> +		.owner		= THIS_MODULE,
> +	},
> +	.probe       = gianfar_ptp_probe,
> +	.remove      = gianfar_ptp_remove,
> +};
> +
> +/* module operations */
> +
> +static int __init ptp_gianfar_init(void)
> +{
> +	return of_register_platform_driver(&gianfar_ptp_driver);
> +}
> +
> +module_init(ptp_gianfar_init);
> +
> +static void __exit ptp_gianfar_exit(void)
> +{
> +	of_unregister_platform_driver(&gianfar_ptp_driver);
> +}
> +
> +module_exit(ptp_gianfar_exit);
> +
> +MODULE_AUTHOR("Richard Cochran <richard.cochran-3mrvs1K0uXizZXS1Dc/lvw@public.gmane.org>");
> +MODULE_DESCRIPTION("PTP clock using the eTSEC");
> +MODULE_LICENSE("GPL");
> diff --git a/drivers/net/gianfar_ptp_reg.h b/drivers/net/gianfar_ptp_reg.h
> new file mode 100644
> index 0000000..95e171f
> --- /dev/null
> +++ b/drivers/net/gianfar_ptp_reg.h
> @@ -0,0 +1,113 @@
> +/* gianfar_ptp_reg.h
> + * Generated by regen.tcl on Thu May 13 01:38:57 PM CEST 2010
> + *
> + * PTP 1588 clock using the gianfar eTSEC
> + *
> + * Copyright (C) 2010 OMICRON electronics GmbH
> + *
> + *  This program is free software; you can redistribute it and/or modify
> + *  it under the terms of the GNU General Public License as published by
> + *  the Free Software Foundation; either version 2 of the License, or
> + *  (at your option) any later version.
> + *
> + *  This program is distributed in the hope that it will be useful,
> + *  but WITHOUT ANY WARRANTY; without even the implied warranty of
> + *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + *  GNU General Public License for more details.
> + *
> + *  You should have received a copy of the GNU General Public License
> + *  along with this program; if not, write to the Free Software
> + *  Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
> + */
> +#ifndef _GIANFAR_PTP_REG_H_
> +#define _GIANFAR_PTP_REG_H_
> +
> +struct gianfar_ptp_registers {
> +	u32 tmr_ctrl;     /* Timer control register */
> +	u32 tmr_tevent;   /* Timestamp event register */
> +	u32 tmr_temask;   /* Timer event mask register */
> +	u32 tmr_pevent;   /* Timestamp event register */
> +	u32 tmr_pemask;   /* Timer event mask register */
> +	u32 tmr_stat;     /* Timestamp status register */
> +	u32 tmr_cnt_h;    /* Timer counter high register */
> +	u32 tmr_cnt_l;    /* Timer counter low register */
> +	u32 tmr_add;      /* Timer drift compensation addend register */
> +	u32 tmr_acc;      /* Timer accumulator register */
> +	u32 tmr_prsc;     /* Timer prescale */
> +	u8  res1[4];
> +	u32 tmroff_h;     /* Timer offset high */
> +	u32 tmroff_l;     /* Timer offset low */
> +	u8  res2[8];
> +	u32 tmr_alarm1_h; /* Timer alarm 1 high register */
> +	u32 tmr_alarm1_l; /* Timer alarm 1 high register */
> +	u32 tmr_alarm2_h; /* Timer alarm 2 high register */
> +	u32 tmr_alarm2_l; /* Timer alarm 2 high register */
> +	u8  res3[48];
> +	u32 tmr_fiper1;   /* Timer fixed period interval */
> +	u32 tmr_fiper2;   /* Timer fixed period interval */
> +	u32 tmr_fiper3;   /* Timer fixed period interval */
> +	u8  res4[20];
> +	u32 tmr_etts1_h;  /* Timestamp of general purpose external trigger */
> +	u32 tmr_etts1_l;  /* Timestamp of general purpose external trigger */
> +	u32 tmr_etts2_h;  /* Timestamp of general purpose external trigger */
> +	u32 tmr_etts2_l;  /* Timestamp of general purpose external trigger */
> +};
> +
> +/* Bit definitions for the TMR_CTRL register */
> +#define ALM1P                 (1<<31) /* Alarm1 output polarity */
> +#define ALM2P                 (1<<30) /* Alarm2 output polarity */
> +#define FS                    (1<<28) /* FIPER start indication */
> +#define PP1L                  (1<<27) /* Fiper1 pulse loopback mode enabled. */
> +#define PP2L                  (1<<26) /* Fiper2 pulse loopback mode enabled. */
> +#define TCLK_PERIOD_SHIFT     (16) /* 1588 timer reference clock period. */
> +#define TCLK_PERIOD_MASK      (0x3ff)
> +#define RTPE                  (1<<15) /* Record Tx Timestamp to PAL Enable. */
> +#define FRD                   (1<<14) /* FIPER Realignment Disable */
> +#define ESFDP                 (1<<11) /* External Tx/Rx SFD Polarity. */
> +#define ESFDE                 (1<<10) /* External Tx/Rx SFD Enable. */
> +#define ETEP2                 (1<<9) /* External trigger 2 edge polarity */
> +#define ETEP1                 (1<<8) /* External trigger 1 edge polarity */
> +#define COPH                  (1<<7) /* Generated clock (TSEC_1588_GCLK) output phase. */
> +#define CIPH                  (1<<6) /* External oscillator input clock phase. */
> +#define TMSR                  (1<<5) /* Timer soft reset. When enabled, it resets all the timer registers and state machines. */
> +#define BYP                   (1<<3) /* Bypass drift compensated clock */
> +#define TE                    (1<<2) /* 1588 timer enable. If not enabled, all the timer registers and state machines are disabled. */
> +#define CKSEL_SHIFT           (0) /* 1588 Timer reference clock source select. */
> +#define CKSEL_MASK            (0x3)
> +
> +/* Bit definitions for the TMR_TEVENT register */
> +#define ETS2                  (1<<25) /* External trigger 2 timestamp sampled */
> +#define ETS1                  (1<<24) /* External trigger 1 timestamp sampled */
> +#define ALM2                  (1<<17) /* Current time equaled alarm time register 2 */
> +#define ALM1                  (1<<16) /* Current time equaled alarm time register 1 */
> +#define PP1                   (1<<7) /* Indicates that a periodic pulse has been generated based on FIPER1 register */
> +#define PP2                   (1<<6) /* Indicates that a periodic pulse has been generated based on FIPER2 register */
> +#define PP3                   (1<<5) /* Indicates that a periodic pulse has been generated based on FIPER3 register */
> +
> +/* Bit definitions for the TMR_TEMASK register */
> +#define ETS2EN                (1<<25) /* External trigger 2 timestamp sample event enable */
> +#define ETS1EN                (1<<24) /* External trigger 1 timestamp sample event enable */
> +#define ALM2EN                (1<<17) /* Timer ALM2 event enable */
> +#define ALM1EN                (1<<16) /* Timer ALM1 event enable */
> +#define PP1EN                 (1<<7) /* Periodic pulse event 1 enable */
> +#define PP2EN                 (1<<6) /* Periodic pulse event 2 enable */
> +
> +/* Bit definitions for the TMR_PEVENT register */
> +#define TXP2                  (1<<9) /* Indicates that a PTP frame has been transmitted and its timestamp is stored in TXTS2 register */
> +#define TXP1                  (1<<8) /* Indicates that a PTP frame has been transmitted and its timestamp is stored in TXTS1 register */
> +#define RXP                   (1<<0) /* Indicates that a PTP frame has been received */
> +
> +/* Bit definitions for the TMR_PEMASK register */
> +#define TXP2EN                (1<<9) /* Transmit PTP packet event 2 enable */
> +#define TXP1EN                (1<<8) /* Transmit PTP packet event 1 enable */
> +#define RXPEN                 (1<<0) /* Receive PTP packet event enable */
> +
> +/* Bit definitions for the TMR_STAT register */
> +#define STAT_VEC_SHIFT        (0) /* Timer general purpose status vector */
> +#define STAT_VEC_MASK         (0x3f)
> +
> +/* Bit definitions for the TMR_PRSC register */
> +#define PRSC_OCK_SHIFT        (0) /* Output clock division/prescale factor. */
> +#define PRSC_OCK_MASK         (0xffff)
> +
> +#endif
> diff --git a/drivers/ptp/Kconfig b/drivers/ptp/Kconfig
> index 17be208..a4df298 100644
> --- a/drivers/ptp/Kconfig
> +++ b/drivers/ptp/Kconfig
> @@ -24,4 +24,17 @@ config PTP_1588_CLOCK
>  	  To compile this driver as a module, choose M here: the module
>  	  will be called ptp.
> 
> +config PTP_1588_CLOCK_GIANFAR
> +	tristate "Freescale eTSEC as PTP clock"
> +	depends on PTP_1588_CLOCK
> +	depends on GIANFAR
> +	help
> +	  This driver adds support for using the eTSEC as a PTP
> +	  clock. This clock is only useful if your PTP programs are
> +	  getting hardware time stamps on the PTP Ethernet packets
> +	  using the SO_TIMESTAMPING API.
> +
> +	  To compile this driver as a module, choose M here: the module
> +	  will be called gianfar_ptp.
> +
>  endmenu

^ permalink raw reply

* Re: [PATCH V10 14/15] ptp: Added a clock driver for the IXP46x.
From: john stultz @ 2011-02-02  2:10 UTC (permalink / raw)
  To: Richard Cochran
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
	Alan Cox, Arnd Bergmann, Christoph Lameter, David Miller,
	Krzysztof Halasa, Peter Zijlstra, Rodolfo Giometti,
	Thomas Gleixner, Benjamin Herrenschmidt, H. Peter Anvin,
	Ingo Molnar, Mike Frysinger, Paul Mackerras, Russell King
In-Reply-To: <0319138702479bc484596a2b10c681a7d8313b98.1296124770.git.richard.cochran-3mrvs1K0uXizZXS1Dc/lvw@public.gmane.org>

On Thu, 2011-01-27 at 12:00 +0100, Richard Cochran wrote:
> This patch adds a driver for the hardware time stamping unit found on the
> IXP465. The basic clock operations and an external trigger are implemented.
> 
> Signed-off-by: Richard Cochran <richard.cochran-3mrvs1K0uXizZXS1Dc/lvw@public.gmane.org>

Minor nit below, but otherwise,
Acked-by: John Stultz <johnstul-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>

> diff --git a/drivers/ptp/ptp_ixp46x.c b/drivers/ptp/ptp_ixp46x.c
> new file mode 100644
> index 0000000..c860030
> --- /dev/null
> +++ b/drivers/ptp/ptp_ixp46x.c
[snip]
> +static int __init ptp_ixp_init(void)
> +{
> +	if (!cpu_is_ixp46x())
> +		return -ENODEV;
> +
> +	ixp_clock.regs = 
> +		(struct ixp46x_ts_regs __iomem *)IXP4XX_TIMESYNC_BASE_VIRT;

Minor whitespace issue here

^ permalink raw reply

* Re: [PATCH V10 15/15] ptp: Added a clock driver for the National Semiconductor PHYTER.
From: john stultz @ 2011-02-02  2:11 UTC (permalink / raw)
  To: Richard Cochran
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
	Alan Cox, Arnd Bergmann, Christoph Lameter, David Miller,
	Krzysztof Halasa, Peter Zijlstra, Rodolfo Giometti,
	Thomas Gleixner, Benjamin Herrenschmidt, H. Peter Anvin,
	Ingo Molnar, Mike Frysinger, Paul Mackerras, Russell King
In-Reply-To: <9a8c774950f1a78259a49681ee2fe7977ace44d6.1296124770.git.richard.cochran-3mrvs1K0uXizZXS1Dc/lvw@public.gmane.org>

On Thu, 2011-01-27 at 12:00 +0100, Richard Cochran wrote:
> This patch adds support for the PTP clock found on the DP83640.
> The basic clock operations and one external time stamp have
> been implemented.
> 
> Signed-off-by: Richard Cochran <richard.cochran-3mrvs1K0uXizZXS1Dc/lvw@public.gmane.org>

Acked-by: John Stultz <johnstul-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>

^ permalink raw reply

* Re: [PATCH] ipv4: Remove fib_hash.
From: David Miller @ 2011-02-02  2:15 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev
In-Reply-To: <1296607693.2607.7.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 02 Feb 2011 01:48:13 +0100

> Hmm... I know having to maintain two implementations is time consuming,
> but I know fib_trie is bigger :
> 
> # size net/ipv4/fib_*.o
>    text	   data	    bss	    dec	    hex	filename
>    7252	    120	      0	   7372	   1ccc	net/ipv4/fib_frontend.o
>    7279	     16	      4	   7299	   1c83	net/ipv4/fib_hash.o
>    1479	      0	      0	   1479	    5c7	net/ipv4/fib_rules.o
>    7885	      0	   2080	   9965	   26ed	net/ipv4/fib_semantics.o
>   16222	     16	     16	  16254	   3f7e	net/ipv4/fib_trie.o
> 
> In my tests, I know that fib_trie is more expensive for typical routing
> tables for hosts (no more than a dozen or entries), in latencies
> results, mostly because of icache misses, but also dcache ones.

It's mostly the rebalancing code that takes up the space.

The lookup path is on the same order of magnitude as the fib hash
stuff was.

In any event, we have several months to hash out any regressions and I
think if I didn't do this removal nobody would work on it so... :-)

^ permalink raw reply

* Re: [PATCHv2 dontapply] vhost-net tx tuning
From: Michael S. Tsirkin @ 2011-02-02  4:36 UTC (permalink / raw)
  To: Sridhar Samudrala; +Cc: Steve Dobbelstein, mashirle, kvm, netdev
In-Reply-To: <1296601658.30191.46.camel@sridhar.beaverton.ibm.com>

On Tue, Feb 01, 2011 at 03:07:38PM -0800, Sridhar Samudrala wrote:
> On Tue, 2011-02-01 at 17:52 +0200, Michael S. Tsirkin wrote:
> > OK, so thinking about it more, maybe the issue is this:
> > tx becomes full. We process one request and interrupt the guest,
> > then it adds one request and the queue is full again.
> > 
> > Maybe the following will help it stabilize?  By default with it we will
> > only interrupt when we see an empty ring.
> > Which is liklely too much: pls try other values
> > in the middle: e.g. make bufs half the ring,
> > or bytes some small value like half ring * 200, or packets some
> > small value etc.
> > 
> > Set any one parameter to 0 to get current
> > behaviour (interrupt immediately when enabled).
> > 
> > Warning: completely untested.
> > 
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > 
> > ---
> > 
> > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> > index aac05bc..6769cdc 100644
> > --- a/drivers/vhost/net.c
> > +++ b/drivers/vhost/net.c
> > @@ -32,6 +32,13 @@
> >   * Using this limit prevents one virtqueue from starving others. */
> >  #define VHOST_NET_WEIGHT 0x80000
> > 
> > +int tx_bytes_coalesce = 1000000000;
> > +module_param(tx_bytes_coalesce, int, 0644);
> > +int tx_bufs_coalesce = 1000000000;
> > +module_param(tx_bufs_coalesce, int, 0644);
> > +int tx_packets_coalesce = 1000000000;
> > +module_param(tx_packets_coalesce, int, 0644);
> > +
> >  enum {
> >  	VHOST_NET_VQ_RX = 0,
> >  	VHOST_NET_VQ_TX = 1,
> > @@ -127,6 +134,9 @@ static void handle_tx(struct vhost_net *net)
> >  	int err, wmem;
> >  	size_t hdr_size;
> >  	struct socket *sock;
> > +	int bytes_coalesced = 0;
> > +	int bufs_coalesced = 0;
> > +	int packets_coalesced = 0;
> > 
> >  	/* TODO: check that we are running from vhost_worker? */
> >  	sock = rcu_dereference_check(vq->private_data, 1);
> > @@ -196,14 +206,26 @@ static void handle_tx(struct vhost_net *net)
> >  		if (err != len)
> >  			pr_debug("Truncated TX packet: "
> >  				 " len %d != %zd\n", err, len);
> > -		vhost_add_used_and_signal(&net->dev, vq, head, 0);
> >  		total_len += len;
> > +		packets_coalesced += 1;
> > +		bytes_coalesced += len;
> > +		bufs_coalesced += out;
> > +		if (unlikely(packets_coalesced > tx_packets_coalesce ||
> > +			     bytes_coalesced > tx_bytes_coalesce ||
> > +			     bufs_coalesced > tx_bufs_coalesce))
> > +			vhost_add_used_and_signal(&net->dev, vq, head, 0);
> 
> I think the counters that exceed the limits need to be reset to 0 here.
> Otherwise we keep signaling for every buffer once we hit this condition.
> 
> Thanks
> Sridhar

Correct, good catch.

> > +		else
> > +			vhost_add_used(vq, head, 0);
> >  		if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
> >  			vhost_poll_queue(&vq->poll);
> >  			break;
> >  		}
> >  	}
> > 
> > +	if (likely(packets_coalesced &&
> > +		   bytes_coalesced &&
> > +		   bufs_coalesced))
> > +		vhost_signal(&net->dev, vq);
> >  	mutex_unlock(&vq->mutex);
> >  }
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe netdev" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: Network performance with small packets
From: Krishna Kumar2 @ 2011-02-02  4:39 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: David Miller, kvm, Shirley Ma, netdev, rusty, steved
In-Reply-To: <20110201214114.GA31105@redhat.com>

> "Michael S. Tsirkin" <mst@redhat.com> 02/02/2011 03:11 AM
>
> On Tue, Feb 01, 2011 at 01:28:45PM -0800, Shirley Ma wrote:
> > On Tue, 2011-02-01 at 23:21 +0200, Michael S. Tsirkin wrote:
> > > Confused. We compare capacity to skb frags, no?
> > > That's sg I think ...
> >
> > Current guest kernel use indirect buffers, num_free returns how many
> > available descriptors not skb frags. So it's wrong here.
> >
> > Shirley
>
> I see. Good point. In other words when we complete the buffer
> it was indirect, but when we add a new one we
> can not allocate indirect so we consume.
> And then we start the queue and add will fail.
> I guess we need some kind of API to figure out
> whether the buf we complete was indirect?
>
> Another failure mode is when skb_xmit_done
> wakes the queue: it might be too early, there
> might not be space for the next packet in the vq yet.

I am not sure if this is the problem - shouldn't you
see these messages:
	if (likely(capacity == -ENOMEM)) {
		dev_warn(&dev->dev,
			"TX queue failure: out of memory\n");
	} else {
		dev->stats.tx_fifo_errors++;
		dev_warn(&dev->dev,
			"Unexpected TX queue failure: %d\n",
			capacity);
	}
in next xmit? I am not getting this in my testing.

> A solution might be to keep some kind of pool
> around for indirect, we wanted to do it for block anyway ...

Your vhost patch should fix this automatically. Right?

Thanks,

- KK


^ permalink raw reply

* I have a question about BIGN
From: Bill Gross @ 2011-02-02  4:15 UTC (permalink / raw)
  To: netdev

I was searching online to find more info about BIGN
and I came across your information.

Can you tell me, are you still involved with BIGN? If
you are, how are things going for you?

Please let me know.

Thanks
Bill Gross

^ permalink raw reply

* Re: Network performance with small packets
From: Michael S. Tsirkin @ 2011-02-02  4:40 UTC (permalink / raw)
  To: Shirley Ma
  Cc: Sridhar Samudrala, Steve Dobbelstein, David Miller, kvm, mashirle,
	netdev
In-Reply-To: <1296601197.26937.833.camel@localhost.localdomain>

On Tue, Feb 01, 2011 at 02:59:57PM -0800, Shirley Ma wrote:
> On Tue, 2011-02-01 at 23:56 +0200, Michael S. Tsirkin wrote:
> > There are flags for bytes, buffers and packets.
> > Try playing with any one of them :)
> > Just be sure to use v2.
> > 
> > 
> > >I would like to change it to
> > > half of the ring size instead for signaling. Is that OK?
> > > 
> > > Shirley
> > > 
> > > 
> > 
> > Sure that is why I made it a parameter so you can experiment. 
> 
> The initial test results shows that the CPUs utilization has been
> reduced some, and BW has increased some with the default parameters,
> like 1K message size BW goes from 2.5Gb/s about 2.8Gb/s, CPU utilization
> down from 4x% to 38%, (Similar results from the patch I submitted a
> while ago to reduce signaling on vhost) but far away from dropping
> packet results.
> 
> I am going to change the code to use 1/2 ring size to wake the netif
> queue.
> 
> Shirley

Just tweak the parameters with sysfs, you do not have to edit the code:
echo 64 > /sys/module/vhost_net/parameters/tx_bufs_coalesce

Or in a similar way for tx_packets_coalesce (since we use indirect,
packets will typically use 1 buffer each).

-- 
MST

^ permalink raw reply

* Re: Network performance with small packets
From: Michael S. Tsirkin @ 2011-02-02  4:42 UTC (permalink / raw)
  To: Krishna Kumar2; +Cc: David Miller, kvm, Shirley Ma, netdev, rusty, steved
In-Reply-To: <OF5CFD7B6A.4CBE4E3B-ON6525782B.0014DC5B-6525782B.001979CB@in.ibm.com>

On Wed, Feb 02, 2011 at 10:09:18AM +0530, Krishna Kumar2 wrote:
> > "Michael S. Tsirkin" <mst@redhat.com> 02/02/2011 03:11 AM
> >
> > On Tue, Feb 01, 2011 at 01:28:45PM -0800, Shirley Ma wrote:
> > > On Tue, 2011-02-01 at 23:21 +0200, Michael S. Tsirkin wrote:
> > > > Confused. We compare capacity to skb frags, no?
> > > > That's sg I think ...
> > >
> > > Current guest kernel use indirect buffers, num_free returns how many
> > > available descriptors not skb frags. So it's wrong here.
> > >
> > > Shirley
> >
> > I see. Good point. In other words when we complete the buffer
> > it was indirect, but when we add a new one we
> > can not allocate indirect so we consume.
> > And then we start the queue and add will fail.
> > I guess we need some kind of API to figure out
> > whether the buf we complete was indirect?
> >
> > Another failure mode is when skb_xmit_done
> > wakes the queue: it might be too early, there
> > might not be space for the next packet in the vq yet.
> 
> I am not sure if this is the problem - shouldn't you
> see these messages:
> 	if (likely(capacity == -ENOMEM)) {
> 		dev_warn(&dev->dev,
> 			"TX queue failure: out of memory\n");
> 	} else {
> 		dev->stats.tx_fifo_errors++;
> 		dev_warn(&dev->dev,
> 			"Unexpected TX queue failure: %d\n",
> 			capacity);
> 	}
> in next xmit? I am not getting this in my testing.

Yes, I don't think we hit this in our testing,
simply because we don't stress memory.
Disable indirect, then you might see this.

> > A solution might be to keep some kind of pool
> > around for indirect, we wanted to do it for block anyway ...
> 
> Your vhost patch should fix this automatically. Right?

Reduce the chance of it happening, yes.

> 
> Thanks,
> 
> - KK

^ permalink raw reply

* Re:,,,,,
From: young chang @ 2011-02-02 13:47 UTC (permalink / raw)


May I ask if you would be eligible to pursue a Business Proposal of $19.7m with me if you dont mind? Let me know if you are interested.

^ permalink raw reply

* Re: [PATCH] include/net/genetlink.h: Allow genlmsg_cancel to accept a NULL argument
From: Julia Lawall @ 2011-02-02  5:51 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-kernel, paul.moore, kernel-janitors
In-Reply-To: <20110201.145410.115936566.davem@davemloft.net>

On Tue, 1 Feb 2011, David Miller wrote:

> From: Julia Lawall <julia@diku.dk>
> Date: Fri, 28 Jan 2011 16:43:40 +0100 (CET)
> 
> > nlmsg_cancel can accept NULL as its second argument, so for similarity,
> > this patch extends genlmsg_cancel to be able to accept a NULL second
> > argument as well.
> > 
> > Signed-off-by: Julia Lawall <julia@diku.dk>
> 
> I did a scan of all of the cases where this interface is used, and
> I cannot find a situation where this capability would even be useful.
> 
> The use pattern is always:
> 
> 	hdr = genlmsg_put(skb, ...);
> 	if (!hdr)
> 		goto out;
> 
> 	NLA_PUT_*();
> 	NLA_PUT_*();
> 	....
> 
> 	return genlmsg_end(skb, hdr);
> 
> nla_put_failure:
> 	genlmsg_cancel(skb, hdr);
> out:
> 	return -EWHATEVER;
> 
> Always, hdr will be non-NULL.
> 
> We have to allocate the header first, then put the netlink
> attributes.
> 
> Looking over users of nlmsg_cancel(), the situation seems to
> match identically.
> 
> Therefore, it seems to me that it makes more sense to remove
> the NULL check from nlmsg_cancel() than to add the NULL check
> to genlmsg_cancel().

I saw lots of cases that could be done like this, but were not; they had 
goto nla_put_failure instead.

I will double check.

julia

^ permalink raw reply

* Re: Network performance with small packets
From: Shirley Ma @ 2011-02-02  6:05 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Sridhar Samudrala, Steve Dobbelstein, David Miller, kvm, mashirle,
	netdev
In-Reply-To: <20110202044002.GB3818@redhat.com>

On Wed, 2011-02-02 at 06:40 +0200, Michael S. Tsirkin wrote:
> ust tweak the parameters with sysfs, you do not have to edit the code:
> echo 64 > /sys/module/vhost_net/parameters/tx_bufs_coalesce
> 
> Or in a similar way for tx_packets_coalesce (since we use indirect,
> packets will typically use 1 buffer each).

We should use packets instead of buffers, in indirect case, one packet
has multiple buffers, each packet uses one descriptor from the ring
(default size is 256).

echo 128 > /sys/module/vhost_net/parameters/tx_packets_coalesce

The way I am changing is only when netif queue has stopped, then we
start to count num_free descriptors to send the signal to wake netif
queue.

Shirley

^ permalink raw reply

* Re: [PATCH] include/net/genetlink.h: Allow genlmsg_cancel to accept a NULL argument
From: Julia Lawall @ 2011-02-02  6:17 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-kernel, paul.moore, kernel-janitors
In-Reply-To: <20110201.145410.115936566.davem@davemloft.net>

On Tue, 1 Feb 2011, David Miller wrote:

> From: Julia Lawall <julia@diku.dk>
> Date: Fri, 28 Jan 2011 16:43:40 +0100 (CET)
> 
> > nlmsg_cancel can accept NULL as its second argument, so for similarity,
> > this patch extends genlmsg_cancel to be able to accept a NULL second
> > argument as well.
> > 
> > Signed-off-by: Julia Lawall <julia@diku.dk>
> 
> I did a scan of all of the cases where this interface is used, and
> I cannot find a situation where this capability would even be useful.
> 
> The use pattern is always:
> 
> 	hdr = genlmsg_put(skb, ...);
> 	if (!hdr)
> 		goto out;
> 
> 	NLA_PUT_*();
> 	NLA_PUT_*();
> 	....
> 
> 	return genlmsg_end(skb, hdr);
> 
> nla_put_failure:
> 	genlmsg_cancel(skb, hdr);
> out:
> 	return -EWHATEVER;

This pattern occurred in eg:

net/netlabel/netlabel_unlabeled.c

in the function netlbl_unlabel_staticlist_gen and in other netlabel code, 
as well as in net/wireless/nl80211.c, but with the function nl80211hdr_put 
instead of genlmsg_put.  I submitted patches for all of these cases, so 
that is perhaps why you don't see them.  But someone suggested to change 
genlmsg_cancel as well, to be as permissive as nlmsg_cancel.

For nlmsg_cancel, there are two occurrences in 
net/netfilter/nf_conntrack_netlink.c where nlmsg_cancel is reachable with 
the second argument NULL.

For nlmsg_cancel the ability to accept NULL as a second argument comes 
from the fact that it only calls nlmsg_trim, which does nothing if NULL is 
the second argument.  nlmsg_trim is also called by nla_nest_cancel.  There 
are many calls to nla_nest_cancel with NULL as the second argument in the 
directory net/sched, for example in the function gred_dump in 
net/sched/sch_gred.c.  net/sched also contains a call to nlmsg_trim with 
NULL as the second argument, in the function flow_dump, in 
net/sched/cls_flow.c.

The whole thing seems somewhat sloppy.  I'm sure that all of the 
above-cited occurrences could be rewritten as outlined above to skip over 
the cancel/trim function.

julia

^ permalink raw reply

* Re: Network performance with small packets
From: Shirley Ma @ 2011-02-02  6:19 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Sridhar Samudrala, Steve Dobbelstein, David Miller, kvm, mashirle,
	netdev
In-Reply-To: <1296626748.26937.852.camel@localhost.localdomain>

On Tue, 2011-02-01 at 22:05 -0800, Shirley Ma wrote:
> 
> The way I am changing is only when netif queue has stopped, then we
> start to count num_free descriptors to send the signal to wake netif
> queue. 

I forgot to mention, the code change I am making is in guest kernel, in
xmit call back only wake up the queue when it's stopped && num_free >=
1/2 *vq->num, I add a new API in virtio_ring.

However vhost signaling reduction is needed as well. The patch I
submitted a while ago showed both CPUs and BW improvement.

Thanks
Shirley

^ permalink raw reply

* Re: Network performance with small packets
From: Michael S. Tsirkin @ 2011-02-02  6:29 UTC (permalink / raw)
  To: Shirley Ma
  Cc: Sridhar Samudrala, Steve Dobbelstein, David Miller, kvm, mashirle,
	netdev
In-Reply-To: <1296627549.26937.856.camel@localhost.localdomain>

On Tue, Feb 01, 2011 at 10:19:09PM -0800, Shirley Ma wrote:
> On Tue, 2011-02-01 at 22:05 -0800, Shirley Ma wrote:
> > 
> > The way I am changing is only when netif queue has stopped, then we
> > start to count num_free descriptors to send the signal to wake netif
> > queue. 
> 
> I forgot to mention, the code change I am making is in guest kernel, in
> xmit call back only wake up the queue when it's stopped && num_free >=
> 1/2 *vq->num, I add a new API in virtio_ring.

Interesting. Yes, I agree an API extension would be helpful. However,
wouldn't just the signaling reduction be enough, without guest changes?

> However vhost signaling reduction is needed as well. The patch I
> submitted a while ago showed both CPUs and BW improvement.
> 
> Thanks
> Shirley

Which patch was that?

-- 
MST

^ permalink raw reply

* Re: Network performance with small packets
From: Krishna Kumar2 @ 2011-02-02  6:34 UTC (permalink / raw)
  To: Shirley Ma
  Cc: David Miller, kvm, mashirle, Michael S. Tsirkin, netdev,
	netdev-owner, Sridhar Samudrala, Steve Dobbelstein
In-Reply-To: <1296627549.26937.856.camel@localhost.localdomain>

> On Tue, 2011-02-01 at 22:05 -0800, Shirley Ma wrote:
> >
> > The way I am changing is only when netif queue has stopped, then we
> > start to count num_free descriptors to send the signal to wake netif
> > queue.
>
> I forgot to mention, the code change I am making is in guest kernel, in
> xmit call back only wake up the queue when it's stopped && num_free >=
> 1/2 *vq->num, I add a new API in virtio_ring.

FYI :)

I have tried this before. There are a couple of issues:

1. the free count will not reduce until you run free_old_xmit_skbs,
   which will not run anymore since the tx queue is stopped.
2. You cannot call free_old_xmit_skbs directly as it races with a
   queue that was just awakened (current cb was due to the delay
   in disabling cb's).

You have to call free_old_xmit_skbs() under netif_queue_stopped()
check to avoid the race.

I got a small improvement in my testing upto some number of threads
(32 or 48?), but beyond that I was getting a regression.

Thanks,

- KK

> However vhost signaling reduction is needed as well. The patch I
> submitted a while ago showed both CPUs and BW improvement.

^ permalink raw reply

* Re: Network performance with small packets
From: Shirley Ma @ 2011-02-02  7:03 UTC (permalink / raw)
  To: Krishna Kumar2
  Cc: David Miller, kvm, mashirle, Michael S. Tsirkin, netdev,
	netdev-owner, Sridhar Samudrala, Steve Dobbelstein
In-Reply-To: <OFF5778D3C.84F46700-ON6525782B.00230A94-6525782B.00240890@in.ibm.com>

On Wed, 2011-02-02 at 12:04 +0530, Krishna Kumar2 wrote:
> > On Tue, 2011-02-01 at 22:05 -0800, Shirley Ma wrote:
> > >
> > > The way I am changing is only when netif queue has stopped, then
> we
> > > start to count num_free descriptors to send the signal to wake
> netif
> > > queue.
> >
> > I forgot to mention, the code change I am making is in guest kernel,
> in
> > xmit call back only wake up the queue when it's stopped && num_free
> >=
> > 1/2 *vq->num, I add a new API in virtio_ring.
> 
> FYI :)

> I have tried this before. There are a couple of issues:
> 
> 1. the free count will not reduce until you run free_old_xmit_skbs,
>    which will not run anymore since the tx queue is stopped.
> 2. You cannot call free_old_xmit_skbs directly as it races with a
>    queue that was just awakened (current cb was due to the delay
>    in disabling cb's).
> 
> You have to call free_old_xmit_skbs() under netif_queue_stopped()
> check to avoid the race.

Yes, that' what I did, when the netif queue stop, don't enable the
queue, just free_old_xmit_skbs(), if not enough freed, then enabling
callback until half of the ring size are freed, then wake the netif
queue. But somehow I didn't reach the performance compared to drop
packets, need to think about it more. :)

Thanks
Shirley


^ permalink raw reply

* Re: Network performance with small packets
From: Shirley Ma @ 2011-02-02  7:14 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Sridhar Samudrala, Steve Dobbelstein, David Miller, kvm, mashirle,
	netdev
In-Reply-To: <20110202062950.GD3818@redhat.com>

On Wed, 2011-02-02 at 08:29 +0200, Michael S. Tsirkin wrote:
> On Tue, Feb 01, 2011 at 10:19:09PM -0800, Shirley Ma wrote:
> > On Tue, 2011-02-01 at 22:05 -0800, Shirley Ma wrote:
> > > 
> > > The way I am changing is only when netif queue has stopped, then
> we
> > > start to count num_free descriptors to send the signal to wake
> netif
> > > queue. 
> > 
> > I forgot to mention, the code change I am making is in guest kernel,
> in
> > xmit call back only wake up the queue when it's stopped && num_free
> >=
> > 1/2 *vq->num, I add a new API in virtio_ring.
> 
> Interesting. Yes, I agree an API extension would be helpful. However,
> wouldn't just the signaling reduction be enough, without guest
> changes?

w/i guest change, I played around the parameters,for example: I could
get 3.7Gb/s with 42% CPU BW increasing from 2.5Gb/s for 1K message size,
w/i dropping packet, I was able to get up to 6.2Gb/s with similar CPU
usage.

> > However vhost signaling reduction is needed as well. The patch I
> > submitted a while ago showed both CPUs and BW improvement.
> > 
> > Thanks
> > Shirley
> 
> Which patch was that? 

The patch was called "vhost: TX used buffer guest signal accumulation".
You suggested to split add_used_bufs and signal. I am still thinking
what's the best approach to cooperate guest (virtio_kick) and
vhost(handle_tx), vhost(signaling) and guest (xmit callback) to reduce
the overheads, so I haven't submit the new patch yet.

Thanks
Shirley

^ permalink raw reply

* Re: Network performance with small packets
From: Shirley Ma @ 2011-02-02  7:33 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Sridhar Samudrala, Steve Dobbelstein, David Miller, kvm, mashirle,
	netdev
In-Reply-To: <1296630891.26937.870.camel@localhost.localdomain>

On Tue, 2011-02-01 at 23:14 -0800, Shirley Ma wrote:
> w/i guest change, I played around the parameters,for example: I could
> get 3.7Gb/s with 42% CPU BW increasing from 2.5Gb/s for 1K message
> size,
> w/i dropping packet, I was able to get up to 6.2Gb/s with similar CPU
> usage. 

I meant w/o guest change, only vhost changes. Sorry about that.

Shirley


^ permalink raw reply

* Re: Network performance with small packets
From: Krishna Kumar2 @ 2011-02-02  7:37 UTC (permalink / raw)
  To: Shirley Ma
  Cc: David Miller, kvm, mashirle, Michael S. Tsirkin, netdev,
	netdev-owner, Sridhar Samudrala, Steve Dobbelstein
In-Reply-To: <1296630226.26937.859.camel@localhost.localdomain>

> Shirley Ma <mashirle@us.ibm.com> wrote:
>
> > I have tried this before. There are a couple of issues:
> >
> > 1. the free count will not reduce until you run free_old_xmit_skbs,
> >    which will not run anymore since the tx queue is stopped.
> > 2. You cannot call free_old_xmit_skbs directly as it races with a
> >    queue that was just awakened (current cb was due to the delay
> >    in disabling cb's).
> >
> > You have to call free_old_xmit_skbs() under netif_queue_stopped()
> > check to avoid the race.
>
> Yes, that' what I did, when the netif queue stop, don't enable the
> queue, just free_old_xmit_skbs(), if not enough freed, then enabling
> callback until half of the ring size are freed, then wake the netif
> queue. But somehow I didn't reach the performance compared to drop
> packets, need to think about it more. :)

Did you check if the number of vmexits increased with this
patch? This is possible if the device was keeping up (and
not going into a stop, start, xmit 1 packet, stop, start
loop). Also maybe you should try for 1/4th instead of 1/2?

MST's delayed signalling should avoid this issue, I haven't
tried both together.

Thanks,

- KK


^ permalink raw reply

* RE: Using ethernet device as efficient small packet generator
From: juice @ 2011-02-02  8:13 UTC (permalink / raw)
  To: Brandeburg, Jesse, Loke, Chetan, Jon Zhou, Eric Dumazet,
	"Stephen Hemming
In-Reply-To: <8ad1defdf427ceb7af94fad4d216b006.squirrel@www.liukuma.net>

>
>> your computation of Bandwidth (as Ben Greear said) is not accounting for
>> the interframe gaps.  Maybe more useful is to note that wire speed 64
>> byte packets is 1.44 Million packets per second.
>
> I am aware of the fact that interframe gap eats away some of the bandwidth
> from actual data bytes, and I am taking that into consideration.
> My benchmark here is the Spirent AX4000 network analyzer, which can send
> and receive full utilization of GE line.
>
> The measurement when sending full line rate from AX4000 are:
>   Total bitrate:             761.903 MBits/s
>   Packet rate:               1488090 packets/s
>   Bandwidth:                 76.19% GE
>   Average packet intereval:  0.67 us
>
>
>> I think you need different hardware (again) as you have saddled yourself
>> with a x1 PCIe connected adapter.  This adapter is not well suited to
>> small packet traffic because the sheer amount of transactions is
>> effected
>> by the added latency due to the x1 connector (vs our dual port 1GbE
>> adapters with a x4 connector)
>>
>> with Core i3/5/7 or newer cpus you should be able to saturate a 1Gb link
>> with a single core/queue.  With Core2 era processors you may have some
>> difficulty, with anything older than that you won't make it. :-)
>>
>> My suggestion is to get one of the igb based adapters, 82576, or 82580
>> based that run the igb driver.
>>
>> If you can't get a hold of those you should be able to easily get 1.1M
>> pps from an 82571 adapter.
>
> I will order the 82576 card and try my tests with that.
>

Okay, now I just installed the new hot 82576 DualGE adapter and compiled
the igb module for 2.6.38-rc2 kernel I am running on.

The results with this adapter look very promising, now I am able to reach
the full GE bandwidth with 64 byte packets with only interrupt cpu affinity
tuning, no other tweaks needed:

root@d8labralinux:/var/home/juice/pkt_test# cat /proc/net/pktgen/eth1
Params: count 10000000  min_pkt_size: 60  max_pkt_size: 60
     frags: 0  delay: 0  clone_skb: 0  ifname: eth1
     flows: 0 flowlen: 0
     queue_map_min: 0  queue_map_max: 0
     dst_min: 10.10.11.2  dst_max:
        src_min:   src_max:
     src_mac: 00:1b:21:97:21:76 dst_mac: 00:04:23:08:91:dc
     udp_src_min: 9  udp_src_max: 9  udp_dst_min: 9  udp_dst_max: 9
     src_mac_count: 0  dst_mac_count: 0
     Flags:
Current:
     pkts-sofar: 10000000  errors: 0
     started: 1941436194us  stopped: 1948155853us idle: 179us
     seq_num: 10000001  cur_dst_mac_offset: 0  cur_src_mac_offset: 0
     cur_saddr: 0x0  cur_daddr: 0x20b0a0a
     cur_udp_dst: 9  cur_udp_src: 9
     cur_queue_map: 0
     flows: 0
Result: OK: 6719658(c6719479+d179) nsec, 10000000 (60byte,0frags)
  1488170pps 714Mb/sec (714321600bps) errors: 0

AX4000 measurements:
   Total bitrate:             761.910 MBits/s
   Packet rate:               1488106 packets/s
   Bandwidth:                 76.19% GE
   Average packet intereval:  0.67 us

Now, I need to check if I can send similar rates from userspace socket
interface. If that is possible then it may be so that I do not even need
to create a kernel driver for my application.

Yours, Jussi Ohenoja



^ permalink raw reply

* Re: [PATCH v3 1/3] iproute2: add support for setting device groups
From: Patrick McHardy @ 2011-02-02  8:56 UTC (permalink / raw)
  To: Vlad Dogaru; +Cc: netdev, Stephen Hemminger
In-Reply-To: <1296060086-18777-2-git-send-email-ddvlad@rosedu.org>

On 26.01.2011 17:41, Vlad Dogaru wrote:
> Use the group keyword to specify what group the device should belong to.
> Since the kernel uses numbers internally, mapping of group names to
> numbers is defined in /etc/iproute2/group_map. Example usage:
> 
>   ip link set dev eth0 group default
> 
> @@ -297,6 +299,13 @@ int iplink_parse(int argc, char **argv, struct iplink_req *req,
>  			if (get_integer(&mtu, *argv, 0))
>  				invarg("Invalid \"mtu\" value\n", *argv);
>  			addattr_l(&req->n, sizeof(*req), IFLA_MTU, &mtu, 4);
> +		} else if (strcmp(*argv, "group") == 0) {
> +			NEXT_ARG();
> +			if (group != -1)
> +				duparg("group", *argv);
> +			if (lookup_map_id(*argv, &group, GROUP_MAP))
> +				invarg("Invalid \"group\" value\n", *argv);
> +			addattr_l(&req->n, sizeof(*req), IFLA_GROUP, &group, 4);

I think it would be preferrable to use a function similar to
rt_realm_n2a() that can also handle plain numerical values.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox