Netdev List
 help / color / mirror / Atom feed
* Re:,,,,,
From: young chang @ 2011-02-02 13:47 UTC (permalink / raw)


May I ask if you would be eligible to pursue a Business Proposal of $19.7m with me if you dont mind? Let me know if you are interested.

^ permalink raw reply

* Re: Network performance with small packets
From: Michael S. Tsirkin @ 2011-02-02  4:42 UTC (permalink / raw)
  To: Krishna Kumar2; +Cc: David Miller, kvm, Shirley Ma, netdev, rusty, steved
In-Reply-To: <OF5CFD7B6A.4CBE4E3B-ON6525782B.0014DC5B-6525782B.001979CB@in.ibm.com>

On Wed, Feb 02, 2011 at 10:09:18AM +0530, Krishna Kumar2 wrote:
> > "Michael S. Tsirkin" <mst@redhat.com> 02/02/2011 03:11 AM
> >
> > On Tue, Feb 01, 2011 at 01:28:45PM -0800, Shirley Ma wrote:
> > > On Tue, 2011-02-01 at 23:21 +0200, Michael S. Tsirkin wrote:
> > > > Confused. We compare capacity to skb frags, no?
> > > > That's sg I think ...
> > >
> > > Current guest kernel use indirect buffers, num_free returns how many
> > > available descriptors not skb frags. So it's wrong here.
> > >
> > > Shirley
> >
> > I see. Good point. In other words when we complete the buffer
> > it was indirect, but when we add a new one we
> > can not allocate indirect so we consume.
> > And then we start the queue and add will fail.
> > I guess we need some kind of API to figure out
> > whether the buf we complete was indirect?
> >
> > Another failure mode is when skb_xmit_done
> > wakes the queue: it might be too early, there
> > might not be space for the next packet in the vq yet.
> 
> I am not sure if this is the problem - shouldn't you
> see these messages:
> 	if (likely(capacity == -ENOMEM)) {
> 		dev_warn(&dev->dev,
> 			"TX queue failure: out of memory\n");
> 	} else {
> 		dev->stats.tx_fifo_errors++;
> 		dev_warn(&dev->dev,
> 			"Unexpected TX queue failure: %d\n",
> 			capacity);
> 	}
> in next xmit? I am not getting this in my testing.

Yes, I don't think we hit this in our testing,
simply because we don't stress memory.
Disable indirect, then you might see this.

> > A solution might be to keep some kind of pool
> > around for indirect, we wanted to do it for block anyway ...
> 
> Your vhost patch should fix this automatically. Right?

Reduce the chance of it happening, yes.

> 
> Thanks,
> 
> - KK

^ permalink raw reply

* Re: Network performance with small packets
From: Michael S. Tsirkin @ 2011-02-02  4:40 UTC (permalink / raw)
  To: Shirley Ma
  Cc: Sridhar Samudrala, Steve Dobbelstein, David Miller, kvm, mashirle,
	netdev
In-Reply-To: <1296601197.26937.833.camel@localhost.localdomain>

On Tue, Feb 01, 2011 at 02:59:57PM -0800, Shirley Ma wrote:
> On Tue, 2011-02-01 at 23:56 +0200, Michael S. Tsirkin wrote:
> > There are flags for bytes, buffers and packets.
> > Try playing with any one of them :)
> > Just be sure to use v2.
> > 
> > 
> > >I would like to change it to
> > > half of the ring size instead for signaling. Is that OK?
> > > 
> > > Shirley
> > > 
> > > 
> > 
> > Sure that is why I made it a parameter so you can experiment. 
> 
> The initial test results shows that the CPUs utilization has been
> reduced some, and BW has increased some with the default parameters,
> like 1K message size BW goes from 2.5Gb/s about 2.8Gb/s, CPU utilization
> down from 4x% to 38%, (Similar results from the patch I submitted a
> while ago to reduce signaling on vhost) but far away from dropping
> packet results.
> 
> I am going to change the code to use 1/2 ring size to wake the netif
> queue.
> 
> Shirley

Just tweak the parameters with sysfs, you do not have to edit the code:
echo 64 > /sys/module/vhost_net/parameters/tx_bufs_coalesce

Or in a similar way for tx_packets_coalesce (since we use indirect,
packets will typically use 1 buffer each).

-- 
MST

^ permalink raw reply

* I have a question about BIGN
From: Bill Gross @ 2011-02-02  4:15 UTC (permalink / raw)
  To: netdev



I was searching online to find more info about BIGN
and I came across your information.
 
Can you tell me, are you still involved with BIGN? If
you are, how are things going for you?
 
Please let me know.
 
Thanks
Bill Gross
 
 





^ permalink raw reply

* Re: Network performance with small packets
From: Krishna Kumar2 @ 2011-02-02  4:39 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: David Miller, kvm, Shirley Ma, netdev, rusty, steved
In-Reply-To: <20110201214114.GA31105@redhat.com>

> "Michael S. Tsirkin" <mst@redhat.com> 02/02/2011 03:11 AM
>
> On Tue, Feb 01, 2011 at 01:28:45PM -0800, Shirley Ma wrote:
> > On Tue, 2011-02-01 at 23:21 +0200, Michael S. Tsirkin wrote:
> > > Confused. We compare capacity to skb frags, no?
> > > That's sg I think ...
> >
> > Current guest kernel use indirect buffers, num_free returns how many
> > available descriptors not skb frags. So it's wrong here.
> >
> > Shirley
>
> I see. Good point. In other words when we complete the buffer
> it was indirect, but when we add a new one we
> can not allocate indirect so we consume.
> And then we start the queue and add will fail.
> I guess we need some kind of API to figure out
> whether the buf we complete was indirect?
>
> Another failure mode is when skb_xmit_done
> wakes the queue: it might be too early, there
> might not be space for the next packet in the vq yet.

I am not sure if this is the problem - shouldn't you
see these messages:
	if (likely(capacity == -ENOMEM)) {
		dev_warn(&dev->dev,
			"TX queue failure: out of memory\n");
	} else {
		dev->stats.tx_fifo_errors++;
		dev_warn(&dev->dev,
			"Unexpected TX queue failure: %d\n",
			capacity);
	}
in next xmit? I am not getting this in my testing.

> A solution might be to keep some kind of pool
> around for indirect, we wanted to do it for block anyway ...

Your vhost patch should fix this automatically. Right?

Thanks,

- KK


^ permalink raw reply

* Re: [PATCHv2 dontapply] vhost-net tx tuning
From: Michael S. Tsirkin @ 2011-02-02  4:36 UTC (permalink / raw)
  To: Sridhar Samudrala; +Cc: Steve Dobbelstein, mashirle, kvm, netdev
In-Reply-To: <1296601658.30191.46.camel@sridhar.beaverton.ibm.com>

On Tue, Feb 01, 2011 at 03:07:38PM -0800, Sridhar Samudrala wrote:
> On Tue, 2011-02-01 at 17:52 +0200, Michael S. Tsirkin wrote:
> > OK, so thinking about it more, maybe the issue is this:
> > tx becomes full. We process one request and interrupt the guest,
> > then it adds one request and the queue is full again.
> > 
> > Maybe the following will help it stabilize?  By default with it we will
> > only interrupt when we see an empty ring.
> > Which is liklely too much: pls try other values
> > in the middle: e.g. make bufs half the ring,
> > or bytes some small value like half ring * 200, or packets some
> > small value etc.
> > 
> > Set any one parameter to 0 to get current
> > behaviour (interrupt immediately when enabled).
> > 
> > Warning: completely untested.
> > 
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > 
> > ---
> > 
> > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> > index aac05bc..6769cdc 100644
> > --- a/drivers/vhost/net.c
> > +++ b/drivers/vhost/net.c
> > @@ -32,6 +32,13 @@
> >   * Using this limit prevents one virtqueue from starving others. */
> >  #define VHOST_NET_WEIGHT 0x80000
> > 
> > +int tx_bytes_coalesce = 1000000000;
> > +module_param(tx_bytes_coalesce, int, 0644);
> > +int tx_bufs_coalesce = 1000000000;
> > +module_param(tx_bufs_coalesce, int, 0644);
> > +int tx_packets_coalesce = 1000000000;
> > +module_param(tx_packets_coalesce, int, 0644);
> > +
> >  enum {
> >  	VHOST_NET_VQ_RX = 0,
> >  	VHOST_NET_VQ_TX = 1,
> > @@ -127,6 +134,9 @@ static void handle_tx(struct vhost_net *net)
> >  	int err, wmem;
> >  	size_t hdr_size;
> >  	struct socket *sock;
> > +	int bytes_coalesced = 0;
> > +	int bufs_coalesced = 0;
> > +	int packets_coalesced = 0;
> > 
> >  	/* TODO: check that we are running from vhost_worker? */
> >  	sock = rcu_dereference_check(vq->private_data, 1);
> > @@ -196,14 +206,26 @@ static void handle_tx(struct vhost_net *net)
> >  		if (err != len)
> >  			pr_debug("Truncated TX packet: "
> >  				 " len %d != %zd\n", err, len);
> > -		vhost_add_used_and_signal(&net->dev, vq, head, 0);
> >  		total_len += len;
> > +		packets_coalesced += 1;
> > +		bytes_coalesced += len;
> > +		bufs_coalesced += out;
> > +		if (unlikely(packets_coalesced > tx_packets_coalesce ||
> > +			     bytes_coalesced > tx_bytes_coalesce ||
> > +			     bufs_coalesced > tx_bufs_coalesce))
> > +			vhost_add_used_and_signal(&net->dev, vq, head, 0);
> 
> I think the counters that exceed the limits need to be reset to 0 here.
> Otherwise we keep signaling for every buffer once we hit this condition.
> 
> Thanks
> Sridhar

Correct, good catch.

> > +		else
> > +			vhost_add_used(vq, head, 0);
> >  		if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
> >  			vhost_poll_queue(&vq->poll);
> >  			break;
> >  		}
> >  	}
> > 
> > +	if (likely(packets_coalesced &&
> > +		   bytes_coalesced &&
> > +		   bufs_coalesced))
> > +		vhost_signal(&net->dev, vq);
> >  	mutex_unlock(&vq->mutex);
> >  }
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe netdev" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH] ipv4: Remove fib_hash.
From: David Miller @ 2011-02-02  2:15 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev
In-Reply-To: <1296607693.2607.7.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 02 Feb 2011 01:48:13 +0100

> Hmm... I know having to maintain two implementations is time consuming,
> but I know fib_trie is bigger :
> 
> # size net/ipv4/fib_*.o
>    text	   data	    bss	    dec	    hex	filename
>    7252	    120	      0	   7372	   1ccc	net/ipv4/fib_frontend.o
>    7279	     16	      4	   7299	   1c83	net/ipv4/fib_hash.o
>    1479	      0	      0	   1479	    5c7	net/ipv4/fib_rules.o
>    7885	      0	   2080	   9965	   26ed	net/ipv4/fib_semantics.o
>   16222	     16	     16	  16254	   3f7e	net/ipv4/fib_trie.o
> 
> In my tests, I know that fib_trie is more expensive for typical routing
> tables for hosts (no more than a dozen or entries), in latencies
> results, mostly because of icache misses, but also dcache ones.

It's mostly the rebalancing code that takes up the space.

The lookup path is on the same order of magnitude as the fib hash
stuff was.

In any event, we have several months to hash out any regressions and I
think if I didn't do this removal nobody would work on it so... :-)

^ permalink raw reply

* Re: [PATCH V10 15/15] ptp: Added a clock driver for the National Semiconductor PHYTER.
From: john stultz @ 2011-02-02  2:11 UTC (permalink / raw)
  To: Richard Cochran
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
	Alan Cox, Arnd Bergmann, Christoph Lameter, David Miller,
	Krzysztof Halasa, Peter Zijlstra, Rodolfo Giometti,
	Thomas Gleixner, Benjamin Herrenschmidt, H. Peter Anvin,
	Ingo Molnar, Mike Frysinger, Paul Mackerras, Russell King
In-Reply-To: <9a8c774950f1a78259a49681ee2fe7977ace44d6.1296124770.git.richard.cochran-3mrvs1K0uXizZXS1Dc/lvw@public.gmane.org>

On Thu, 2011-01-27 at 12:00 +0100, Richard Cochran wrote:
> This patch adds support for the PTP clock found on the DP83640.
> The basic clock operations and one external time stamp have
> been implemented.
> 
> Signed-off-by: Richard Cochran <richard.cochran-3mrvs1K0uXizZXS1Dc/lvw@public.gmane.org>

Acked-by: John Stultz <johnstul-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>

^ permalink raw reply

* Re: [PATCH V10 14/15] ptp: Added a clock driver for the IXP46x.
From: john stultz @ 2011-02-02  2:10 UTC (permalink / raw)
  To: Richard Cochran
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
	Alan Cox, Arnd Bergmann, Christoph Lameter, David Miller,
	Krzysztof Halasa, Peter Zijlstra, Rodolfo Giometti,
	Thomas Gleixner, Benjamin Herrenschmidt, H. Peter Anvin,
	Ingo Molnar, Mike Frysinger, Paul Mackerras, Russell King
In-Reply-To: <0319138702479bc484596a2b10c681a7d8313b98.1296124770.git.richard.cochran-3mrvs1K0uXizZXS1Dc/lvw@public.gmane.org>

On Thu, 2011-01-27 at 12:00 +0100, Richard Cochran wrote:
> This patch adds a driver for the hardware time stamping unit found on the
> IXP465. The basic clock operations and an external trigger are implemented.
> 
> Signed-off-by: Richard Cochran <richard.cochran-3mrvs1K0uXizZXS1Dc/lvw@public.gmane.org>

Minor nit below, but otherwise,
Acked-by: John Stultz <johnstul-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>

> diff --git a/drivers/ptp/ptp_ixp46x.c b/drivers/ptp/ptp_ixp46x.c
> new file mode 100644
> index 0000000..c860030
> --- /dev/null
> +++ b/drivers/ptp/ptp_ixp46x.c
[snip]
> +static int __init ptp_ixp_init(void)
> +{
> +	if (!cpu_is_ixp46x())
> +		return -ENODEV;
> +
> +	ixp_clock.regs = 
> +		(struct ixp46x_ts_regs __iomem *)IXP4XX_TIMESYNC_BASE_VIRT;

Minor whitespace issue here

^ permalink raw reply

* Re: [PATCH V10 13/15] ptp: Added a clock that uses the eTSEC found on the MPC85xx.
From: john stultz @ 2011-02-02  2:04 UTC (permalink / raw)
  To: Richard Cochran
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
	Alan Cox, Arnd Bergmann, Christoph Lameter, David Miller,
	Krzysztof Halasa, Peter Zijlstra, Rodolfo Giometti,
	Thomas Gleixner, Benjamin Herrenschmidt, H. Peter Anvin,
	Ingo Molnar, Mike Frysinger, Paul Mackerras, Russell King
In-Reply-To: <606a5342baa53c6aa30b3bc374d5ef2d50ddfc0e.1296124770.git.richard.cochran-3mrvs1K0uXizZXS1Dc/lvw@public.gmane.org>

On Thu, 2011-01-27 at 11:59 +0100, Richard Cochran wrote:
> The eTSEC includes a PTP clock with quite a few features. This patch adds
> support for the basic clock adjustment functions, plus two external time
> stamps, one alarm, and the PPS callback.
> 
> Signed-off-by: Richard Cochran <richard.cochran-3mrvs1K0uXizZXS1Dc/lvw@public.gmane.org>

This looks ok to me.

Acked-by: John Stultz <johnstul-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>

> ---
>  Documentation/powerpc/dts-bindings/fsl/tsec.txt |   57 +++
>  arch/powerpc/boot/dts/mpc8313erdb.dts           |   14 +
>  arch/powerpc/boot/dts/mpc8572ds.dts             |   14 +
>  arch/powerpc/boot/dts/p2020ds.dts               |   14 +
>  arch/powerpc/boot/dts/p2020rdb.dts              |   14 +
>  drivers/net/Makefile                            |    1 +
>  drivers/net/gianfar_ptp.c                       |  448 +++++++++++++++++++++++
>  drivers/net/gianfar_ptp_reg.h                   |  113 ++++++
>  drivers/ptp/Kconfig                             |   13 +
>  9 files changed, 688 insertions(+), 0 deletions(-)
>  create mode 100644 drivers/net/gianfar_ptp.c
>  create mode 100644 drivers/net/gianfar_ptp_reg.h
> 
> diff --git a/Documentation/powerpc/dts-bindings/fsl/tsec.txt b/Documentation/powerpc/dts-bindings/fsl/tsec.txt
> index edb7ae1..f6edbb8 100644
> --- a/Documentation/powerpc/dts-bindings/fsl/tsec.txt
> +++ b/Documentation/powerpc/dts-bindings/fsl/tsec.txt
> @@ -74,3 +74,60 @@ Example:
>  		interrupt-parent = <&mpic>;
>  		phy-handle = <&phy0>
>  	};
> +
> +* Gianfar PTP clock nodes
> +
> +General Properties:
> +
> +  - compatible   Should be "fsl,etsec-ptp"
> +  - reg          Offset and length of the register set for the device
> +  - interrupts   There should be at least two interrupts. Some devices
> +                 have as many as four PTP related interrupts.
> +
> +Clock Properties:
> +
> +  - tclk-period  Timer reference clock period in nanoseconds.
> +  - tmr-prsc     Prescaler, divides the output clock.
> +  - tmr-add      Frequency compensation value.
> +  - cksel        0= external clock, 1= eTSEC system clock, 3= RTC clock input.
> +                 Currently the driver only supports choice "1".
> +  - tmr-fiper1   Fixed interval period pulse generator.
> +  - tmr-fiper2   Fixed interval period pulse generator.
> +  - max-adj      Maximum frequency adjustment in parts per billion.
> +
> +  These properties set the operational parameters for the PTP
> +  clock. You must choose these carefully for the clock to work right.
> +  Here is how to figure good values:
> +
> +  TimerOsc     = system clock               MHz
> +  tclk_period  = desired clock period       nanoseconds
> +  NominalFreq  = 1000 / tclk_period         MHz
> +  FreqDivRatio = TimerOsc / NominalFreq     (must be greater that 1.0)
> +  tmr_add      = ceil(2^32 / FreqDivRatio)
> +  OutputClock  = NominalFreq / tmr_prsc     MHz
> +  PulseWidth   = 1 / OutputClock            microseconds
> +  FiperFreq1   = desired frequency in Hz
> +  FiperDiv1    = 1000000 * OutputClock / FiperFreq1
> +  tmr_fiper1   = tmr_prsc * tclk_period * FiperDiv1 - tclk_period
> +  max_adj      = 1000000000 * (FreqDivRatio - 1.0) - 1
> +
> +  The calculation for tmr_fiper2 is the same as for tmr_fiper1. The
> +  driver expects that tmr_fiper1 will be correctly set to produce a 1
> +  Pulse Per Second (PPS) signal, since this will be offered to the PPS
> +  subsystem to synchronize the Linux clock.
> +
> +Example:
> +
> +	ptp_clock@24E00 {
> +		compatible = "fsl,etsec-ptp";
> +		reg = <0x24E00 0xB0>;
> +		interrupts = <12 0x8 13 0x8>;
> +		interrupt-parent = < &ipic >;
> +		tclk-period = <10>;
> +		tmr-prsc    = <100>;
> +		tmr-add     = <0x999999A4>;
> +		cksel       = <0x1>;
> +		tmr-fiper1  = <0x3B9AC9F6>;
> +		tmr-fiper2  = <0x00018696>;
> +		max-adj     = <659999998>;
> +	};
> diff --git a/arch/powerpc/boot/dts/mpc8313erdb.dts b/arch/powerpc/boot/dts/mpc8313erdb.dts
> index 183f2aa..85a7eaa 100644
> --- a/arch/powerpc/boot/dts/mpc8313erdb.dts
> +++ b/arch/powerpc/boot/dts/mpc8313erdb.dts
> @@ -208,6 +208,20 @@
>  			sleep = <&pmc 0x00300000>;
>  		};
> 
> +		ptp_clock@24E00 {
> +			compatible = "fsl,etsec-ptp";
> +			reg = <0x24E00 0xB0>;
> +			interrupts = <12 0x8 13 0x8>;
> +			interrupt-parent = < &ipic >;
> +			tclk-period = <10>;
> +			tmr-prsc    = <100>;
> +			tmr-add     = <0x999999A4>;
> +			cksel       = <0x1>;
> +			tmr-fiper1  = <0x3B9AC9F6>;
> +			tmr-fiper2  = <0x00018696>;
> +			max-adj     = <659999998>;
> +		};
> +
>  		enet0: ethernet@24000 {
>  			#address-cells = <1>;
>  			#size-cells = <1>;
> diff --git a/arch/powerpc/boot/dts/mpc8572ds.dts b/arch/powerpc/boot/dts/mpc8572ds.dts
> index cafc128..74208cd 100644
> --- a/arch/powerpc/boot/dts/mpc8572ds.dts
> +++ b/arch/powerpc/boot/dts/mpc8572ds.dts
> @@ -324,6 +324,20 @@
>  			};
>  		};
> 
> +		ptp_clock@24E00 {
> +			compatible = "fsl,etsec-ptp";
> +			reg = <0x24E00 0xB0>;
> +			interrupts = <68 2 69 2 70 2 71 2>;
> +			interrupt-parent = < &mpic >;
> +			tclk-period = <5>;
> +			tmr-prsc = <200>;
> +			tmr-add = <0xAAAAAAAB>;
> +			cksel = <1>;
> +			tmr-fiper1 = <0x3B9AC9FB>;
> +			tmr-fiper2 = <0x3B9AC9FB>;
> +			max-adj = <499999999>;
> +		};
> +
>  		enet0: ethernet@24000 {
>  			#address-cells = <1>;
>  			#size-cells = <1>;
> diff --git a/arch/powerpc/boot/dts/p2020ds.dts b/arch/powerpc/boot/dts/p2020ds.dts
> index 1101914..39d73bb 100644
> --- a/arch/powerpc/boot/dts/p2020ds.dts
> +++ b/arch/powerpc/boot/dts/p2020ds.dts
> @@ -336,6 +336,20 @@
>  			phy_type = "ulpi";
>  		};
> 
> +		ptp_clock@24E00 {
> +			compatible = "fsl,etsec-ptp";
> +			reg = <0x24E00 0xB0>;
> +			interrupts = <68 2 69 2 70 2>;
> +			interrupt-parent = < &mpic >;
> +			tclk-period = <5>;
> +			tmr-prsc = <200>;
> +			tmr-add = <0xCCCCCCCD>;
> +			cksel = <1>;
> +			tmr-fiper1 = <0x3B9AC9FB>;
> +			tmr-fiper2 = <0x0001869B>;
> +			max-adj = <249999999>;
> +		};
> +
>  		enet0: ethernet@24000 {
>  			#address-cells = <1>;
>  			#size-cells = <1>;
> diff --git a/arch/powerpc/boot/dts/p2020rdb.dts b/arch/powerpc/boot/dts/p2020rdb.dts
> index da4cb0d..5498fb9 100644
> --- a/arch/powerpc/boot/dts/p2020rdb.dts
> +++ b/arch/powerpc/boot/dts/p2020rdb.dts
> @@ -396,6 +396,20 @@
>  			phy_type = "ulpi";
>  		};
> 
> +		ptp_clock@24E00 {
> +			compatible = "fsl,etsec-ptp";
> +			reg = <0x24E00 0xB0>;
> +			interrupts = <68 2 69 2 70 2>;
> +			interrupt-parent = < &mpic >;
> +			tclk-period = <5>;
> +			tmr-prsc = <200>;
> +			tmr-add = <0xCCCCCCCD>;
> +			cksel = <1>;
> +			tmr-fiper1 = <0x3B9AC9FB>;
> +			tmr-fiper2 = <0x0001869B>;
> +			max-adj = <249999999>;
> +		};
> +
>  		enet0: ethernet@24000 {
>  			#address-cells = <1>;
>  			#size-cells = <1>;
> diff --git a/drivers/net/Makefile b/drivers/net/Makefile
> index b90738d..c303f5f 100644
> --- a/drivers/net/Makefile
> +++ b/drivers/net/Makefile
> @@ -31,6 +31,7 @@ obj-$(CONFIG_ATL2) += atlx/
>  obj-$(CONFIG_ATL1E) += atl1e/
>  obj-$(CONFIG_ATL1C) += atl1c/
>  obj-$(CONFIG_GIANFAR) += gianfar_driver.o
> +obj-$(CONFIG_PTP_1588_CLOCK_GIANFAR) += gianfar_ptp.o
>  obj-$(CONFIG_TEHUTI) += tehuti.o
>  obj-$(CONFIG_ENIC) += enic/
>  obj-$(CONFIG_JME) += jme.o
> diff --git a/drivers/net/gianfar_ptp.c b/drivers/net/gianfar_ptp.c
> new file mode 100644
> index 0000000..84fff15
> --- /dev/null
> +++ b/drivers/net/gianfar_ptp.c
> @@ -0,0 +1,448 @@
> +/*
> + * PTP 1588 clock using the eTSEC
> + *
> + * Copyright (C) 2010 OMICRON electronics GmbH
> + *
> + *  This program is free software; you can redistribute it and/or modify
> + *  it under the terms of the GNU General Public License as published by
> + *  the Free Software Foundation; either version 2 of the License, or
> + *  (at your option) any later version.
> + *
> + *  This program is distributed in the hope that it will be useful,
> + *  but WITHOUT ANY WARRANTY; without even the implied warranty of
> + *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + *  GNU General Public License for more details.
> + *
> + *  You should have received a copy of the GNU General Public License
> + *  along with this program; if not, write to the Free Software
> + *  Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
> + */
> +#include <linux/device.h>
> +#include <linux/hrtimer.h>
> +#include <linux/init.h>
> +#include <linux/interrupt.h>
> +#include <linux/kernel.h>
> +#include <linux/module.h>
> +#include <linux/of.h>
> +#include <linux/of_platform.h>
> +#include <linux/timex.h>
> +#include <linux/io.h>
> +
> +#include <linux/ptp_clock_kernel.h>
> +
> +#include "gianfar_ptp_reg.h"
> +#include "gianfar.h"
> +
> +#define DRIVER		"gianfar_ptp"
> +#define N_ALARM		1 /* first alarm is used internally to reset fipers */
> +#define N_EXT_TS	2
> +#define REG_SIZE	sizeof(struct gianfar_ptp_registers)
> +
> +struct etsects {
> +	struct gianfar_ptp_registers *regs;
> +	struct ptp_clock *clock;
> +	struct ptp_clock_info caps;
> +	int irq;
> +	u64 alarm_interval; /* for periodic alarm */
> +	u64 alarm_value;
> +	u32 tclk_period;  /* nanoseconds */
> +	u32 tmr_prsc;
> +	u32 tmr_add;
> +	u32 cksel;
> +	u32 tmr_fiper1;
> +	u32 tmr_fiper2;
> +};
> +
> +/* Private globals */
> +static struct etsects the_clock;
> +DEFINE_SPINLOCK(register_lock);
> +
> +/*
> + * Register access functions
> + */
> +
> +static u64 tmr_cnt_read(struct etsects *etsects)
> +{
> +	u64 ns;
> +	u32 lo, hi;
> +
> +	lo = gfar_read(&etsects->regs->tmr_cnt_l);
> +	hi = gfar_read(&etsects->regs->tmr_cnt_h);
> +	ns = ((u64) hi) << 32;
> +	ns |= lo;
> +	return ns;
> +}
> +
> +static void tmr_cnt_write(struct etsects *etsects, u64 ns)
> +{
> +	u32 hi = ns >> 32;
> +	u32 lo = ns & 0xffffffff;
> +
> +	gfar_write(&etsects->regs->tmr_cnt_l, lo);
> +	gfar_write(&etsects->regs->tmr_cnt_h, hi);
> +}
> +
> +static void set_alarm(struct etsects *etsects)
> +{
> +	u64 ns;
> +	u32 lo, hi;
> +
> +	ns = tmr_cnt_read(etsects) + 1500000000ULL;
> +	ns = div_u64(ns, 1000000000UL) * 1000000000ULL;
> +	ns -= etsects->tclk_period;
> +	hi = ns >> 32;
> +	lo = ns & 0xffffffff;
> +	gfar_write(&etsects->regs->tmr_alarm1_l, lo);
> +	gfar_write(&etsects->regs->tmr_alarm1_h, hi);
> +}
> +
> +static void set_fipers(struct etsects *etsects)
> +{
> +	u32 tmr_ctrl = gfar_read(&etsects->regs->tmr_ctrl);
> +
> +	gfar_write(&etsects->regs->tmr_ctrl,   tmr_ctrl & (~TE));
> +	gfar_write(&etsects->regs->tmr_prsc,   etsects->tmr_prsc);
> +	gfar_write(&etsects->regs->tmr_fiper1, etsects->tmr_fiper1);
> +	gfar_write(&etsects->regs->tmr_fiper2, etsects->tmr_fiper2);
> +	set_alarm(etsects);
> +	gfar_write(&etsects->regs->tmr_ctrl,   tmr_ctrl|TE);
> +}
> +
> +/*
> + * Interrupt service routine
> + */
> +
> +static irqreturn_t isr(int irq, void *priv)
> +{
> +	struct etsects *etsects = priv;
> +	struct ptp_clock_event event;
> +	u64 ns;
> +	u32 ack = 0, lo, hi, mask, val;
> +
> +	val = gfar_read(&etsects->regs->tmr_tevent);
> +
> +	if (val & ETS1) {
> +		ack |= ETS1;
> +		hi = gfar_read(&etsects->regs->tmr_etts1_h);
> +		lo = gfar_read(&etsects->regs->tmr_etts1_l);
> +		event.type = PTP_CLOCK_EXTTS;
> +		event.index = 0;
> +		event.timestamp = ((u64) hi) << 32;
> +		event.timestamp |= lo;
> +		ptp_clock_event(etsects->clock, &event);
> +	}
> +
> +	if (val & ETS2) {
> +		ack |= ETS2;
> +		hi = gfar_read(&etsects->regs->tmr_etts2_h);
> +		lo = gfar_read(&etsects->regs->tmr_etts2_l);
> +		event.type = PTP_CLOCK_EXTTS;
> +		event.index = 1;
> +		event.timestamp = ((u64) hi) << 32;
> +		event.timestamp |= lo;
> +		ptp_clock_event(etsects->clock, &event);
> +	}
> +
> +	if (val & ALM2) {
> +		ack |= ALM2;
> +		if (etsects->alarm_value) {
> +			event.type = PTP_CLOCK_ALARM;
> +			event.index = 0;
> +			event.timestamp = etsects->alarm_value;
> +			ptp_clock_event(etsects->clock, &event);
> +		}
> +		if (etsects->alarm_interval) {
> +			ns = etsects->alarm_value + etsects->alarm_interval;
> +			hi = ns >> 32;
> +			lo = ns & 0xffffffff;
> +			spin_lock(&register_lock);
> +			gfar_write(&etsects->regs->tmr_alarm2_l, lo);
> +			gfar_write(&etsects->regs->tmr_alarm2_h, hi);
> +			spin_unlock(&register_lock);
> +			etsects->alarm_value = ns;
> +		} else {
> +			gfar_write(&etsects->regs->tmr_tevent, ALM2);
> +			spin_lock(&register_lock);
> +			mask = gfar_read(&etsects->regs->tmr_temask);
> +			mask &= ~ALM2EN;
> +			gfar_write(&etsects->regs->tmr_temask, mask);
> +			spin_unlock(&register_lock);
> +			etsects->alarm_value = 0;
> +			etsects->alarm_interval = 0;
> +		}
> +	}
> +
> +	if (val & PP1) {
> +		ack |= PP1;
> +		event.type = PTP_CLOCK_PPS;
> +		ptp_clock_event(etsects->clock, &event);
> +	}
> +
> +	if (ack) {
> +		gfar_write(&etsects->regs->tmr_tevent, ack);
> +		return IRQ_HANDLED;
> +	} else
> +		return IRQ_NONE;
> +}
> +
> +/*
> + * PTP clock operations
> + */
> +
> +static int ptp_gianfar_adjfreq(struct ptp_clock_info *ptp, s32 ppb)
> +{
> +	u64 adj;
> +	u32 diff, tmr_add;
> +	int neg_adj = 0;
> +	struct etsects *etsects = container_of(ptp, struct etsects, caps);
> +
> +	if (ppb < 0) {
> +		neg_adj = 1;
> +		ppb = -ppb;
> +	}
> +	tmr_add = etsects->tmr_add;
> +	adj = tmr_add;
> +	adj *= ppb;
> +	diff = div_u64(adj, 1000000000ULL);
> +
> +	tmr_add = neg_adj ? tmr_add - diff : tmr_add + diff;
> +
> +	gfar_write(&etsects->regs->tmr_add, tmr_add);
> +
> +	return 0;
> +}
> +
> +static int ptp_gianfar_adjtime(struct ptp_clock_info *ptp, s64 delta)
> +{
> +	s64 now;
> +	unsigned long flags;
> +	struct etsects *etsects = container_of(ptp, struct etsects, caps);
> +
> +	spin_lock_irqsave(&register_lock, flags);
> +
> +	now = tmr_cnt_read(etsects);
> +	now += delta;
> +	tmr_cnt_write(etsects, now);
> +
> +	spin_unlock_irqrestore(&register_lock, flags);
> +
> +	set_fipers(etsects);
> +
> +	return 0;
> +}
> +
> +static int ptp_gianfar_gettime(struct ptp_clock_info *ptp, struct timespec *ts)
> +{
> +	u64 ns;
> +	u32 remainder;
> +	unsigned long flags;
> +	struct etsects *etsects = container_of(ptp, struct etsects, caps);
> +
> +	spin_lock_irqsave(&register_lock, flags);
> +
> +	ns = tmr_cnt_read(etsects);
> +
> +	spin_unlock_irqrestore(&register_lock, flags);
> +
> +	ts->tv_sec = div_u64_rem(ns, 1000000000, &remainder);
> +	ts->tv_nsec = remainder;
> +	return 0;
> +}
> +
> +static int ptp_gianfar_settime(struct ptp_clock_info *ptp,
> +			       const struct timespec *ts)
> +{
> +	u64 ns;
> +	unsigned long flags;
> +	struct etsects *etsects = container_of(ptp, struct etsects, caps);
> +
> +	ns = ts->tv_sec * 1000000000ULL;
> +	ns += ts->tv_nsec;
> +
> +	spin_lock_irqsave(&register_lock, flags);
> +
> +	tmr_cnt_write(etsects, ns);
> +	set_fipers(etsects);
> +
> +	spin_unlock_irqrestore(&register_lock, flags);
> +
> +	return 0;
> +}
> +
> +static int ptp_gianfar_enable(struct ptp_clock_info *ptp,
> +			      struct ptp_clock_request *rq, int on)
> +{
> +	struct etsects *etsects = container_of(ptp, struct etsects, caps);
> +	unsigned long flags;
> +	u32 bit, mask;
> +
> +	switch (rq->type) {
> +	case PTP_CLK_REQ_EXTTS:
> +		switch (rq->extts.index) {
> +		case 0:
> +			bit = ETS1EN;
> +			break;
> +		case 1:
> +			bit = ETS2EN;
> +			break;
> +		default:
> +			return -EINVAL;
> +		}
> +		spin_lock_irqsave(&register_lock, flags);
> +		mask = gfar_read(&etsects->regs->tmr_temask);
> +		if (on)
> +			mask |= bit;
> +		else
> +			mask &= ~bit;
> +		gfar_write(&etsects->regs->tmr_temask, mask);
> +		spin_unlock_irqrestore(&register_lock, flags);
> +		return 0;
> +
> +	case PTP_CLK_REQ_PPS:
> +		spin_lock_irqsave(&register_lock, flags);
> +		mask = gfar_read(&etsects->regs->tmr_temask);
> +		if (on)
> +			mask |= PP1EN;
> +		else
> +			mask &= ~PP1EN;
> +		gfar_write(&etsects->regs->tmr_temask, mask);
> +		spin_unlock_irqrestore(&register_lock, flags);
> +		return 0;
> +
> +	default:
> +		break;
> +	}
> +
> +	return -EOPNOTSUPP;
> +}
> +
> +static struct ptp_clock_info ptp_gianfar_caps = {
> +	.owner		= THIS_MODULE,
> +	.name		= "gianfar clock",
> +	.max_adj	= 512000,
> +	.n_alarm	= N_ALARM,
> +	.n_ext_ts	= N_EXT_TS,
> +	.n_per_out	= 0,
> +	.pps		= 1,
> +	.adjfreq	= ptp_gianfar_adjfreq,
> +	.adjtime	= ptp_gianfar_adjtime,
> +	.gettime	= ptp_gianfar_gettime,
> +	.settime	= ptp_gianfar_settime,
> +	.enable		= ptp_gianfar_enable,
> +};
> +
> +/* OF device tree */
> +
> +static int get_of_u32(struct device_node *node, char *str, u32 *val)
> +{
> +	int plen;
> +	const u32 *prop = of_get_property(node, str, &plen);
> +
> +	if (!prop || plen != sizeof(*prop))
> +		return -1;
> +	*val = *prop;
> +	return 0;
> +}
> +
> +static int gianfar_ptp_probe(struct platform_device *dev,
> +			     const struct of_device_id *match)
> +{
> +	struct device_node *node = dev->dev.of_node;
> +	struct etsects *etsects = &the_clock;
> +	struct timespec now;
> +	u32 tmr_ctrl;
> +
> +	etsects->caps = ptp_gianfar_caps;
> +
> +	if (get_of_u32(node, "tclk-period", &etsects->tclk_period) ||
> +	    get_of_u32(node, "tmr-prsc", &etsects->tmr_prsc) ||
> +	    get_of_u32(node, "tmr-add", &etsects->tmr_add) ||
> +	    get_of_u32(node, "cksel", &etsects->cksel) ||
> +	    get_of_u32(node, "tmr-fiper1", &etsects->tmr_fiper1) ||
> +	    get_of_u32(node, "tmr-fiper2", &etsects->tmr_fiper2) ||
> +	    get_of_u32(node, "max-adj", &etsects->caps.max_adj)) {
> +		pr_err("device tree node missing required elements\n");
> +		return -ENODEV;
> +	}
> +
> +	etsects->irq = irq_of_parse_and_map(node, 0);
> +
> +	if (etsects->irq == NO_IRQ) {
> +		pr_err("irq not in device tree\n");
> +		return -ENODEV;
> +	}
> +	if (request_irq(etsects->irq, isr, 0, DRIVER, etsects)) {
> +		pr_err("request_irq failed\n");
> +		return -ENODEV;
> +	}
> +	etsects->regs = of_iomap(node, 0);
> +	if (!etsects->regs) {
> +		pr_err("of_iomap ptp registers failed\n");
> +		return -EINVAL;
> +	}
> +	getnstimeofday(&now);
> +	ptp_gianfar_settime(&etsects->caps, &now);
> +
> +	tmr_ctrl =
> +	  (etsects->tclk_period & TCLK_PERIOD_MASK) << TCLK_PERIOD_SHIFT |
> +	  (etsects->cksel & CKSEL_MASK) << CKSEL_SHIFT;
> +
> +	gfar_write(&etsects->regs->tmr_ctrl,   tmr_ctrl);
> +	gfar_write(&etsects->regs->tmr_add,    etsects->tmr_add);
> +	gfar_write(&etsects->regs->tmr_prsc,   etsects->tmr_prsc);
> +	gfar_write(&etsects->regs->tmr_fiper1, etsects->tmr_fiper1);
> +	gfar_write(&etsects->regs->tmr_fiper2, etsects->tmr_fiper2);
> +	set_alarm(etsects);
> +	gfar_write(&etsects->regs->tmr_ctrl,   tmr_ctrl|FS|RTPE|TE);
> +
> +	etsects->clock = ptp_clock_register(&etsects->caps);
> +
> +	return IS_ERR(etsects->clock) ? PTR_ERR(etsects->clock) : 0;
> +}
> +
> +static int gianfar_ptp_remove(struct platform_device *dev)
> +{
> +	gfar_write(&the_clock.regs->tmr_temask, 0);
> +	gfar_write(&the_clock.regs->tmr_ctrl,   0);
> +
> +	ptp_clock_unregister(the_clock.clock);
> +	free_irq(the_clock.irq, &the_clock);
> +	iounmap(the_clock.regs);
> +
> +	return 0;
> +}
> +
> +static struct of_device_id match_table[] = {
> +	{ .compatible = "fsl,etsec-ptp" },
> +	{},
> +};
> +
> +static struct of_platform_driver gianfar_ptp_driver = {
> +	.driver = {
> +		.name		= "gianfar_ptp",
> +		.of_match_table	= match_table,
> +		.owner		= THIS_MODULE,
> +	},
> +	.probe       = gianfar_ptp_probe,
> +	.remove      = gianfar_ptp_remove,
> +};
> +
> +/* module operations */
> +
> +static int __init ptp_gianfar_init(void)
> +{
> +	return of_register_platform_driver(&gianfar_ptp_driver);
> +}
> +
> +module_init(ptp_gianfar_init);
> +
> +static void __exit ptp_gianfar_exit(void)
> +{
> +	of_unregister_platform_driver(&gianfar_ptp_driver);
> +}
> +
> +module_exit(ptp_gianfar_exit);
> +
> +MODULE_AUTHOR("Richard Cochran <richard.cochran-3mrvs1K0uXizZXS1Dc/lvw@public.gmane.org>");
> +MODULE_DESCRIPTION("PTP clock using the eTSEC");
> +MODULE_LICENSE("GPL");
> diff --git a/drivers/net/gianfar_ptp_reg.h b/drivers/net/gianfar_ptp_reg.h
> new file mode 100644
> index 0000000..95e171f
> --- /dev/null
> +++ b/drivers/net/gianfar_ptp_reg.h
> @@ -0,0 +1,113 @@
> +/* gianfar_ptp_reg.h
> + * Generated by regen.tcl on Thu May 13 01:38:57 PM CEST 2010
> + *
> + * PTP 1588 clock using the gianfar eTSEC
> + *
> + * Copyright (C) 2010 OMICRON electronics GmbH
> + *
> + *  This program is free software; you can redistribute it and/or modify
> + *  it under the terms of the GNU General Public License as published by
> + *  the Free Software Foundation; either version 2 of the License, or
> + *  (at your option) any later version.
> + *
> + *  This program is distributed in the hope that it will be useful,
> + *  but WITHOUT ANY WARRANTY; without even the implied warranty of
> + *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + *  GNU General Public License for more details.
> + *
> + *  You should have received a copy of the GNU General Public License
> + *  along with this program; if not, write to the Free Software
> + *  Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
> + */
> +#ifndef _GIANFAR_PTP_REG_H_
> +#define _GIANFAR_PTP_REG_H_
> +
> +struct gianfar_ptp_registers {
> +	u32 tmr_ctrl;     /* Timer control register */
> +	u32 tmr_tevent;   /* Timestamp event register */
> +	u32 tmr_temask;   /* Timer event mask register */
> +	u32 tmr_pevent;   /* Timestamp event register */
> +	u32 tmr_pemask;   /* Timer event mask register */
> +	u32 tmr_stat;     /* Timestamp status register */
> +	u32 tmr_cnt_h;    /* Timer counter high register */
> +	u32 tmr_cnt_l;    /* Timer counter low register */
> +	u32 tmr_add;      /* Timer drift compensation addend register */
> +	u32 tmr_acc;      /* Timer accumulator register */
> +	u32 tmr_prsc;     /* Timer prescale */
> +	u8  res1[4];
> +	u32 tmroff_h;     /* Timer offset high */
> +	u32 tmroff_l;     /* Timer offset low */
> +	u8  res2[8];
> +	u32 tmr_alarm1_h; /* Timer alarm 1 high register */
> +	u32 tmr_alarm1_l; /* Timer alarm 1 high register */
> +	u32 tmr_alarm2_h; /* Timer alarm 2 high register */
> +	u32 tmr_alarm2_l; /* Timer alarm 2 high register */
> +	u8  res3[48];
> +	u32 tmr_fiper1;   /* Timer fixed period interval */
> +	u32 tmr_fiper2;   /* Timer fixed period interval */
> +	u32 tmr_fiper3;   /* Timer fixed period interval */
> +	u8  res4[20];
> +	u32 tmr_etts1_h;  /* Timestamp of general purpose external trigger */
> +	u32 tmr_etts1_l;  /* Timestamp of general purpose external trigger */
> +	u32 tmr_etts2_h;  /* Timestamp of general purpose external trigger */
> +	u32 tmr_etts2_l;  /* Timestamp of general purpose external trigger */
> +};
> +
> +/* Bit definitions for the TMR_CTRL register */
> +#define ALM1P                 (1<<31) /* Alarm1 output polarity */
> +#define ALM2P                 (1<<30) /* Alarm2 output polarity */
> +#define FS                    (1<<28) /* FIPER start indication */
> +#define PP1L                  (1<<27) /* Fiper1 pulse loopback mode enabled. */
> +#define PP2L                  (1<<26) /* Fiper2 pulse loopback mode enabled. */
> +#define TCLK_PERIOD_SHIFT     (16) /* 1588 timer reference clock period. */
> +#define TCLK_PERIOD_MASK      (0x3ff)
> +#define RTPE                  (1<<15) /* Record Tx Timestamp to PAL Enable. */
> +#define FRD                   (1<<14) /* FIPER Realignment Disable */
> +#define ESFDP                 (1<<11) /* External Tx/Rx SFD Polarity. */
> +#define ESFDE                 (1<<10) /* External Tx/Rx SFD Enable. */
> +#define ETEP2                 (1<<9) /* External trigger 2 edge polarity */
> +#define ETEP1                 (1<<8) /* External trigger 1 edge polarity */
> +#define COPH                  (1<<7) /* Generated clock (TSEC_1588_GCLK) output phase. */
> +#define CIPH                  (1<<6) /* External oscillator input clock phase. */
> +#define TMSR                  (1<<5) /* Timer soft reset. When enabled, it resets all the timer registers and state machines. */
> +#define BYP                   (1<<3) /* Bypass drift compensated clock */
> +#define TE                    (1<<2) /* 1588 timer enable. If not enabled, all the timer registers and state machines are disabled. */
> +#define CKSEL_SHIFT           (0) /* 1588 Timer reference clock source select. */
> +#define CKSEL_MASK            (0x3)
> +
> +/* Bit definitions for the TMR_TEVENT register */
> +#define ETS2                  (1<<25) /* External trigger 2 timestamp sampled */
> +#define ETS1                  (1<<24) /* External trigger 1 timestamp sampled */
> +#define ALM2                  (1<<17) /* Current time equaled alarm time register 2 */
> +#define ALM1                  (1<<16) /* Current time equaled alarm time register 1 */
> +#define PP1                   (1<<7) /* Indicates that a periodic pulse has been generated based on FIPER1 register */
> +#define PP2                   (1<<6) /* Indicates that a periodic pulse has been generated based on FIPER2 register */
> +#define PP3                   (1<<5) /* Indicates that a periodic pulse has been generated based on FIPER3 register */
> +
> +/* Bit definitions for the TMR_TEMASK register */
> +#define ETS2EN                (1<<25) /* External trigger 2 timestamp sample event enable */
> +#define ETS1EN                (1<<24) /* External trigger 1 timestamp sample event enable */
> +#define ALM2EN                (1<<17) /* Timer ALM2 event enable */
> +#define ALM1EN                (1<<16) /* Timer ALM1 event enable */
> +#define PP1EN                 (1<<7) /* Periodic pulse event 1 enable */
> +#define PP2EN                 (1<<6) /* Periodic pulse event 2 enable */
> +
> +/* Bit definitions for the TMR_PEVENT register */
> +#define TXP2                  (1<<9) /* Indicates that a PTP frame has been transmitted and its timestamp is stored in TXTS2 register */
> +#define TXP1                  (1<<8) /* Indicates that a PTP frame has been transmitted and its timestamp is stored in TXTS1 register */
> +#define RXP                   (1<<0) /* Indicates that a PTP frame has been received */
> +
> +/* Bit definitions for the TMR_PEMASK register */
> +#define TXP2EN                (1<<9) /* Transmit PTP packet event 2 enable */
> +#define TXP1EN                (1<<8) /* Transmit PTP packet event 1 enable */
> +#define RXPEN                 (1<<0) /* Receive PTP packet event enable */
> +
> +/* Bit definitions for the TMR_STAT register */
> +#define STAT_VEC_SHIFT        (0) /* Timer general purpose status vector */
> +#define STAT_VEC_MASK         (0x3f)
> +
> +/* Bit definitions for the TMR_PRSC register */
> +#define PRSC_OCK_SHIFT        (0) /* Output clock division/prescale factor. */
> +#define PRSC_OCK_MASK         (0xffff)
> +
> +#endif
> diff --git a/drivers/ptp/Kconfig b/drivers/ptp/Kconfig
> index 17be208..a4df298 100644
> --- a/drivers/ptp/Kconfig
> +++ b/drivers/ptp/Kconfig
> @@ -24,4 +24,17 @@ config PTP_1588_CLOCK
>  	  To compile this driver as a module, choose M here: the module
>  	  will be called ptp.
> 
> +config PTP_1588_CLOCK_GIANFAR
> +	tristate "Freescale eTSEC as PTP clock"
> +	depends on PTP_1588_CLOCK
> +	depends on GIANFAR
> +	help
> +	  This driver adds support for using the eTSEC as a PTP
> +	  clock. This clock is only useful if your PTP programs are
> +	  getting hardware time stamps on the PTP Ethernet packets
> +	  using the SO_TIMESTAMPING API.
> +
> +	  To compile this driver as a module, choose M here: the module
> +	  will be called gianfar_ptp.
> +
>  endmenu

^ permalink raw reply

* Re: [PATCH V10 12/15] ptp: Added a brand new class driver for ptp clocks.
From: John Stultz @ 2011-02-02  2:00 UTC (permalink / raw)
  To: Richard Cochran
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
	Alan Cox, Arnd Bergmann, Christoph Lameter, David Miller,
	Krzysztof Halasa, Peter Zijlstra, Rodolfo Giometti,
	Thomas Gleixner, Benjamin Herrenschmidt, H. Peter Anvin,
	Ingo Molnar, Mike Frysinger, Paul Mackerras, Russell King
In-Reply-To: <ada07b1d09b1e8eb717c5203e64856d2eaad7456.1296124770.git.richard.cochran-3mrvs1K0uXizZXS1Dc/lvw@public.gmane.org>

On Thu, 2011-01-27 at 11:59 +0100, Richard Cochran wrote:
> This patch adds an infrastructure for hardware clocks that implement
> IEEE 1588, the Precision Time Protocol (PTP). A class driver offers a
> registration method to particular hardware clock drivers. Each clock is
> presented as a standard POSIX clock.
> 
> The ancillary clock features are exposed in two different ways, via
> the sysfs and by a character device.
> 
> Signed-off-by: Richard Cochran <richard.cochran-3mrvs1K0uXizZXS1Dc/lvw@public.gmane.org>

Few issues below.


> diff --git a/drivers/ptp/Kconfig b/drivers/ptp/Kconfig
> new file mode 100644
> index 0000000..17be208
> --- /dev/null
> +++ b/drivers/ptp/Kconfig
> @@ -0,0 +1,27 @@
> +#
> +# PTP clock support configuration
> +#
> +
> +menu "PTP clock support"
> +
> +config PTP_1588_CLOCK
> +	tristate "PTP clock support"
> +	depends on EXPERIMENTAL
> +	depends on PPS
> +	help
> +	  The IEEE 1588 standard defines a method to precisely
> +	  synchronize distributed clocks over Ethernet networks. The
> +	  standard defines a Precision Time Protocol (PTP), which can
> +	  be used to achieve synchronization within a few dozen
> +	  microseconds. In addition, with the help of special hardware
> +	  time stamping units, it can be possible to achieve
> +	  synchronization to within a few hundred nanoseconds.
> +
> +	  This driver adds support for PTP clocks as character
> +	  devices. If you want to use a PTP clock, then you should
> +	  also enable at least one clock driver as well.
> +
> +	  To compile this driver as a module, choose M here: the module
> +	  will be called ptp.
> +
> +endmenu

You may want to tweak the kconfig a bit here. If I don't have pps
enabled, if I go into the "PTP clock support" page, I get an empty
screen.

Similarly, its not very discoverable to figure out what you need to
enable to get the driver options to show up, as they depend the drivers
enabled in the networking device section.




> diff --git a/drivers/ptp/ptp_clock.c b/drivers/ptp/ptp_clock.c
> new file mode 100644
> index 0000000..cd2b5e5
> --- /dev/null
> +++ b/drivers/ptp/ptp_clock.c
> @@ -0,0 +1,315 @@
> +/*
> + * PTP 1588 clock support
> + *
> + * Copyright (C) 2010 OMICRON electronics GmbH
> + *
> + *  This program is free software; you can redistribute it and/or modify
> + *  it under the terms of the GNU General Public License as published by
> + *  the Free Software Foundation; either version 2 of the License, or
> + *  (at your option) any later version.
> + *
> + *  This program is distributed in the hope that it will be useful,
> + *  but WITHOUT ANY WARRANTY; without even the implied warranty of
> + *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + *  GNU General Public License for more details.
> + *
> + *  You should have received a copy of the GNU General Public License
> + *  along with this program; if not, write to the Free Software
> + *  Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
> + */
> +#include <linux/bitops.h>
> +#include <linux/device.h>
> +#include <linux/err.h>
> +#include <linux/init.h>
> +#include <linux/kernel.h>
> +#include <linux/module.h>
> +#include <linux/posix-clock.h>
> +#include <linux/pps_kernel.h>
> +#include <linux/slab.h>
> +#include <linux/syscalls.h>
> +#include <linux/uaccess.h>
> +
> +#include "ptp_private.h"
> +
> +#define PTP_MAX_ALARMS 4
> +#define PTP_MAX_CLOCKS (MAX_CLOCKS/2)

Why MAX_CLOCKS/2 ? Should this scale as MAX_CLOCKS is increased?
Or do you really just mean 8?

> +#define PTP_PPS_DEFAULTS (PPS_CAPTUREASSERT | PPS_OFFSETASSERT)
> +#define PTP_PPS_EVENT PPS_CAPTUREASSERT
> +#define PTP_PPS_MODE (PTP_PPS_DEFAULTS | PPS_CANWAIT | PPS_TSFMT_TSPEC)
> +
> +/* private globals */
> +
> +static dev_t ptp_devt;
> +static struct class *ptp_class;
> +
> +static DECLARE_BITMAP(ptp_clocks_map, PTP_MAX_CLOCKS);
> +static DEFINE_MUTEX(ptp_clocks_mutex); /* protects 'ptp_clocks_map' */
> +
> +/* time stamp event queue operations */
> +
> +static inline int queue_free(struct timestamp_event_queue *q)
> +{
> +	return PTP_MAX_TIMESTAMPS - queue_cnt(q) - 1;
> +}
> +
> +static void enqueue_external_timestamp(struct timestamp_event_queue *queue,
> +				       struct ptp_clock_event *src)
> +{
> +	struct ptp_extts_event *dst;
> +	u32 remainder;
> +
> +	dst = &queue->buf[queue->tail];
> +
> +	dst->index = src->index;
> +	dst->t.sec = div_u64_rem(src->timestamp, 1000000000, &remainder);
> +	dst->t.nsec = remainder;
> +
> +	if (!queue_free(queue))
> +		queue->overflow++;
> +
> +	queue->tail = (queue->tail + 1) % PTP_MAX_TIMESTAMPS;
> +}

So what is serializing access to the timestamp_event_queue here? I don't
see any usage of tsevq_mux by the callers. Am I missing it? It looks
like its called from interrupt context, so do you really need a spinlock
and not a mutex here?


> diff --git a/drivers/ptp/ptp_private.h b/drivers/ptp/ptp_private.h
> new file mode 100644
> index 0000000..941de61
> --- /dev/null
> +++ b/drivers/ptp/ptp_private.h
> @@ -0,0 +1,85 @@
> +/*
> + * PTP 1588 clock support - private declarations for the core module.
> + *
> + * Copyright (C) 2010 OMICRON electronics GmbH
> + *
> + *  This program is free software; you can redistribute it and/or modify
> + *  it under the terms of the GNU General Public License as published by
> + *  the Free Software Foundation; either version 2 of the License, or
> + *  (at your option) any later version.
> + *
> + *  This program is distributed in the hope that it will be useful,
> + *  but WITHOUT ANY WARRANTY; without even the implied warranty of
> + *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + *  GNU General Public License for more details.
> + *
> + *  You should have received a copy of the GNU General Public License
> + *  along with this program; if not, write to the Free Software
> + *  Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
> + */
> +#ifndef _PTP_PRIVATE_H_
> +#define _PTP_PRIVATE_H_
> +
> +#include <linux/cdev.h>
> +#include <linux/device.h>
> +#include <linux/mutex.h>
> +#include <linux/posix-clock.h>
> +#include <linux/ptp_clock.h>
> +#include <linux/ptp_clock_kernel.h>
> +#include <linux/time.h>
> +
> +#define PTP_MAX_TIMESTAMPS 128
> +
> +struct timestamp_event_queue {
> +	struct ptp_extts_event buf[PTP_MAX_TIMESTAMPS];
> +	int head;
> +	int tail;
> +	int overflow;
> +};
> +
> +struct ptp_clock {
> +	struct posix_clock clock;
> +	struct device *dev;
> +	struct ptp_clock_info *info;
> +	dev_t devid;
> +	int index; /* index into clocks.map */
> +	struct pps_device *pps_source;
> +	struct timestamp_event_queue tsevq; /* simple fifo for time stamps */
> +	struct mutex tsevq_mux; /* one process at a time reading the fifo */
> +	wait_queue_head_t tsev_wq;
> +};
> +
> +static inline int queue_cnt(struct timestamp_event_queue *q)
> +{
> +	int cnt = q->tail - q->head;
> +	return cnt < 0 ? PTP_MAX_TIMESTAMPS + cnt : cnt;
> +}

This probably needs a comment as to its locking rules. Something like
"Callers must hold tsevq_mux."



> diff --git a/include/linux/ptp_clock.h b/include/linux/ptp_clock.h
> new file mode 100644
> index 0000000..b4ef2cb
> --- /dev/null
> +++ b/include/linux/ptp_clock.h
> @@ -0,0 +1,79 @@
> +/*
> + * PTP 1588 clock support - user space interface
> + *
> + * Copyright (C) 2010 OMICRON electronics GmbH
> + *
> + *  This program is free software; you can redistribute it and/or modify
> + *  it under the terms of the GNU General Public License as published by
> + *  the Free Software Foundation; either version 2 of the License, or
> + *  (at your option) any later version.
> + *
> + *  This program is distributed in the hope that it will be useful,
> + *  but WITHOUT ANY WARRANTY; without even the implied warranty of
> + *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + *  GNU General Public License for more details.
> + *
> + *  You should have received a copy of the GNU General Public License
> + *  along with this program; if not, write to the Free Software
> + *  Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
> + */
> +
> +#ifndef _PTP_CLOCK_H_
> +#define _PTP_CLOCK_H_
> +
> +#include <linux/ioctl.h>
> +#include <linux/types.h>
> +
> +/* PTP_xxx bits, for the flags field within the request structures. */
> +#define PTP_ENABLE_FEATURE (1<<0)
> +#define PTP_RISING_EDGE    (1<<1)
> +#define PTP_FALLING_EDGE   (1<<2)
> +
> +/*
> + * struct ptp_clock_time - represents a time value
> + *
> + * The sign of the seconds field applies to the whole value. The
> + * nanoseconds field is always unsigned. The reserved field is
> + * included for sub-nanosecond resolution, should the demand for
> + * this ever appear.
> + *
> + */
> +struct ptp_clock_time {
> +	__s64 sec;  /* seconds */
> +	__u32 nsec; /* nanoseconds */
> +	__u32 reserved;
> +};
> +
> +struct ptp_clock_caps {
> +	int max_adj;   /* Maximum frequency adjustment in parts per billon. */
> +	int n_alarm;   /* Number of programmable alarms. */
> +	int n_ext_ts;  /* Number of external time stamp channels. */
> +	int n_per_out; /* Number of programmable periodic signals. */
> +	int pps;       /* Whether the clock supports a PPS callback. */
> +};
> +
> +struct ptp_extts_request {
> +	unsigned int index; /* Which channel to configure. */
> +	unsigned int flags; /* Bit field for PTP_xxx flags. */
> +};
> +
> +struct ptp_perout_request {
> +	struct ptp_clock_time start;  /* Absolute start time. */
> +	struct ptp_clock_time period; /* Desired period, zero means disable. */
> +	unsigned int index;           /* Which channel to configure. */
> +	unsigned int flags;           /* Reserved for future use. */
> +};

Since these are all new API/ABI structures, would it be wise to pad
these out a bit more? You've got a couple of reserved fields, which is
good, but if you think we're going to expand this at all, we may want to
have a bit more wiggle room. The timex structure had something like 12
unused ints (which came in handy when the tai field was added).

Not sure what the wider opinion is on this though.


> +#define PTP_CLK_MAGIC '='
> +
> +#define PTP_CLOCK_GETCAPS  _IOR(PTP_CLK_MAGIC, 1, struct ptp_clock_caps)
> +#define PTP_EXTTS_REQUEST  _IOW(PTP_CLK_MAGIC, 2, struct ptp_extts_request)
> +#define PTP_PEROUT_REQUEST _IOW(PTP_CLK_MAGIC, 3, struct ptp_perout_request)
> +#define PTP_ENABLE_PPS     _IOW(PTP_CLK_MAGIC, 4, int)
> +
> +struct ptp_extts_event {
> +	struct ptp_clock_time t; /* Time event occured. */
> +	unsigned int index;      /* Which channel produced the event. */
> +};

Same padding suggestion for this as well.


thanks
-john

^ permalink raw reply

* Re: [PATCH] ipv4: Remove fib_hash.
From: Eric Dumazet @ 2011-02-02  0:48 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20110201.151941.48509941.davem@davemloft.net>

Le mardi 01 février 2011 à 15:19 -0800, David Miller a écrit :
> The time has finally come to remove the hash based routing table
> implementation in ipv4.
> 
> FIB Trie is mature, well tested, and I've done an audit of it's code
> to confirm that it implements insert, delete, and lookup with the same
> identical semantics as fib_hash did.
> 
> If there are any semantic differences found in fib_trie, we should
> simply fix them.
> 
> I've placed the trie statistic config option under advanced router
> configuration.
> 
> Signed-off-by: David S. Miller <davem@davemloft.net>
> ---

Hmm... I know having to maintain two implementations is time consuming,
but I know fib_trie is bigger :

# size net/ipv4/fib_*.o
   text	   data	    bss	    dec	    hex	filename
   7252	    120	      0	   7372	   1ccc	net/ipv4/fib_frontend.o
   7279	     16	      4	   7299	   1c83	net/ipv4/fib_hash.o
   1479	      0	      0	   1479	    5c7	net/ipv4/fib_rules.o
   7885	      0	   2080	   9965	   26ed	net/ipv4/fib_semantics.o
  16222	     16	     16	  16254	   3f7e	net/ipv4/fib_trie.o

In my tests, I know that fib_trie is more expensive for typical routing
tables for hosts (no more than a dozen or entries), in latencies
results, mostly because of icache misses, but also dcache ones.

Thanks



^ permalink raw reply

* Re: [PATCH net-2.6 2/2] be2net: remove netif_stop_queue being called before register_netdev.
From: David Miller @ 2011-02-01 23:42 UTC (permalink / raw)
  To: ajit.khaparde; +Cc: netdev
In-Reply-To: <20110131232755.GA4691@akhaparde-VBox>

From: Ajit Khaparde <ajit.khaparde@emulex.com>
Date: Mon, 31 Jan 2011 17:27:55 -0600

> It is illegal to call netif_stop_queue before register_netdev.
> 
> Signed-off-by: Ajit Khaparde <ajit.khaparde@emulex.com>

Applied, thanks.

^ permalink raw reply

* Re: [PATCH net-2.6 1/2] be2net: fix a crash seen during insmod/rmmod test
From: David Miller @ 2011-02-01 23:41 UTC (permalink / raw)
  To: ajit.khaparde; +Cc: netdev
In-Reply-To: <20110131232704.GA4635@akhaparde-VBox>

From: Ajit Khaparde <ajit.khaparde@emulex.com>
Date: Mon, 31 Jan 2011 17:27:04 -0600

> While running insmod/rmood in a loop, an unnecessary netif_stop_queue
> causes the system to crash. Remove the netif_stop_queue call
> and netif_start_queue in the link status update path.
> 
> Signed-off-by: Ajit Khaparde <ajit.khaparde@emulex.com>

Applied.

^ permalink raw reply

* [PATCH] ipv4: Rename fib_hash_* locals in fib_semantics.c
From: David Miller @ 2011-02-01 23:38 UTC (permalink / raw)
  To: netdev


To avoid confusion with the recently deleted fib_hash.c
code, use "fib_info_hash_*" instead of plain "fib_hash_*".

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/ipv4/fib_semantics.c |   40 ++++++++++++++++++++--------------------
 1 files changed, 20 insertions(+), 20 deletions(-)

diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index b15857d..146bd82 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -49,7 +49,7 @@
 static DEFINE_SPINLOCK(fib_info_lock);
 static struct hlist_head *fib_info_hash;
 static struct hlist_head *fib_info_laddrhash;
-static unsigned int fib_hash_size;
+static unsigned int fib_info_hash_size;
 static unsigned int fib_info_cnt;
 
 #define DEVINDEX_HASHBITS 8
@@ -223,7 +223,7 @@ static inline unsigned int fib_devindex_hashfn(unsigned int val)
 
 static inline unsigned int fib_info_hashfn(const struct fib_info *fi)
 {
-	unsigned int mask = (fib_hash_size - 1);
+	unsigned int mask = (fib_info_hash_size - 1);
 	unsigned int val = fi->fib_nhs;
 
 	val ^= fi->fib_protocol;
@@ -615,14 +615,14 @@ out:
 
 static inline unsigned int fib_laddr_hashfn(__be32 val)
 {
-	unsigned int mask = (fib_hash_size - 1);
+	unsigned int mask = (fib_info_hash_size - 1);
 
 	return ((__force u32)val ^
 		((__force u32)val >> 7) ^
 		((__force u32)val >> 14)) & mask;
 }
 
-static struct hlist_head *fib_hash_alloc(int bytes)
+static struct hlist_head *fib_info_hash_alloc(int bytes)
 {
 	if (bytes <= PAGE_SIZE)
 		return kzalloc(bytes, GFP_KERNEL);
@@ -632,7 +632,7 @@ static struct hlist_head *fib_hash_alloc(int bytes)
 					 get_order(bytes));
 }
 
-static void fib_hash_free(struct hlist_head *hash, int bytes)
+static void fib_info_hash_free(struct hlist_head *hash, int bytes)
 {
 	if (!hash)
 		return;
@@ -643,18 +643,18 @@ static void fib_hash_free(struct hlist_head *hash, int bytes)
 		free_pages((unsigned long) hash, get_order(bytes));
 }
 
-static void fib_hash_move(struct hlist_head *new_info_hash,
-			  struct hlist_head *new_laddrhash,
-			  unsigned int new_size)
+static void fib_info_hash_move(struct hlist_head *new_info_hash,
+			       struct hlist_head *new_laddrhash,
+			       unsigned int new_size)
 {
 	struct hlist_head *old_info_hash, *old_laddrhash;
-	unsigned int old_size = fib_hash_size;
+	unsigned int old_size = fib_info_hash_size;
 	unsigned int i, bytes;
 
 	spin_lock_bh(&fib_info_lock);
 	old_info_hash = fib_info_hash;
 	old_laddrhash = fib_info_laddrhash;
-	fib_hash_size = new_size;
+	fib_info_hash_size = new_size;
 
 	for (i = 0; i < old_size; i++) {
 		struct hlist_head *head = &fib_info_hash[i];
@@ -695,8 +695,8 @@ static void fib_hash_move(struct hlist_head *new_info_hash,
 	spin_unlock_bh(&fib_info_lock);
 
 	bytes = old_size * sizeof(struct hlist_head *);
-	fib_hash_free(old_info_hash, bytes);
-	fib_hash_free(old_laddrhash, bytes);
+	fib_info_hash_free(old_info_hash, bytes);
+	fib_info_hash_free(old_laddrhash, bytes);
 }
 
 struct fib_info *fib_create_info(struct fib_config *cfg)
@@ -720,8 +720,8 @@ struct fib_info *fib_create_info(struct fib_config *cfg)
 #endif
 
 	err = -ENOBUFS;
-	if (fib_info_cnt >= fib_hash_size) {
-		unsigned int new_size = fib_hash_size << 1;
+	if (fib_info_cnt >= fib_info_hash_size) {
+		unsigned int new_size = fib_info_hash_size << 1;
 		struct hlist_head *new_info_hash;
 		struct hlist_head *new_laddrhash;
 		unsigned int bytes;
@@ -729,15 +729,15 @@ struct fib_info *fib_create_info(struct fib_config *cfg)
 		if (!new_size)
 			new_size = 1;
 		bytes = new_size * sizeof(struct hlist_head *);
-		new_info_hash = fib_hash_alloc(bytes);
-		new_laddrhash = fib_hash_alloc(bytes);
+		new_info_hash = fib_info_hash_alloc(bytes);
+		new_laddrhash = fib_info_hash_alloc(bytes);
 		if (!new_info_hash || !new_laddrhash) {
-			fib_hash_free(new_info_hash, bytes);
-			fib_hash_free(new_laddrhash, bytes);
+			fib_info_hash_free(new_info_hash, bytes);
+			fib_info_hash_free(new_laddrhash, bytes);
 		} else
-			fib_hash_move(new_info_hash, new_laddrhash, new_size);
+			fib_info_hash_move(new_info_hash, new_laddrhash, new_size);
 
-		if (!fib_hash_size)
+		if (!fib_info_hash_size)
 			goto failure;
 	}
 
-- 
1.7.4


^ permalink raw reply related

* [PATCH] ipv4: Update some fib_hash centric interface names.
From: David Miller @ 2011-02-01 23:38 UTC (permalink / raw)
  To: netdev


fib_hash_init() --> fib_trie_init()
fib_hash_table() --> fib_trie_table()

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 include/net/ip_fib.h    |    6 +++---
 net/ipv4/fib_frontend.c |    8 ++++----
 net/ipv4/fib_trie.c     |    5 ++---
 3 files changed, 9 insertions(+), 10 deletions(-)

diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index 819d61c..08b46b8 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -228,9 +228,9 @@ extern int fib_sync_up(struct net_device *dev);
 extern __be32  __fib_res_prefsrc(struct fib_result *res);
 extern void fib_select_multipath(const struct flowi *flp, struct fib_result *res);
 
-/* Exported by fib_{hash|trie}.c */
-extern void fib_hash_init(void);
-extern struct fib_table *fib_hash_table(u32 id);
+/* Exported by fib_trie.c */
+extern void fib_trie_init(void);
+extern struct fib_table *fib_trie_table(u32 id);
 
 static inline void fib_combine_itag(u32 *itag, struct fib_result *res)
 {
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 930768b..2a49c06 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -51,11 +51,11 @@ static int __net_init fib4_rules_init(struct net *net)
 {
 	struct fib_table *local_table, *main_table;
 
-	local_table = fib_hash_table(RT_TABLE_LOCAL);
+	local_table = fib_trie_table(RT_TABLE_LOCAL);
 	if (local_table == NULL)
 		return -ENOMEM;
 
-	main_table  = fib_hash_table(RT_TABLE_MAIN);
+	main_table  = fib_trie_table(RT_TABLE_MAIN);
 	if (main_table == NULL)
 		goto fail;
 
@@ -82,7 +82,7 @@ struct fib_table *fib_new_table(struct net *net, u32 id)
 	if (tb)
 		return tb;
 
-	tb = fib_hash_table(id);
+	tb = fib_trie_table(id);
 	if (!tb)
 		return NULL;
 	h = id & (FIB_TABLE_HASHSZ - 1);
@@ -1086,5 +1086,5 @@ void __init ip_fib_init(void)
 	register_netdevice_notifier(&fib_netdev_notifier);
 	register_inetaddr_notifier(&fib_inetaddr_notifier);
 
-	fib_hash_init();
+	fib_trie_init();
 }
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 16d589c..73cb984 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -1916,7 +1916,7 @@ int fib_table_dump(struct fib_table *tb, struct sk_buff *skb,
 	return skb->len;
 }
 
-void __init fib_hash_init(void)
+void __init fib_trie_init(void)
 {
 	fn_alias_kmem = kmem_cache_create("ip_fib_alias",
 					  sizeof(struct fib_alias),
@@ -1929,8 +1929,7 @@ void __init fib_hash_init(void)
 }
 
 
-/* Fix more generic FIB names for init later */
-struct fib_table *fib_hash_table(u32 id)
+struct fib_table *fib_trie_table(u32 id)
 {
 	struct fib_table *tb;
 	struct trie *t;
-- 
1.7.4


^ permalink raw reply related

* Re: [PATCH] ipv4: Remove fib_hash.
From: David Miller @ 2011-02-01 23:35 UTC (permalink / raw)
  To: shemminger; +Cc: netdev
In-Reply-To: <20110201153332.46a38ad8@nehalam>

From: Stephen Hemminger <shemminger@vyatta.com>
Date: Tue, 1 Feb 2011 15:33:32 -0800

> On Tue, 1 Feb 2011 15:26:03 -0800
> Stephen Hemminger <shemminger@vyatta.com> wrote:
> 
>> On Tue, 01 Feb 2011 15:19:41 -0800 (PST)
>> David Miller <davem@davemloft.net> wrote:
>> 
>> > 
>> > The time has finally come to remove the hash based routing table
>> > implementation in ipv4.
>> > 
>> > FIB Trie is mature, well tested, and I've done an audit of it's code
>> > to confirm that it implements insert, delete, and lookup with the same
>> > identical semantics as fib_hash did.
>> > 
>> > If there are any semantic differences found in fib_trie, we should
>> > simply fix them.
>> > 
>> > I've placed the trie statistic config option under advanced router
>> > configuration.
>> > 
>> > Signed-off-by: David S. Miller <davem@davemloft.net>
>> 
>> Vyatta has been shipping FIB TRIE for over 3 years with reported
>> bugs.
> 
> NO reported bugs!!

Understood. :-)

Thanks for reviewing Stephen.

^ permalink raw reply

* Re: [PATCH] ipv4: Remove fib_hash.
From: Stephen Hemminger @ 2011-02-01 23:33 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: David Miller, netdev
In-Reply-To: <20110201152603.3ce5bd26@nehalam>

On Tue, 1 Feb 2011 15:26:03 -0800
Stephen Hemminger <shemminger@vyatta.com> wrote:

> On Tue, 01 Feb 2011 15:19:41 -0800 (PST)
> David Miller <davem@davemloft.net> wrote:
> 
> > 
> > The time has finally come to remove the hash based routing table
> > implementation in ipv4.
> > 
> > FIB Trie is mature, well tested, and I've done an audit of it's code
> > to confirm that it implements insert, delete, and lookup with the same
> > identical semantics as fib_hash did.
> > 
> > If there are any semantic differences found in fib_trie, we should
> > simply fix them.
> > 
> > I've placed the trie statistic config option under advanced router
> > configuration.
> > 
> > Signed-off-by: David S. Miller <davem@davemloft.net>
> 
> Vyatta has been shipping FIB TRIE for over 3 years with reported
> bugs.

NO reported bugs!!

-- 

^ permalink raw reply

* Re: [PATCHv2 dontapply] vhost-net tx tuning
From: Shirley Ma @ 2011-02-01 23:27 UTC (permalink / raw)
  To: Sridhar Samudrala
  Cc: Michael S. Tsirkin, Steve Dobbelstein, mashirle, kvm, netdev
In-Reply-To: <1296601658.30191.46.camel@sridhar.beaverton.ibm.com>

On Tue, 2011-02-01 at 15:07 -0800, Sridhar Samudrala wrote:
> I think the counters that exceed the limits need to be reset to 0
> here.
> Otherwise we keep signaling for every buffer once we hit this
> condition. 

I will modify the patch to rerun the test to see the difference.

Shirley


^ permalink raw reply

* Re: [PATCH] ipv4: Remove fib_hash.
From: Stephen Hemminger @ 2011-02-01 23:26 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20110201.151941.48509941.davem@davemloft.net>

On Tue, 01 Feb 2011 15:19:41 -0800 (PST)
David Miller <davem@davemloft.net> wrote:

> 
> The time has finally come to remove the hash based routing table
> implementation in ipv4.
> 
> FIB Trie is mature, well tested, and I've done an audit of it's code
> to confirm that it implements insert, delete, and lookup with the same
> identical semantics as fib_hash did.
> 
> If there are any semantic differences found in fib_trie, we should
> simply fix them.
> 
> I've placed the trie statistic config option under advanced router
> configuration.
> 
> Signed-off-by: David S. Miller <davem@davemloft.net>

Vyatta has been shipping FIB TRIE for over 3 years with reported
bugs.

Acked-by: Stephen Hemminger <shemminger@vyatta.com>

-- 

^ permalink raw reply

* [PATCH] ipv4: Remove fib_hash.
From: David Miller @ 2011-02-01 23:19 UTC (permalink / raw)
  To: netdev


The time has finally come to remove the hash based routing table
implementation in ipv4.

FIB Trie is mature, well tested, and I've done an audit of it's code
to confirm that it implements insert, delete, and lookup with the same
identical semantics as fib_hash did.

If there are any semantic differences found in fib_trie, we should
simply fix them.

I've placed the trie statistic config option under advanced router
configuration.

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/ipv4/Kconfig    |   38 +--
 net/ipv4/Makefile   |    4 +-
 net/ipv4/fib_hash.c | 1061 ---------------------------------------------------
 3 files changed, 2 insertions(+), 1101 deletions(-)
 delete mode 100644 net/ipv4/fib_hash.c

diff --git a/net/ipv4/Kconfig b/net/ipv4/Kconfig
index 8949a05..cbb505b 100644
--- a/net/ipv4/Kconfig
+++ b/net/ipv4/Kconfig
@@ -55,45 +55,9 @@ config IP_ADVANCED_ROUTER
 
 	  If unsure, say N here.
 
-choice
-	prompt "Choose IP: FIB lookup algorithm (choose FIB_HASH if unsure)"
-	depends on IP_ADVANCED_ROUTER
-	default ASK_IP_FIB_HASH
-
-config ASK_IP_FIB_HASH
-	bool "FIB_HASH"
-	---help---
-	  Current FIB is very proven and good enough for most users.
-
-config IP_FIB_TRIE
-	bool "FIB_TRIE"
-	---help---
-	  Use new experimental LC-trie as FIB lookup algorithm.
-	  This improves lookup performance if you have a large
-	  number of routes.
-
-	  LC-trie is a longest matching prefix lookup algorithm which
-	  performs better than FIB_HASH for large routing tables.
-	  But, it consumes more memory and is more complex.
-
-	  LC-trie is described in:
-
-	  IP-address lookup using LC-tries. Stefan Nilsson and Gunnar Karlsson
-	  IEEE Journal on Selected Areas in Communications, 17(6):1083-1092,
-	  June 1999
-
-	  An experimental study of compression methods for dynamic tries
-	  Stefan Nilsson and Matti Tikkanen. Algorithmica, 33(1):19-33, 2002.
-	  <http://www.csc.kth.se/~snilsson/software/dyntrie2/>
-
-endchoice
-
-config IP_FIB_HASH
-	def_bool ASK_IP_FIB_HASH || !IP_ADVANCED_ROUTER
-
 config IP_FIB_TRIE_STATS
 	bool "FIB TRIE statistics"
-	depends on IP_FIB_TRIE
+	depends on IP_ADVANCED_ROUTER
 	---help---
 	  Keep track of statistics on structure of FIB TRIE table.
 	  Useful for testing and measuring TRIE performance.
diff --git a/net/ipv4/Makefile b/net/ipv4/Makefile
index 4978d22..0dc772d 100644
--- a/net/ipv4/Makefile
+++ b/net/ipv4/Makefile
@@ -10,12 +10,10 @@ obj-y     := route.o inetpeer.o protocol.o \
 	     tcp_minisocks.o tcp_cong.o \
 	     datagram.o raw.o udp.o udplite.o \
 	     arp.o icmp.o devinet.o af_inet.o  igmp.o \
-	     fib_frontend.o fib_semantics.o \
+	     fib_frontend.o fib_semantics.o fib_trie.o \
 	     inet_fragment.o
 
 obj-$(CONFIG_SYSCTL) += sysctl_net_ipv4.o
-obj-$(CONFIG_IP_FIB_HASH) += fib_hash.o
-obj-$(CONFIG_IP_FIB_TRIE) += fib_trie.o
 obj-$(CONFIG_PROC_FS) += proc.o
 obj-$(CONFIG_IP_MULTIPLE_TABLES) += fib_rules.o
 obj-$(CONFIG_IP_MROUTE) += ipmr.o
diff --git a/net/ipv4/fib_hash.c b/net/ipv4/fib_hash.c
deleted file mode 100644
index fadb602..0000000
--- a/net/ipv4/fib_hash.c
+++ /dev/null
@@ -1,1061 +0,0 @@
-/*
- * INET		An implementation of the TCP/IP protocol suite for the LINUX
- *		operating system.  INET is implemented using the  BSD Socket
- *		interface as the means of communication with the user level.
- *
- *		IPv4 FIB: lookup engine and maintenance routines.
- *
- * Authors:	Alexey Kuznetsov, <kuznet@ms2.inr.ac.ru>
- *
- *		This program is free software; you can redistribute it and/or
- *		modify it under the terms of the GNU General Public License
- *		as published by the Free Software Foundation; either version
- *		2 of the License, or (at your option) any later version.
- */
-
-#include <asm/uaccess.h>
-#include <asm/system.h>
-#include <linux/bitops.h>
-#include <linux/types.h>
-#include <linux/kernel.h>
-#include <linux/mm.h>
-#include <linux/string.h>
-#include <linux/socket.h>
-#include <linux/sockios.h>
-#include <linux/errno.h>
-#include <linux/in.h>
-#include <linux/inet.h>
-#include <linux/inetdevice.h>
-#include <linux/netdevice.h>
-#include <linux/if_arp.h>
-#include <linux/proc_fs.h>
-#include <linux/skbuff.h>
-#include <linux/netlink.h>
-#include <linux/init.h>
-#include <linux/slab.h>
-
-#include <net/net_namespace.h>
-#include <net/ip.h>
-#include <net/protocol.h>
-#include <net/route.h>
-#include <net/tcp.h>
-#include <net/sock.h>
-#include <net/ip_fib.h>
-
-#include "fib_lookup.h"
-
-static struct kmem_cache *fn_hash_kmem __read_mostly;
-static struct kmem_cache *fn_alias_kmem __read_mostly;
-
-struct fib_node {
-	struct hlist_node	fn_hash;
-	struct list_head	fn_alias;
-	__be32			fn_key;
-	struct fib_alias        fn_embedded_alias;
-};
-
-#define EMBEDDED_HASH_SIZE (L1_CACHE_BYTES / sizeof(struct hlist_head))
-
-struct fn_zone {
-	struct fn_zone __rcu	*fz_next;	/* Next not empty zone	*/
-	struct hlist_head __rcu	*fz_hash;	/* Hash table pointer	*/
-	seqlock_t		fz_lock;
-	u32			fz_hashmask;	/* (fz_divisor - 1)	*/
-
-	u8			fz_order;	/* Zone order (0..32)	*/
-	u8			fz_revorder;	/* 32 - fz_order	*/
-	__be32			fz_mask;	/* inet_make_mask(order) */
-#define FZ_MASK(fz)		((fz)->fz_mask)
-
-	struct hlist_head	fz_embedded_hash[EMBEDDED_HASH_SIZE];
-
-	int			fz_nent;	/* Number of entries	*/
-	int			fz_divisor;	/* Hash size (mask+1)	*/
-};
-
-struct fn_hash {
-	struct fn_zone		*fn_zones[33];
-	struct fn_zone __rcu	*fn_zone_list;
-};
-
-static inline u32 fn_hash(__be32 key, struct fn_zone *fz)
-{
-	u32 h = ntohl(key) >> fz->fz_revorder;
-	h ^= (h>>20);
-	h ^= (h>>10);
-	h ^= (h>>5);
-	h &= fz->fz_hashmask;
-	return h;
-}
-
-static inline __be32 fz_key(__be32 dst, struct fn_zone *fz)
-{
-	return dst & FZ_MASK(fz);
-}
-
-static unsigned int fib_hash_genid;
-
-#define FZ_MAX_DIVISOR ((PAGE_SIZE<<MAX_ORDER) / sizeof(struct hlist_head))
-
-static struct hlist_head *fz_hash_alloc(int divisor)
-{
-	unsigned long size = divisor * sizeof(struct hlist_head);
-
-	if (size <= PAGE_SIZE)
-		return kzalloc(size, GFP_KERNEL);
-
-	return (struct hlist_head *)
-		__get_free_pages(GFP_KERNEL | __GFP_ZERO, get_order(size));
-}
-
-/* The fib hash lock must be held when this is called. */
-static inline void fn_rebuild_zone(struct fn_zone *fz,
-				   struct hlist_head *old_ht,
-				   int old_divisor)
-{
-	int i;
-
-	for (i = 0; i < old_divisor; i++) {
-		struct hlist_node *node, *n;
-		struct fib_node *f;
-
-		hlist_for_each_entry_safe(f, node, n, &old_ht[i], fn_hash) {
-			struct hlist_head *new_head;
-
-			hlist_del_rcu(&f->fn_hash);
-
-			new_head = rcu_dereference_protected(fz->fz_hash, 1) +
-				   fn_hash(f->fn_key, fz);
-			hlist_add_head_rcu(&f->fn_hash, new_head);
-		}
-	}
-}
-
-static void fz_hash_free(struct hlist_head *hash, int divisor)
-{
-	unsigned long size = divisor * sizeof(struct hlist_head);
-
-	if (size <= PAGE_SIZE)
-		kfree(hash);
-	else
-		free_pages((unsigned long)hash, get_order(size));
-}
-
-static void fn_rehash_zone(struct fn_zone *fz)
-{
-	struct hlist_head *ht, *old_ht;
-	int old_divisor, new_divisor;
-	u32 new_hashmask;
-
-	new_divisor = old_divisor = fz->fz_divisor;
-
-	switch (old_divisor) {
-	case EMBEDDED_HASH_SIZE:
-		new_divisor *= EMBEDDED_HASH_SIZE;
-		break;
-	case EMBEDDED_HASH_SIZE*EMBEDDED_HASH_SIZE:
-		new_divisor *= (EMBEDDED_HASH_SIZE/2);
-		break;
-	default:
-		if ((old_divisor << 1) > FZ_MAX_DIVISOR) {
-			printk(KERN_CRIT "route.c: bad divisor %d!\n", old_divisor);
-			return;
-		}
-		new_divisor = (old_divisor << 1);
-		break;
-	}
-
-	new_hashmask = (new_divisor - 1);
-
-#if RT_CACHE_DEBUG >= 2
-	printk(KERN_DEBUG "fn_rehash_zone: hash for zone %d grows from %d\n",
-	       fz->fz_order, old_divisor);
-#endif
-
-	ht = fz_hash_alloc(new_divisor);
-
-	if (ht)	{
-		struct fn_zone nfz;
-
-		memcpy(&nfz, fz, sizeof(nfz));
-
-		write_seqlock_bh(&fz->fz_lock);
-		old_ht = rcu_dereference_protected(fz->fz_hash, 1);
-		RCU_INIT_POINTER(nfz.fz_hash, ht);
-		nfz.fz_hashmask = new_hashmask;
-		nfz.fz_divisor = new_divisor;
-		fn_rebuild_zone(&nfz, old_ht, old_divisor);
-		fib_hash_genid++;
-		rcu_assign_pointer(fz->fz_hash, ht);
-		fz->fz_hashmask = new_hashmask;
-		fz->fz_divisor = new_divisor;
-		write_sequnlock_bh(&fz->fz_lock);
-
-		if (old_ht != fz->fz_embedded_hash) {
-			synchronize_rcu();
-			fz_hash_free(old_ht, old_divisor);
-		}
-	}
-}
-
-static void fn_free_node_rcu(struct rcu_head *head)
-{
-	struct fib_node *f = container_of(head, struct fib_node, fn_embedded_alias.rcu);
-
-	kmem_cache_free(fn_hash_kmem, f);
-}
-
-static inline void fn_free_node(struct fib_node *f)
-{
-	call_rcu(&f->fn_embedded_alias.rcu, fn_free_node_rcu);
-}
-
-static void fn_free_alias_rcu(struct rcu_head *head)
-{
-	struct fib_alias *fa = container_of(head, struct fib_alias, rcu);
-
-	kmem_cache_free(fn_alias_kmem, fa);
-}
-
-static inline void fn_free_alias(struct fib_alias *fa, struct fib_node *f)
-{
-	fib_release_info(fa->fa_info);
-	if (fa == &f->fn_embedded_alias)
-		fa->fa_info = NULL;
-	else
-		call_rcu(&fa->rcu, fn_free_alias_rcu);
-}
-
-static struct fn_zone *
-fn_new_zone(struct fn_hash *table, int z)
-{
-	int i;
-	struct fn_zone *fz = kzalloc(sizeof(struct fn_zone), GFP_KERNEL);
-	if (!fz)
-		return NULL;
-
-	seqlock_init(&fz->fz_lock);
-	fz->fz_divisor = z ? EMBEDDED_HASH_SIZE : 1;
-	fz->fz_hashmask = fz->fz_divisor - 1;
-	RCU_INIT_POINTER(fz->fz_hash, fz->fz_embedded_hash);
-	fz->fz_order = z;
-	fz->fz_revorder = 32 - z;
-	fz->fz_mask = inet_make_mask(z);
-
-	/* Find the first not empty zone with more specific mask */
-	for (i = z + 1; i <= 32; i++)
-		if (table->fn_zones[i])
-			break;
-	if (i > 32) {
-		/* No more specific masks, we are the first. */
-		rcu_assign_pointer(fz->fz_next,
-				   rtnl_dereference(table->fn_zone_list));
-		rcu_assign_pointer(table->fn_zone_list, fz);
-	} else {
-		rcu_assign_pointer(fz->fz_next,
-				   rtnl_dereference(table->fn_zones[i]->fz_next));
-		rcu_assign_pointer(table->fn_zones[i]->fz_next, fz);
-	}
-	table->fn_zones[z] = fz;
-	fib_hash_genid++;
-	return fz;
-}
-
-int fib_table_lookup(struct fib_table *tb,
-		     const struct flowi *flp, struct fib_result *res,
-		     int fib_flags)
-{
-	int err;
-	struct fn_zone *fz;
-	struct fn_hash *t = (struct fn_hash *)tb->tb_data;
-
-	rcu_read_lock();
-	for (fz = rcu_dereference(t->fn_zone_list);
-	     fz != NULL;
-	     fz = rcu_dereference(fz->fz_next)) {
-		struct hlist_head *head;
-		struct hlist_node *node;
-		struct fib_node *f;
-		__be32 k;
-		unsigned int seq;
-
-		do {
-			seq = read_seqbegin(&fz->fz_lock);
-			k = fz_key(flp->fl4_dst, fz);
-
-			head = rcu_dereference(fz->fz_hash) + fn_hash(k, fz);
-			hlist_for_each_entry_rcu(f, node, head, fn_hash) {
-				if (f->fn_key != k)
-					continue;
-
-				err = fib_semantic_match(tb, &f->fn_alias,
-						 flp, res,
-						 fz->fz_order, fib_flags);
-				if (err <= 0)
-					goto out;
-			}
-		} while (read_seqretry(&fz->fz_lock, seq));
-	}
-	err = 1;
-out:
-	rcu_read_unlock();
-	return err;
-}
-
-/* Insert node F to FZ. */
-static inline void fib_insert_node(struct fn_zone *fz, struct fib_node *f)
-{
-	struct hlist_head *head = rtnl_dereference(fz->fz_hash) + fn_hash(f->fn_key, fz);
-
-	hlist_add_head_rcu(&f->fn_hash, head);
-}
-
-/* Return the node in FZ matching KEY. */
-static struct fib_node *fib_find_node(struct fn_zone *fz, __be32 key)
-{
-	struct hlist_head *head = rtnl_dereference(fz->fz_hash) + fn_hash(key, fz);
-	struct hlist_node *node;
-	struct fib_node *f;
-
-	hlist_for_each_entry_rcu(f, node, head, fn_hash) {
-		if (f->fn_key == key)
-			return f;
-	}
-
-	return NULL;
-}
-
-
-static struct fib_alias *fib_fast_alloc(struct fib_node *f)
-{
-	struct fib_alias *fa = &f->fn_embedded_alias;
-
-	if (fa->fa_info != NULL)
-		fa = kmem_cache_alloc(fn_alias_kmem, GFP_KERNEL);
-	return fa;
-}
-
-/* Caller must hold RTNL. */
-int fib_table_insert(struct fib_table *tb, struct fib_config *cfg)
-{
-	struct fn_hash *table = (struct fn_hash *) tb->tb_data;
-	struct fib_node *new_f = NULL;
-	struct fib_node *f;
-	struct fib_alias *fa, *new_fa;
-	struct fn_zone *fz;
-	struct fib_info *fi;
-	u8 tos = cfg->fc_tos;
-	__be32 key;
-	int err;
-
-	if (cfg->fc_dst_len > 32)
-		return -EINVAL;
-
-	fz = table->fn_zones[cfg->fc_dst_len];
-	if (!fz && !(fz = fn_new_zone(table, cfg->fc_dst_len)))
-		return -ENOBUFS;
-
-	key = 0;
-	if (cfg->fc_dst) {
-		if (cfg->fc_dst & ~FZ_MASK(fz))
-			return -EINVAL;
-		key = fz_key(cfg->fc_dst, fz);
-	}
-
-	fi = fib_create_info(cfg);
-	if (IS_ERR(fi))
-		return PTR_ERR(fi);
-
-	if (fz->fz_nent > (fz->fz_divisor<<1) &&
-	    fz->fz_divisor < FZ_MAX_DIVISOR &&
-	    (cfg->fc_dst_len == 32 ||
-	     (1 << cfg->fc_dst_len) > fz->fz_divisor))
-		fn_rehash_zone(fz);
-
-	f = fib_find_node(fz, key);
-
-	if (!f)
-		fa = NULL;
-	else
-		fa = fib_find_alias(&f->fn_alias, tos, fi->fib_priority);
-
-	/* Now fa, if non-NULL, points to the first fib alias
-	 * with the same keys [prefix,tos,priority], if such key already
-	 * exists or to the node before which we will insert new one.
-	 *
-	 * If fa is NULL, we will need to allocate a new one and
-	 * insert to the head of f.
-	 *
-	 * If f is NULL, no fib node matched the destination key
-	 * and we need to allocate a new one of those as well.
-	 */
-
-	if (fa && fa->fa_tos == tos &&
-	    fa->fa_info->fib_priority == fi->fib_priority) {
-		struct fib_alias *fa_first, *fa_match;
-
-		err = -EEXIST;
-		if (cfg->fc_nlflags & NLM_F_EXCL)
-			goto out;
-
-		/* We have 2 goals:
-		 * 1. Find exact match for type, scope, fib_info to avoid
-		 * duplicate routes
-		 * 2. Find next 'fa' (or head), NLM_F_APPEND inserts before it
-		 */
-		fa_match = NULL;
-		fa_first = fa;
-		fa = list_entry(fa->fa_list.prev, struct fib_alias, fa_list);
-		list_for_each_entry_continue(fa, &f->fn_alias, fa_list) {
-			if (fa->fa_tos != tos)
-				break;
-			if (fa->fa_info->fib_priority != fi->fib_priority)
-				break;
-			if (fa->fa_type == cfg->fc_type &&
-			    fa->fa_scope == cfg->fc_scope &&
-			    fa->fa_info == fi) {
-				fa_match = fa;
-				break;
-			}
-		}
-
-		if (cfg->fc_nlflags & NLM_F_REPLACE) {
-			u8 state;
-
-			fa = fa_first;
-			if (fa_match) {
-				if (fa == fa_match)
-					err = 0;
-				goto out;
-			}
-			err = -ENOBUFS;
-			new_fa = fib_fast_alloc(f);
-			if (new_fa == NULL)
-				goto out;
-
-			new_fa->fa_tos = fa->fa_tos;
-			new_fa->fa_info = fi;
-			new_fa->fa_type = cfg->fc_type;
-			new_fa->fa_scope = cfg->fc_scope;
-			state = fa->fa_state;
-			new_fa->fa_state = state & ~FA_S_ACCESSED;
-			fib_hash_genid++;
-			list_replace_rcu(&fa->fa_list, &new_fa->fa_list);
-
-			fn_free_alias(fa, f);
-			if (state & FA_S_ACCESSED)
-				rt_cache_flush(cfg->fc_nlinfo.nl_net, -1);
-			rtmsg_fib(RTM_NEWROUTE, key, new_fa, cfg->fc_dst_len,
-				  tb->tb_id, &cfg->fc_nlinfo, NLM_F_REPLACE);
-			return 0;
-		}
-
-		/* Error if we find a perfect match which
-		 * uses the same scope, type, and nexthop
-		 * information.
-		 */
-		if (fa_match)
-			goto out;
-
-		if (!(cfg->fc_nlflags & NLM_F_APPEND))
-			fa = fa_first;
-	}
-
-	err = -ENOENT;
-	if (!(cfg->fc_nlflags & NLM_F_CREATE))
-		goto out;
-
-	err = -ENOBUFS;
-
-	if (!f) {
-		new_f = kmem_cache_zalloc(fn_hash_kmem, GFP_KERNEL);
-		if (new_f == NULL)
-			goto out;
-
-		INIT_HLIST_NODE(&new_f->fn_hash);
-		INIT_LIST_HEAD(&new_f->fn_alias);
-		new_f->fn_key = key;
-		f = new_f;
-	}
-
-	new_fa = fib_fast_alloc(f);
-	if (new_fa == NULL)
-		goto out;
-
-	new_fa->fa_info = fi;
-	new_fa->fa_tos = tos;
-	new_fa->fa_type = cfg->fc_type;
-	new_fa->fa_scope = cfg->fc_scope;
-	new_fa->fa_state = 0;
-
-	/*
-	 * Insert new entry to the list.
-	 */
-
-	if (new_f)
-		fib_insert_node(fz, new_f);
-	list_add_tail_rcu(&new_fa->fa_list,
-		 (fa ? &fa->fa_list : &f->fn_alias));
-	fib_hash_genid++;
-
-	if (new_f)
-		fz->fz_nent++;
-	rt_cache_flush(cfg->fc_nlinfo.nl_net, -1);
-
-	rtmsg_fib(RTM_NEWROUTE, key, new_fa, cfg->fc_dst_len, tb->tb_id,
-		  &cfg->fc_nlinfo, 0);
-	return 0;
-
-out:
-	if (new_f)
-		kmem_cache_free(fn_hash_kmem, new_f);
-	fib_release_info(fi);
-	return err;
-}
-
-int fib_table_delete(struct fib_table *tb, struct fib_config *cfg)
-{
-	struct fn_hash *table = (struct fn_hash *)tb->tb_data;
-	struct fib_node *f;
-	struct fib_alias *fa, *fa_to_delete;
-	struct fn_zone *fz;
-	__be32 key;
-
-	if (cfg->fc_dst_len > 32)
-		return -EINVAL;
-
-	if ((fz  = table->fn_zones[cfg->fc_dst_len]) == NULL)
-		return -ESRCH;
-
-	key = 0;
-	if (cfg->fc_dst) {
-		if (cfg->fc_dst & ~FZ_MASK(fz))
-			return -EINVAL;
-		key = fz_key(cfg->fc_dst, fz);
-	}
-
-	f = fib_find_node(fz, key);
-
-	if (!f)
-		fa = NULL;
-	else
-		fa = fib_find_alias(&f->fn_alias, cfg->fc_tos, 0);
-	if (!fa)
-		return -ESRCH;
-
-	fa_to_delete = NULL;
-	fa = list_entry(fa->fa_list.prev, struct fib_alias, fa_list);
-	list_for_each_entry_continue(fa, &f->fn_alias, fa_list) {
-		struct fib_info *fi = fa->fa_info;
-
-		if (fa->fa_tos != cfg->fc_tos)
-			break;
-
-		if ((!cfg->fc_type ||
-		     fa->fa_type == cfg->fc_type) &&
-		    (cfg->fc_scope == RT_SCOPE_NOWHERE ||
-		     fa->fa_scope == cfg->fc_scope) &&
-		    (!cfg->fc_protocol ||
-		     fi->fib_protocol == cfg->fc_protocol) &&
-		    fib_nh_match(cfg, fi) == 0) {
-			fa_to_delete = fa;
-			break;
-		}
-	}
-
-	if (fa_to_delete) {
-		int kill_fn;
-
-		fa = fa_to_delete;
-		rtmsg_fib(RTM_DELROUTE, key, fa, cfg->fc_dst_len,
-			  tb->tb_id, &cfg->fc_nlinfo, 0);
-
-		kill_fn = 0;
-		list_del_rcu(&fa->fa_list);
-		if (list_empty(&f->fn_alias)) {
-			hlist_del_rcu(&f->fn_hash);
-			kill_fn = 1;
-		}
-		fib_hash_genid++;
-
-		if (fa->fa_state & FA_S_ACCESSED)
-			rt_cache_flush(cfg->fc_nlinfo.nl_net, -1);
-		fn_free_alias(fa, f);
-		if (kill_fn) {
-			fn_free_node(f);
-			fz->fz_nent--;
-		}
-
-		return 0;
-	}
-	return -ESRCH;
-}
-
-static int fn_flush_list(struct fn_zone *fz, int idx)
-{
-	struct hlist_head *head = rtnl_dereference(fz->fz_hash) + idx;
-	struct hlist_node *node, *n;
-	struct fib_node *f;
-	int found = 0;
-
-	hlist_for_each_entry_safe(f, node, n, head, fn_hash) {
-		struct fib_alias *fa, *fa_node;
-		int kill_f;
-
-		kill_f = 0;
-		list_for_each_entry_safe(fa, fa_node, &f->fn_alias, fa_list) {
-			struct fib_info *fi = fa->fa_info;
-
-			if (fi && (fi->fib_flags&RTNH_F_DEAD)) {
-				list_del_rcu(&fa->fa_list);
-				if (list_empty(&f->fn_alias)) {
-					hlist_del_rcu(&f->fn_hash);
-					kill_f = 1;
-				}
-				fib_hash_genid++;
-
-				fn_free_alias(fa, f);
-				found++;
-			}
-		}
-		if (kill_f) {
-			fn_free_node(f);
-			fz->fz_nent--;
-		}
-	}
-	return found;
-}
-
-/* caller must hold RTNL. */
-int fib_table_flush(struct fib_table *tb)
-{
-	struct fn_hash *table = (struct fn_hash *) tb->tb_data;
-	struct fn_zone *fz;
-	int found = 0;
-
-	for (fz = rtnl_dereference(table->fn_zone_list);
-	     fz != NULL;
-	     fz = rtnl_dereference(fz->fz_next)) {
-		int i;
-
-		for (i = fz->fz_divisor - 1; i >= 0; i--)
-			found += fn_flush_list(fz, i);
-	}
-	return found;
-}
-
-void fib_free_table(struct fib_table *tb)
-{
-	struct fn_hash *table = (struct fn_hash *) tb->tb_data;
-	struct fn_zone *fz, *next;
-
-	next = table->fn_zone_list;
-	while (next != NULL) {
-		fz = next;
-		next = fz->fz_next;
-
-		if (fz->fz_hash != fz->fz_embedded_hash)
-			fz_hash_free(fz->fz_hash, fz->fz_divisor);
-
-		kfree(fz);
-	}
-
-	kfree(tb);
-}
-
-static inline int
-fn_hash_dump_bucket(struct sk_buff *skb, struct netlink_callback *cb,
-		     struct fib_table *tb,
-		     struct fn_zone *fz,
-		     struct hlist_head *head)
-{
-	struct hlist_node *node;
-	struct fib_node *f;
-	int i, s_i;
-
-	s_i = cb->args[4];
-	i = 0;
-	hlist_for_each_entry_rcu(f, node, head, fn_hash) {
-		struct fib_alias *fa;
-
-		list_for_each_entry_rcu(fa, &f->fn_alias, fa_list) {
-			if (i < s_i)
-				goto next;
-
-			if (fib_dump_info(skb, NETLINK_CB(cb->skb).pid,
-					  cb->nlh->nlmsg_seq,
-					  RTM_NEWROUTE,
-					  tb->tb_id,
-					  fa->fa_type,
-					  fa->fa_scope,
-					  f->fn_key,
-					  fz->fz_order,
-					  fa->fa_tos,
-					  fa->fa_info,
-					  NLM_F_MULTI) < 0) {
-				cb->args[4] = i;
-				return -1;
-			}
-next:
-			i++;
-		}
-	}
-	cb->args[4] = i;
-	return skb->len;
-}
-
-static inline int
-fn_hash_dump_zone(struct sk_buff *skb, struct netlink_callback *cb,
-		   struct fib_table *tb,
-		   struct fn_zone *fz)
-{
-	int h, s_h;
-	struct hlist_head *head = rcu_dereference(fz->fz_hash);
-
-	if (head == NULL)
-		return skb->len;
-	s_h = cb->args[3];
-	for (h = s_h; h < fz->fz_divisor; h++) {
-		if (hlist_empty(head + h))
-			continue;
-		if (fn_hash_dump_bucket(skb, cb, tb, fz, head + h) < 0) {
-			cb->args[3] = h;
-			return -1;
-		}
-		memset(&cb->args[4], 0,
-		       sizeof(cb->args) - 4*sizeof(cb->args[0]));
-	}
-	cb->args[3] = h;
-	return skb->len;
-}
-
-int fib_table_dump(struct fib_table *tb, struct sk_buff *skb,
-		   struct netlink_callback *cb)
-{
-	int m = 0, s_m;
-	struct fn_zone *fz;
-	struct fn_hash *table = (struct fn_hash *)tb->tb_data;
-
-	s_m = cb->args[2];
-	rcu_read_lock();
-	for (fz = rcu_dereference(table->fn_zone_list);
-	     fz != NULL;
-	     fz = rcu_dereference(fz->fz_next), m++) {
-		if (m < s_m)
-			continue;
-		if (fn_hash_dump_zone(skb, cb, tb, fz) < 0) {
-			cb->args[2] = m;
-			rcu_read_unlock();
-			return -1;
-		}
-		memset(&cb->args[3], 0,
-		       sizeof(cb->args) - 3*sizeof(cb->args[0]));
-	}
-	rcu_read_unlock();
-	cb->args[2] = m;
-	return skb->len;
-}
-
-void __init fib_hash_init(void)
-{
-	fn_hash_kmem = kmem_cache_create("ip_fib_hash", sizeof(struct fib_node),
-					 0, SLAB_PANIC, NULL);
-
-	fn_alias_kmem = kmem_cache_create("ip_fib_alias", sizeof(struct fib_alias),
-					  0, SLAB_PANIC, NULL);
-
-}
-
-struct fib_table *fib_hash_table(u32 id)
-{
-	struct fib_table *tb;
-
-	tb = kmalloc(sizeof(struct fib_table) + sizeof(struct fn_hash),
-		     GFP_KERNEL);
-	if (tb == NULL)
-		return NULL;
-
-	tb->tb_id = id;
-	tb->tb_default = -1;
-
-	memset(tb->tb_data, 0, sizeof(struct fn_hash));
-	return tb;
-}
-
-/* ------------------------------------------------------------------------ */
-#ifdef CONFIG_PROC_FS
-
-struct fib_iter_state {
-	struct seq_net_private p;
-	struct fn_zone	*zone;
-	int		bucket;
-	struct hlist_head *hash_head;
-	struct fib_node *fn;
-	struct fib_alias *fa;
-	loff_t pos;
-	unsigned int genid;
-	int valid;
-};
-
-static struct fib_alias *fib_get_first(struct seq_file *seq)
-{
-	struct fib_iter_state *iter = seq->private;
-	struct fib_table *main_table;
-	struct fn_hash *table;
-
-	main_table = fib_get_table(seq_file_net(seq), RT_TABLE_MAIN);
-	table = (struct fn_hash *)main_table->tb_data;
-
-	iter->bucket    = 0;
-	iter->hash_head = NULL;
-	iter->fn        = NULL;
-	iter->fa        = NULL;
-	iter->pos	= 0;
-	iter->genid	= fib_hash_genid;
-	iter->valid	= 1;
-
-	for (iter->zone = rcu_dereference(table->fn_zone_list);
-	     iter->zone != NULL;
-	     iter->zone = rcu_dereference(iter->zone->fz_next)) {
-		int maxslot;
-
-		if (!iter->zone->fz_nent)
-			continue;
-
-		iter->hash_head = rcu_dereference(iter->zone->fz_hash);
-		maxslot = iter->zone->fz_divisor;
-
-		for (iter->bucket = 0; iter->bucket < maxslot;
-		     ++iter->bucket, ++iter->hash_head) {
-			struct hlist_node *node;
-			struct fib_node *fn;
-
-			hlist_for_each_entry(fn, node, iter->hash_head, fn_hash) {
-				struct fib_alias *fa;
-
-				list_for_each_entry(fa, &fn->fn_alias, fa_list) {
-					iter->fn = fn;
-					iter->fa = fa;
-					goto out;
-				}
-			}
-		}
-	}
-out:
-	return iter->fa;
-}
-
-static struct fib_alias *fib_get_next(struct seq_file *seq)
-{
-	struct fib_iter_state *iter = seq->private;
-	struct fib_node *fn;
-	struct fib_alias *fa;
-
-	/* Advance FA, if any. */
-	fn = iter->fn;
-	fa = iter->fa;
-	if (fa) {
-		BUG_ON(!fn);
-		list_for_each_entry_continue(fa, &fn->fn_alias, fa_list) {
-			iter->fa = fa;
-			goto out;
-		}
-	}
-
-	fa = iter->fa = NULL;
-
-	/* Advance FN. */
-	if (fn) {
-		struct hlist_node *node = &fn->fn_hash;
-		hlist_for_each_entry_continue(fn, node, fn_hash) {
-			iter->fn = fn;
-
-			list_for_each_entry(fa, &fn->fn_alias, fa_list) {
-				iter->fa = fa;
-				goto out;
-			}
-		}
-	}
-
-	fn = iter->fn = NULL;
-
-	/* Advance hash chain. */
-	if (!iter->zone)
-		goto out;
-
-	for (;;) {
-		struct hlist_node *node;
-		int maxslot;
-
-		maxslot = iter->zone->fz_divisor;
-
-		while (++iter->bucket < maxslot) {
-			iter->hash_head++;
-
-			hlist_for_each_entry(fn, node, iter->hash_head, fn_hash) {
-				list_for_each_entry(fa, &fn->fn_alias, fa_list) {
-					iter->fn = fn;
-					iter->fa = fa;
-					goto out;
-				}
-			}
-		}
-
-		iter->zone = rcu_dereference(iter->zone->fz_next);
-
-		if (!iter->zone)
-			goto out;
-
-		iter->bucket = 0;
-		iter->hash_head = rcu_dereference(iter->zone->fz_hash);
-
-		hlist_for_each_entry(fn, node, iter->hash_head, fn_hash) {
-			list_for_each_entry(fa, &fn->fn_alias, fa_list) {
-				iter->fn = fn;
-				iter->fa = fa;
-				goto out;
-			}
-		}
-	}
-out:
-	iter->pos++;
-	return fa;
-}
-
-static struct fib_alias *fib_get_idx(struct seq_file *seq, loff_t pos)
-{
-	struct fib_iter_state *iter = seq->private;
-	struct fib_alias *fa;
-
-	if (iter->valid && pos >= iter->pos && iter->genid == fib_hash_genid) {
-		fa   = iter->fa;
-		pos -= iter->pos;
-	} else
-		fa = fib_get_first(seq);
-
-	if (fa)
-		while (pos && (fa = fib_get_next(seq)))
-			--pos;
-	return pos ? NULL : fa;
-}
-
-static void *fib_seq_start(struct seq_file *seq, loff_t *pos)
-	__acquires(RCU)
-{
-	void *v = NULL;
-
-	rcu_read_lock();
-	if (fib_get_table(seq_file_net(seq), RT_TABLE_MAIN))
-		v = *pos ? fib_get_idx(seq, *pos - 1) : SEQ_START_TOKEN;
-	return v;
-}
-
-static void *fib_seq_next(struct seq_file *seq, void *v, loff_t *pos)
-{
-	++*pos;
-	return v == SEQ_START_TOKEN ? fib_get_first(seq) : fib_get_next(seq);
-}
-
-static void fib_seq_stop(struct seq_file *seq, void *v)
-	__releases(RCU)
-{
-	rcu_read_unlock();
-}
-
-static unsigned fib_flag_trans(int type, __be32 mask, struct fib_info *fi)
-{
-	static const unsigned type2flags[RTN_MAX + 1] = {
-		[7] = RTF_REJECT,
-		[8] = RTF_REJECT,
-	};
-	unsigned flags = type2flags[type];
-
-	if (fi && fi->fib_nh->nh_gw)
-		flags |= RTF_GATEWAY;
-	if (mask == htonl(0xFFFFFFFF))
-		flags |= RTF_HOST;
-	flags |= RTF_UP;
-	return flags;
-}
-
-/*
- *	This outputs /proc/net/route.
- *
- *	It always works in backward compatibility mode.
- *	The format of the file is not supposed to be changed.
- */
-static int fib_seq_show(struct seq_file *seq, void *v)
-{
-	struct fib_iter_state *iter;
-	int len;
-	__be32 prefix, mask;
-	unsigned flags;
-	struct fib_node *f;
-	struct fib_alias *fa;
-	struct fib_info *fi;
-
-	if (v == SEQ_START_TOKEN) {
-		seq_printf(seq, "%-127s\n", "Iface\tDestination\tGateway "
-			   "\tFlags\tRefCnt\tUse\tMetric\tMask\t\tMTU"
-			   "\tWindow\tIRTT");
-		goto out;
-	}
-
-	iter	= seq->private;
-	f	= iter->fn;
-	fa	= iter->fa;
-	fi	= fa->fa_info;
-	prefix	= f->fn_key;
-	mask	= FZ_MASK(iter->zone);
-	flags	= fib_flag_trans(fa->fa_type, mask, fi);
-	if (fi)
-		seq_printf(seq,
-			 "%s\t%08X\t%08X\t%04X\t%d\t%u\t%d\t%08X\t%d\t%u\t%u%n",
-			 fi->fib_dev ? fi->fib_dev->name : "*", prefix,
-			 fi->fib_nh->nh_gw, flags, 0, 0, fi->fib_priority,
-			 mask, (fi->fib_advmss ? fi->fib_advmss + 40 : 0),
-			 fi->fib_window,
-			 fi->fib_rtt >> 3, &len);
-	else
-		seq_printf(seq,
-			 "*\t%08X\t%08X\t%04X\t%d\t%u\t%d\t%08X\t%d\t%u\t%u%n",
-			 prefix, 0, flags, 0, 0, 0, mask, 0, 0, 0, &len);
-
-	seq_printf(seq, "%*s\n", 127 - len, "");
-out:
-	return 0;
-}
-
-static const struct seq_operations fib_seq_ops = {
-	.start  = fib_seq_start,
-	.next   = fib_seq_next,
-	.stop   = fib_seq_stop,
-	.show   = fib_seq_show,
-};
-
-static int fib_seq_open(struct inode *inode, struct file *file)
-{
-	return seq_open_net(inode, file, &fib_seq_ops,
-			    sizeof(struct fib_iter_state));
-}
-
-static const struct file_operations fib_seq_fops = {
-	.owner		= THIS_MODULE,
-	.open           = fib_seq_open,
-	.read           = seq_read,
-	.llseek         = seq_lseek,
-	.release	= seq_release_net,
-};
-
-int __net_init fib_proc_init(struct net *net)
-{
-	if (!proc_net_fops_create(net, "route", S_IRUGO, &fib_seq_fops))
-		return -ENOMEM;
-	return 0;
-}
-
-void __net_exit fib_proc_exit(struct net *net)
-{
-	proc_net_remove(net, "route");
-}
-#endif /* CONFIG_PROC_FS */
-- 
1.7.4


^ permalink raw reply related

* Re: [PATCHv2 dontapply] vhost-net tx tuning
From: Sridhar Samudrala @ 2011-02-01 23:07 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Steve Dobbelstein, mashirle, kvm, netdev
In-Reply-To: <20110201155253.GA22959@redhat.com>

On Tue, 2011-02-01 at 17:52 +0200, Michael S. Tsirkin wrote:
> OK, so thinking about it more, maybe the issue is this:
> tx becomes full. We process one request and interrupt the guest,
> then it adds one request and the queue is full again.
> 
> Maybe the following will help it stabilize?  By default with it we will
> only interrupt when we see an empty ring.
> Which is liklely too much: pls try other values
> in the middle: e.g. make bufs half the ring,
> or bytes some small value like half ring * 200, or packets some
> small value etc.
> 
> Set any one parameter to 0 to get current
> behaviour (interrupt immediately when enabled).
> 
> Warning: completely untested.
> 
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> 
> ---
> 
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index aac05bc..6769cdc 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -32,6 +32,13 @@
>   * Using this limit prevents one virtqueue from starving others. */
>  #define VHOST_NET_WEIGHT 0x80000
> 
> +int tx_bytes_coalesce = 1000000000;
> +module_param(tx_bytes_coalesce, int, 0644);
> +int tx_bufs_coalesce = 1000000000;
> +module_param(tx_bufs_coalesce, int, 0644);
> +int tx_packets_coalesce = 1000000000;
> +module_param(tx_packets_coalesce, int, 0644);
> +
>  enum {
>  	VHOST_NET_VQ_RX = 0,
>  	VHOST_NET_VQ_TX = 1,
> @@ -127,6 +134,9 @@ static void handle_tx(struct vhost_net *net)
>  	int err, wmem;
>  	size_t hdr_size;
>  	struct socket *sock;
> +	int bytes_coalesced = 0;
> +	int bufs_coalesced = 0;
> +	int packets_coalesced = 0;
> 
>  	/* TODO: check that we are running from vhost_worker? */
>  	sock = rcu_dereference_check(vq->private_data, 1);
> @@ -196,14 +206,26 @@ static void handle_tx(struct vhost_net *net)
>  		if (err != len)
>  			pr_debug("Truncated TX packet: "
>  				 " len %d != %zd\n", err, len);
> -		vhost_add_used_and_signal(&net->dev, vq, head, 0);
>  		total_len += len;
> +		packets_coalesced += 1;
> +		bytes_coalesced += len;
> +		bufs_coalesced += out;
> +		if (unlikely(packets_coalesced > tx_packets_coalesce ||
> +			     bytes_coalesced > tx_bytes_coalesce ||
> +			     bufs_coalesced > tx_bufs_coalesce))
> +			vhost_add_used_and_signal(&net->dev, vq, head, 0);

I think the counters that exceed the limits need to be reset to 0 here.
Otherwise we keep signaling for every buffer once we hit this condition.

Thanks
Sridhar

> +		else
> +			vhost_add_used(vq, head, 0);
>  		if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
>  			vhost_poll_queue(&vq->poll);
>  			break;
>  		}
>  	}
> 
> +	if (likely(packets_coalesced &&
> +		   bytes_coalesced &&
> +		   bufs_coalesced))
> +		vhost_signal(&net->dev, vq);
>  	mutex_unlock(&vq->mutex);
>  }
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply

* Re: Network performance with small packets
From: Shirley Ma @ 2011-02-01 22:59 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Sridhar Samudrala, Steve Dobbelstein, David Miller, kvm, mashirle,
	netdev
In-Reply-To: <20110201215603.GA31348@redhat.com>

On Tue, 2011-02-01 at 23:56 +0200, Michael S. Tsirkin wrote:
> There are flags for bytes, buffers and packets.
> Try playing with any one of them :)
> Just be sure to use v2.
> 
> 
> >I would like to change it to
> > half of the ring size instead for signaling. Is that OK?
> > 
> > Shirley
> > 
> > 
> 
> Sure that is why I made it a parameter so you can experiment. 

The initial test results shows that the CPUs utilization has been
reduced some, and BW has increased some with the default parameters,
like 1K message size BW goes from 2.5Gb/s about 2.8Gb/s, CPU utilization
down from 4x% to 38%, (Similar results from the patch I submitted a
while ago to reduce signaling on vhost) but far away from dropping
packet results.

I am going to change the code to use 1/2 ring size to wake the netif
queue.

Shirley


^ permalink raw reply

* Re: [PATCH] include/net/genetlink.h: Allow genlmsg_cancel to accept a NULL argument
From: David Miller @ 2011-02-01 22:54 UTC (permalink / raw)
  To: julia; +Cc: netdev, linux-kernel, paul.moore, kernel-janitors
In-Reply-To: <Pine.LNX.4.64.1101281642080.8546@pc-004.diku.dk>

From: Julia Lawall <julia@diku.dk>
Date: Fri, 28 Jan 2011 16:43:40 +0100 (CET)

> nlmsg_cancel can accept NULL as its second argument, so for similarity,
> this patch extends genlmsg_cancel to be able to accept a NULL second
> argument as well.
> 
> Signed-off-by: Julia Lawall <julia@diku.dk>

I did a scan of all of the cases where this interface is used, and
I cannot find a situation where this capability would even be useful.

The use pattern is always:

	hdr = genlmsg_put(skb, ...);
	if (!hdr)
		goto out;

	NLA_PUT_*();
	NLA_PUT_*();
	....

	return genlmsg_end(skb, hdr);

nla_put_failure:
	genlmsg_cancel(skb, hdr);
out:
	return -EWHATEVER;

Always, hdr will be non-NULL.

We have to allocate the header first, then put the netlink
attributes.

Looking over users of nlmsg_cancel(), the situation seems to
match identically.

Therefore, it seems to me that it makes more sense to remove
the NULL check from nlmsg_cancel() than to add the NULL check
to genlmsg_cancel().

Thanks.

^ permalink raw reply

* AVB support (IEEE802.1 audio/video bridging)
From: Eliot Blennerhassett @ 2011-02-01 22:47 UTC (permalink / raw)
  To: netdev

Grettings,

before I go into any details, please tell me if this is the right place
to enquire/discuss if/how linux network stacks can support the various
protocols required by AVB

To quote the introduction from wikipedia [1]
"Audio Video Bridging (AVB) is a common name for the set of standards in
development by the IEEE 802.1 Audio Video Bridging Task Group. The
charter of this organization is to "provide the specifications that will
allow time-synchronized low latency streaming services through IEEE 802
networks"."

regards

-- 
Eliot Blennerhassett
AudioScience Inc.

[1] http://en.wikipedia.org/wiki/Audio_Video_Bridging

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox