Netdev List
 help / color / mirror / Atom feed
* [PATCH] tun: use netif_receive_skb instead of netif_rx_ni
From: Qin Chuanyu @ 2014-02-11 14:25 UTC (permalink / raw)
  To: davem
  Cc: jasowang, Michael S. Tsirkin, Anthony Liguori, KVM list, netdev,
	Eric Dumazet

we could xmit directly instead of going through softirq to gain 
throughput and lantency improved.
test model: VM-Host-Host just do transmit. with vhost thread and nic
interrupt bind cpu1. netperf do throuhput test and qperf do lantency test.
Host OS: suse11sp3, Guest OS: suse11sp3

latency result(us):
packet_len 64 256 512 1460
old(UDP)   44  47  48   66
new(UDP)   38  41  42   66

old(TCP)   52  55  70  117
new(TCP)   45  48  61  114

throughput result(Gbit/s):
packet_len   64   512   1024   1460
old(UDP)   0.42  2.02   3.75   4.68
new(UDP)   0.45  2.14   3.77   5.06

TCP due to the latency, client couldn't send packet big enough
to get benefit from TSO of nic, so the result show it will send
more packet per sencond but get lower throughput.

Eric mentioned that it would has problem with cgroup, but the patch
had been sent by Herbert Xu.
patch_id f845172531fb7410c7fb7780b1a6e51ee6df7d52

Signed-off-by: Chuanyu Qin <qinchuanyu@huawei.com>
---
  drivers/net/tun.c |    4 +++-
  1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 44c4db8..90b4e58 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1184,7 +1184,9 @@ static ssize_t tun_get_user(struct tun_struct 
*tun, struct tun_file *tfile,
  	skb_probe_transport_header(skb, 0);

  	rxhash = skb_get_hash(skb);
-	netif_rx_ni(skb);
+	rcu_read_lock_bh();
+	netif_receive_skb(skb);
+	rcu_read_unlock_bh();

  	tun->dev->stats.rx_packets++;
  	tun->dev->stats.rx_bytes += len;
-- 
1.7.3.1.msysgit.0

^ permalink raw reply related

* [PATCH] net: rtnetlink: fix correct size given to memset
From: Francois-Xavier Le Bail @ 2014-02-11 14:06 UTC (permalink / raw)
  To: NETDEV, David Miller

Find by cppcheck:
[net/core/rtnetlink.c:1842]: (warning) Using size of pointer linkinfo instead of size of its data.

Signed-off-by: Francois-Xavier Le Bail <fx.lebail@yahoo.com>
---

The diagnosis of cppcheck seems relevant.

 net/core/rtnetlink.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 048dc8d..9a5dbf1 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -1839,7 +1839,7 @@ replay:
 		if (err < 0)
 			return err;
 	} else
-		memset(linkinfo, 0, sizeof(linkinfo));
+		memset(linkinfo, 0, sizeof(*linkinfo));
 
 	if (linkinfo[IFLA_INFO_KIND]) {
 		nla_strlcpy(kind, linkinfo[IFLA_INFO_KIND], sizeof(kind));

^ permalink raw reply related

* Re: [PATCH v2] net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer
From: Vlad Yasevich @ 2014-02-11 14:52 UTC (permalink / raw)
  To: Matija Glavinic Pecotic, linux-sctp@vger.kernel.org
  Cc: netdev@vger.kernel.org, Alexander Sverdlin
In-Reply-To: <52F72AFE.4060802@nsn.com>

Hi Matija

On 02/09/2014 02:15 AM, Matija Glavinic Pecotic wrote:
>
> Proposed solution:
>
> Both problems share the same root cause, and that is improper scaling
of socket
> buffer with rwnd. Solution in which sizeof(sk_buff) is taken into
concern while
> calculating rwnd is not possible due to fact that there is no linear
> relationship between amount of data blamed in increase/decrease with
IP packet
> in which payload arrived. Even in case such solution would be followed,
> complexity of the code would increase. Due to nature of current rwnd
handling,
> slow increase (in sctp_assoc_rwnd_increase) of rwnd after pressure
state is
> entered is rationale, but it gives false representation to the sender
of current
> buffer space. Furthermore, it implements additional congestion control
mechanism
> which is defined on implementation, and not on standard basis.
>
> Proposed solution simplifies whole algorithm having on mind definition
from rfc:
>
> o  Receiver Window (rwnd): This gives the sender an indication of the
space
>    available in the receiver's inbound buffer.
>
> Core of the proposed solution is given with these lines:
>
> sctp_assoc_rwnd_update:
> 	if ((asoc->base.sk->sk_rcvbuf - rx_count) > 0)
> 		asoc->rwnd = (asoc->base.sk->sk_rcvbuf - rx_count) >> 1;
> 	else
> 		asoc->rwnd = 0;
>
> We advertise to sender (half of) actual space we have. Half is in the
braces
> depending whether you would like to observe size of socket buffer as
SO_RECVBUF
> or twice the amount, i.e. size is the one visible from userspace, that is,
> from kernelspace.
> In this way sender is given with good approximation of our buffer space,
> regardless of the buffer policy - we always advertise what we have.
Proposed
> solution fixes described problems and removes necessity for rwnd
restoration
> algorithm. Finally, as proposed solution is simplification, some lines
of code,
> along with some bytes in struct sctp_association are saved.
>
> Signed-off-by: Matija Glavinic Pecotic
<matija.glavinic-pecotic.ext@nsn.com>
> Reviewed-by: Alexander Sverdlin <alexander.sverdlin@nsn.com>
>
> --- net-next.orig/net/sctp/associola.c
> +++ net-next/net/sctp/associola.c
> @@ -1367,44 +1367,35 @@ static inline bool sctp_peer_needs_updat
>  	return false;
>  }
>
> -/* Increase asoc's rwnd by len and send any window update SACK if
needed. */
> -void sctp_assoc_rwnd_increase(struct sctp_association *asoc, unsigned
int len)
> +/* Update asoc's rwnd for the approximated state in the buffer,
> + * and check whether SACK needs to be sent.
> + */
> +void sctp_assoc_rwnd_update(struct sctp_association *asoc, bool
update_peer)
>  {
> +	int rx_count;
>  	struct sctp_chunk *sack;
>  	struct timer_list *timer;
>
> -	if (asoc->rwnd_over) {
> -		if (asoc->rwnd_over >= len) {
> -			asoc->rwnd_over -= len;
> -		} else {
> -			asoc->rwnd += (len - asoc->rwnd_over);
> -			asoc->rwnd_over = 0;
> -		}
> -	} else {
> -		asoc->rwnd += len;
> -	}
> +	if (asoc->ep->rcvbuf_policy)
> +		rx_count = atomic_read(&asoc->rmem_alloc);
> +	else
> +		rx_count = atomic_read(&asoc->base.sk->sk_rmem_alloc);
>
> -	/* If we had window pressure, start recovering it
> -	 * once our rwnd had reached the accumulated pressure
> -	 * threshold.  The idea is to recover slowly, but up
> -	 * to the initial advertised window.
> -	 */
> -	if (asoc->rwnd_press && asoc->rwnd >= asoc->rwnd_press) {
> -		int change = min(asoc->pathmtu, asoc->rwnd_press);
> -		asoc->rwnd += change;
> -		asoc->rwnd_press -= change;
> -	}
> +	if ((asoc->base.sk->sk_rcvbuf - rx_count) > 0)
> +		asoc->rwnd = (asoc->base.sk->sk_rcvbuf - rx_count) >> 1;
> +	else
> +		asoc->rwnd = 0;
>
> -	pr_debug("%s: asoc:%p rwnd increased by %d to (%u, %u) - %u\n",
> -		 __func__, asoc, len, asoc->rwnd, asoc->rwnd_over,
> -		 asoc->a_rwnd);
> +	pr_debug("%s: asoc:%p rwnd=%u, rx_count=%d, sk_rcvbuf=%d\n",
> +		 __func__, asoc, asoc->rwnd, rx_count,
> +		 asoc->base.sk->sk_rcvbuf);
>
>  	/* Send a window update SACK if the rwnd has increased by at least the
>  	 * minimum of the association's PMTU and half of the receive buffer.
>  	 * The algorithm used is similar to the one described in
>  	 * Section 4.2.3.3 of RFC 1122.
>  	 */
> -	if (sctp_peer_needs_update(asoc)) {
> +	if (update_peer && sctp_peer_needs_update(asoc)) {
>  		asoc->a_rwnd = asoc->rwnd;
>
>  		pr_debug("%s: sending window update SACK- asoc:%p rwnd:%u "
> @@ -1426,45 +1417,6 @@ void sctp_assoc_rwnd_increase(struct sct
>  	}
>  }
>
> -/* Decrease asoc's rwnd by len. */
> -void sctp_assoc_rwnd_decrease(struct sctp_association *asoc, unsigned
int len)
> -{
> -	int rx_count;
> -	int over = 0;
> -
> -	if (unlikely(!asoc->rwnd || asoc->rwnd_over))
> -		pr_debug("%s: association:%p has asoc->rwnd:%u, "
> -			 "asoc->rwnd_over:%u!\n", __func__, asoc,
> -			 asoc->rwnd, asoc->rwnd_over);
> -
> -	if (asoc->ep->rcvbuf_policy)
> -		rx_count = atomic_read(&asoc->rmem_alloc);
> -	else
> -		rx_count = atomic_read(&asoc->base.sk->sk_rmem_alloc);
> -
> -	/* If we've reached or overflowed our receive buffer, announce
> -	 * a 0 rwnd if rwnd would still be positive.  Store the
> -	 * the potential pressure overflow so that the window can be restored
> -	 * back to original value.
> -	 */
> -	if (rx_count >= asoc->base.sk->sk_rcvbuf)
> -		over = 1;
> -
> -	if (asoc->rwnd >= len) {
> -		asoc->rwnd -= len;
> -		if (over) {
> -			asoc->rwnd_press += asoc->rwnd;
> -			asoc->rwnd = 0;
> -		}
> -	} else {
> -		asoc->rwnd_over = len - asoc->rwnd;
> -		asoc->rwnd = 0;
> -	}
> -
> -	pr_debug("%s: asoc:%p rwnd decreased by %d to (%u, %u, %u)\n",
> -		 __func__, asoc, len, asoc->rwnd, asoc->rwnd_over,
> -		 asoc->rwnd_press);
> -}
>
>  /* Build the bind address list for the association based on info from the
>   * local endpoint and the remote peer.
> --- net-next.orig/include/net/sctp/structs.h
> +++ net-next/include/net/sctp/structs.h
> @@ -1653,17 +1653,6 @@ struct sctp_association {
>  	/* This is the last advertised value of rwnd over a SACK chunk. */
>  	__u32 a_rwnd;
>
> -	/* Number of bytes by which the rwnd has slopped.  The rwnd is allowed
> -	 * to slop over a maximum of the association's frag_point.
> -	 */
> -	__u32 rwnd_over;
> -
> -	/* Keeps treack of rwnd pressure.  This happens when we have
> -	 * a window, but not recevie buffer (i.e small packets).  This one
> -	 * is releases slowly (1 PMTU at a time ).
> -	 */
> -	__u32 rwnd_press;
> -
>  	/* This is the sndbuf size in use for the association.
>  	 * This corresponds to the sndbuf size for the association,
>  	 * as specified in the sk->sndbuf.
> @@ -1892,8 +1881,7 @@ void sctp_assoc_update(struct sctp_assoc
>  __u32 sctp_association_get_next_tsn(struct sctp_association *);
>
>  void sctp_assoc_sync_pmtu(struct sock *, struct sctp_association *);
> -void sctp_assoc_rwnd_increase(struct sctp_association *, unsigned int);
> -void sctp_assoc_rwnd_decrease(struct sctp_association *, unsigned int);
> +void sctp_assoc_rwnd_update(struct sctp_association *, bool);
>  void sctp_assoc_set_primary(struct sctp_association *,
>  			    struct sctp_transport *);
>  void sctp_assoc_del_nonprimary_peers(struct sctp_association *,
> --- net-next.orig/net/sctp/sm_statefuns.c
> +++ net-next/net/sctp/sm_statefuns.c
> @@ -6176,7 +6176,7 @@ static int sctp_eat_data(const struct sc
>  	 * PMTU.  In cases, such as loopback, this might be a rather
>  	 * large spill over.
>  	 */
> -	if ((!chunk->data_accepted) && (!asoc->rwnd || asoc->rwnd_over ||
> +	if ((!chunk->data_accepted) && (!asoc->rwnd ||
>  	    (datalen > asoc->rwnd + asoc->frag_point))) {
>
>  		/* If this is the next TSN, consider reneging to make
> --- net-next.orig/net/sctp/socket.c
> +++ net-next/net/sctp/socket.c
> @@ -2092,12 +2092,6 @@ static int sctp_recvmsg(struct kiocb *io
>  		sctp_skb_pull(skb, copied);
>  		skb_queue_head(&sk->sk_receive_queue, skb);
>
> -		/* When only partial message is copied to the user, increase
> -		 * rwnd by that amount. If all the data in the skb is read,
> -		 * rwnd is updated when the event is freed.
> -		 */
> -		if (!sctp_ulpevent_is_notification(event))
> -			sctp_assoc_rwnd_increase(event->asoc, copied);
>  		goto out;
>  	} else if ((event->msg_flags & MSG_NOTIFICATION) ||
>  		   (event->msg_flags & MSG_EOR))
> --- net-next.orig/net/sctp/ulpevent.c
> +++ net-next/net/sctp/ulpevent.c
> @@ -989,7 +989,7 @@ static void sctp_ulpevent_receive_data(s
>  	skb = sctp_event2skb(event);
>  	/* Set the owner and charge rwnd for bytes received.  */
>  	sctp_ulpevent_set_owner(event, asoc);
> -	sctp_assoc_rwnd_decrease(asoc, skb_headlen(skb));
> +	sctp_assoc_rwnd_update(asoc, false);
>
>  	if (!skb->data_len)
>  		return;
> @@ -1035,8 +1035,9 @@ static void sctp_ulpevent_release_data(s
>  	}
>
>  done:
> -	sctp_assoc_rwnd_increase(event->asoc, len);
> -	sctp_ulpevent_release_owner(event);
> +	atomic_sub(event->rmem_len, &event->asoc->rmem_alloc);
> +	sctp_assoc_rwnd_update(event->asoc, true);
> +	sctp_association_put(event->asoc)

Can't we simply change the order of window update and release instead
of open coding it like this?


-vlad

>  }
>
>  static void sctp_ulpevent_release_frag_data(struct sctp_ulpevent *event)
> --
> To unsubscribe from this list: send the line "unsubscribe linux-sctp" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply

* [PATCH] netfilter: nf_nat_snmp_basic: fix duplicates in if/else branches
From: Francois-Xavier Le Bail @ 2014-02-11 14:49 UTC (permalink / raw)
  To: NETDEV, David Miller, Patrick McHardy, netfilter-devel, netfilter,
	coreteam

The solution was found by Patrick in 2.4 kernel sources.

Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: Francois-Xavier Le Bail <fx.lebail@yahoo.com>
---
 net/ipv4/netfilter/nf_nat_snmp_basic.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/netfilter/nf_nat_snmp_basic.c b/net/ipv4/netfilter/nf_nat_snmp_basic.c
index d551e31..7c67667 100644
--- a/net/ipv4/netfilter/nf_nat_snmp_basic.c
+++ b/net/ipv4/netfilter/nf_nat_snmp_basic.c
@@ -1198,8 +1198,8 @@ static int snmp_translate(struct nf_conn *ct,
 		map.to = NOCT1(&ct->tuplehash[!dir].tuple.dst.u3.ip);
 	} else {
 		/* DNAT replies */
-		map.from = NOCT1(&ct->tuplehash[dir].tuple.src.u3.ip);
-		map.to = NOCT1(&ct->tuplehash[!dir].tuple.dst.u3.ip);
+		map.from = NOCT1(&ct->tuplehash[!dir].tuple.src.u3.ip);
+		map.to = NOCT1(&ct->tuplehash[dir].tuple.dst.u3.ip);
 	}
 
 	if (map.from == map.to)

^ permalink raw reply related

* Re: [PATCH] tun: use netif_receive_skb instead of netif_rx_ni
From: Eric Dumazet @ 2014-02-11 15:24 UTC (permalink / raw)
  To: Qin Chuanyu
  Cc: davem, jasowang, Michael S. Tsirkin, Anthony Liguori, KVM list,
	netdev
In-Reply-To: <52FA32C5.9040601@huawei.com>

On Tue, 2014-02-11 at 22:25 +0800, Qin Chuanyu wrote:
> we could xmit directly instead of going through softirq to gain 
> throughput and lantency improved.
> test model: VM-Host-Host just do transmit. with vhost thread and nic
> interrupt bind cpu1. netperf do throuhput test and qperf do lantency test.
> Host OS: suse11sp3, Guest OS: suse11sp3
> 
> latency result(us):
> packet_len 64 256 512 1460
> old(UDP)   44  47  48   66
> new(UDP)   38  41  42   66
> 
> old(TCP)   52  55  70  117
> new(TCP)   45  48  61  114
> 
> throughput result(Gbit/s):
> packet_len   64   512   1024   1460
> old(UDP)   0.42  2.02   3.75   4.68
> new(UDP)   0.45  2.14   3.77   5.06
> 
> TCP due to the latency, client couldn't send packet big enough
> to get benefit from TSO of nic, so the result show it will send
> more packet per sencond but get lower throughput.
> 
> Eric mentioned that it would has problem with cgroup, but the patch
> had been sent by Herbert Xu.
> patch_id f845172531fb7410c7fb7780b1a6e51ee6df7d52
> 
> Signed-off-by: Chuanyu Qin <qinchuanyu@huawei.com>
> ---
>   drivers/net/tun.c |    4 +++-
>   1 files changed, 3 insertions(+), 1 deletions(-)
> 
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index 44c4db8..90b4e58 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -1184,7 +1184,9 @@ static ssize_t tun_get_user(struct tun_struct 
> *tun, struct tun_file *tfile,
>   	skb_probe_transport_header(skb, 0);
> 
>   	rxhash = skb_get_hash(skb);
> -	netif_rx_ni(skb);
> +	rcu_read_lock_bh();
> +	netif_receive_skb(skb);
> +	rcu_read_unlock_bh();
> 
>   	tun->dev->stats.rx_packets++;
>   	tun->dev->stats.rx_bytes += len;

I already said this patch is not good :

rcu_read_lock_bh() makes no sense here.

What is really needed is local_bh_disable();

Herbert patch ( http://patchwork.ozlabs.org/patch/52963/ ) had a much
cleaner form.

Just use it, CC him, credit him, please ?




^ permalink raw reply

* Re: [PATCH] net: rtnetlink: fix correct size given to memset
From: Eric Dumazet @ 2014-02-11 15:28 UTC (permalink / raw)
  To: Francois-Xavier Le Bail; +Cc: NETDEV, David Miller
In-Reply-To: <1392127598-3719-1-git-send-email-fx.lebail@yahoo.com>

On Tue, 2014-02-11 at 15:06 +0100, Francois-Xavier Le Bail wrote:
> Find by cppcheck:
> [net/core/rtnetlink.c:1842]: (warning) Using size of pointer linkinfo instead of size of its data.
> 
> Signed-off-by: Francois-Xavier Le Bail <fx.lebail@yahoo.com>
> ---
> 
> The diagnosis of cppcheck seems relevant.
> 
>  net/core/rtnetlink.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
> index 048dc8d..9a5dbf1 100644
> --- a/net/core/rtnetlink.c
> +++ b/net/core/rtnetlink.c
> @@ -1839,7 +1839,7 @@ replay:
>  		if (err < 0)
>  			return err;
>  	} else
> -		memset(linkinfo, 0, sizeof(linkinfo));
> +		memset(linkinfo, 0, sizeof(*linkinfo));
>  
>  	if (linkinfo[IFLA_INFO_KIND]) {
>  		nla_strlcpy(kind, linkinfo[IFLA_INFO_KIND], sizeof(kind));


This is absolutely wrong.

	struct nlattr *linkinfo[IFLA_INFO_MAX+1];

sizeof(linkinfo) is totally appropriate 

sizeof(*linkinfo) is equal to sizeof(void *), and this is not what we
need here.

^ permalink raw reply

* [PATCH v2 0/2] dp83640: Get pin and master/slave configuration from DT
From: Stefan Sørensen @ 2014-02-11 15:29 UTC (permalink / raw)
  To: richardcochran, grant.likely, robh+dt, mark.rutland, netdev,
	linux-kernel, devicetree
  Cc: Stefan Sørensen

This patch series add DT configuration to the DP83640 PHY driver and makes
the configuration of periodic output pins dynamic.

Changes since v1:
 - Add bindings documentation
 - Keep module parameters
 - Rename gpio->pin
 - Split patch into DT part and dynamic periodic output support

Stefan


Stefan Sørensen (2):
  dp83640: Support a configurable number of periodic outputs
  dp83640: Get pin and master/slave configuration from DT

 Documentation/devicetree/bindings/net/dp83640.txt |  29 +++
 drivers/net/phy/dp83640.c                         | 205 +++++++++++++++-------
 2 files changed, 175 insertions(+), 59 deletions(-)
 create mode 100644 Documentation/devicetree/bindings/net/dp83640.txt

-- 
1.8.5.3

^ permalink raw reply

* [PATCH v2 1/2] dp83640: Support a configurable number of periodic outputs
From: Stefan Sørensen @ 2014-02-11 15:29 UTC (permalink / raw)
  To: richardcochran, grant.likely, robh+dt, mark.rutland, netdev,
	linux-kernel, devicetree
  Cc: Stefan Sørensen
In-Reply-To: <1392132562-23644-1-git-send-email-stefan.sorensen@spectralink.com>

The driver is currently limited to a single periodic output. This patch makes
the number of peridodic output dynamic by dropping the gpio_tab module
parameter and adding calibrate_pin, perout_pins, and extts_pins parameters.

Signed-off-by: Stefan Sørensen <stefan.sorensen@spectralink.com>
---
 drivers/net/phy/dp83640.c | 75 ++++++++++++++++++++---------------------------
 1 file changed, 32 insertions(+), 43 deletions(-)

diff --git a/drivers/net/phy/dp83640.c b/drivers/net/phy/dp83640.c
index 547725f..d4fe95d 100644
--- a/drivers/net/phy/dp83640.c
+++ b/drivers/net/phy/dp83640.c
@@ -38,15 +38,11 @@
 #define LAYER4		0x02
 #define LAYER2		0x01
 #define MAX_RXTS	64
-#define N_EXT_TS	6
+#define N_EXT		8
 #define PSF_PTPVER	2
 #define PSF_EVNT	0x4000
 #define PSF_RX		0x2000
 #define PSF_TX		0x1000
-#define EXT_EVENT	1
-#define CAL_EVENT	7
-#define CAL_TRIGGER	7
-#define PER_TRIGGER	6
 
 #define MII_DP83640_MICR 0x11
 #define MII_DP83640_MISR 0x12
@@ -146,32 +142,24 @@ struct dp83640_clock {
 	struct ptp_clock *ptp_clock;
 };
 
-/* globals */
-
-enum {
-	CALIBRATE_GPIO,
-	PEROUT_GPIO,
-	EXTTS0_GPIO,
-	EXTTS1_GPIO,
-	EXTTS2_GPIO,
-	EXTTS3_GPIO,
-	EXTTS4_GPIO,
-	EXTTS5_GPIO,
-	GPIO_TABLE_SIZE
-};
 
 static int chosen_phy = -1;
-static ushort gpio_tab[GPIO_TABLE_SIZE] = {
-	1, 2, 3, 4, 8, 9, 10, 11
-};
+static int calibrate_pin = 1;
+static int perout_pins[N_EXT] = {2};
+static int n_per_out = 1;
+static int extts_pins[N_EXT] = {3, 4, 8, 9, 10, 11};
+static int n_ext_ts = 6;
 
 module_param(chosen_phy, int, 0444);
-module_param_array(gpio_tab, ushort, NULL, 0444);
+module_param(calibrate_pin, int, 0444);
+module_param_array(perout_pins, int, &n_per_out, 0444);
+module_param_array(extts_pins, int, &n_ext_ts, 0444);
 
 MODULE_PARM_DESC(chosen_phy, \
 	"The address of the PHY to use for the ancillary clock features");
-MODULE_PARM_DESC(gpio_tab, \
-	"Which GPIO line to use for which purpose: cal,perout,extts1,...,extts6");
+MODULE_PARM_DESC(calibrate_pin, "Which pin to use for calibration");
+MODULE_PARM_DESC(perout_pins, "Which pins to use for periodic output");
+MODULE_PARM_DESC(extts_pins, "Which pins to use for external timestamping");
 
 /* a list of clocks and a mutex to protect it */
 static LIST_HEAD(phyter_clocks);
@@ -267,15 +255,16 @@ static u64 phy2txts(struct phy_txts *p)
 }
 
 static void periodic_output(struct dp83640_clock *clock,
-			    struct ptp_clock_request *clkreq, bool on)
+			    struct ptp_clock_request *clkreq, int index,
+			    bool on)
 {
 	struct dp83640_private *dp83640 = clock->chosen;
 	struct phy_device *phydev = dp83640->phydev;
 	u32 sec, nsec, period;
 	u16 gpio, ptp_trig, trigger, val;
 
-	gpio = on ? gpio_tab[PEROUT_GPIO] : 0;
-	trigger = PER_TRIGGER;
+	gpio = on ? perout_pins[index] : 0;
+	trigger = n_ext_ts + index;
 
 	ptp_trig = TRIG_WR |
 		(trigger & TRIG_CSEL_MASK) << TRIG_CSEL_SHIFT |
@@ -430,12 +419,12 @@ static int ptp_dp83640_enable(struct ptp_clock_info *ptp,
 	switch (rq->type) {
 	case PTP_CLK_REQ_EXTTS:
 		index = rq->extts.index;
-		if (index < 0 || index >= N_EXT_TS)
+		if (index < 0 || index >= n_ext_ts)
 			return -EINVAL;
-		event_num = EXT_EVENT + index;
+		event_num = index;
 		evnt = EVNT_WR | (event_num & EVNT_SEL_MASK) << EVNT_SEL_SHIFT;
 		if (on) {
-			gpio_num = gpio_tab[EXTTS0_GPIO + index];
+			gpio_num = extts_pins[index];
 			evnt |= (gpio_num & EVNT_GPIO_MASK) << EVNT_GPIO_SHIFT;
 			evnt |= EVNT_RISE;
 		}
@@ -443,9 +432,10 @@ static int ptp_dp83640_enable(struct ptp_clock_info *ptp,
 		return 0;
 
 	case PTP_CLK_REQ_PEROUT:
-		if (rq->perout.index != 0)
+		index = rq->perout.index;
+		if (index < 0 || index >= n_per_out)
 			return -EINVAL;
-		periodic_output(clock, rq, on);
+		periodic_output(clock, rq, index, on);
 		return 0;
 
 	default:
@@ -538,10 +528,9 @@ static void recalibrate(struct dp83640_clock *clock)
 	struct list_head *this;
 	struct dp83640_private *tmp;
 	struct phy_device *master = clock->chosen->phydev;
-	u16 cal_gpio, cfg0, evnt, ptp_trig, trigger, val;
+	u16 cfg0, evnt, ptp_trig, trigger, val;
 
-	trigger = CAL_TRIGGER;
-	cal_gpio = gpio_tab[CALIBRATE_GPIO];
+	trigger = n_ext_ts + n_per_out;
 
 	mutex_lock(&clock->extreg_lock);
 
@@ -564,8 +553,8 @@ static void recalibrate(struct dp83640_clock *clock)
 	 * enable an event timestamp
 	 */
 	evnt = EVNT_WR | EVNT_RISE | EVNT_SINGLE;
-	evnt |= (CAL_EVENT & EVNT_SEL_MASK) << EVNT_SEL_SHIFT;
-	evnt |= (cal_gpio & EVNT_GPIO_MASK) << EVNT_GPIO_SHIFT;
+	evnt |= (trigger & EVNT_SEL_MASK) << EVNT_SEL_SHIFT;
+	evnt |= (calibrate_pin & EVNT_GPIO_MASK) << EVNT_GPIO_SHIFT;
 
 	list_for_each(this, &clock->phylist) {
 		tmp = list_entry(this, struct dp83640_private, list);
@@ -578,7 +567,7 @@ static void recalibrate(struct dp83640_clock *clock)
 	 */
 	ptp_trig = TRIG_WR | TRIG_IF_LATE | TRIG_PULSE;
 	ptp_trig |= (trigger  & TRIG_CSEL_MASK) << TRIG_CSEL_SHIFT;
-	ptp_trig |= (cal_gpio & TRIG_GPIO_MASK) << TRIG_GPIO_SHIFT;
+	ptp_trig |= (calibrate_pin & TRIG_GPIO_MASK) << TRIG_GPIO_SHIFT;
 	ext_write(0, master, PAGE5, PTP_TRIG, ptp_trig);
 
 	/* load trigger */
@@ -642,7 +631,7 @@ static void recalibrate(struct dp83640_clock *clock)
 
 static inline u16 exts_chan_to_edata(int ch)
 {
-	return 1 << ((ch + EXT_EVENT) * 2);
+	return 1 << ((ch) * 2);
 }
 
 static int decode_evnt(struct dp83640_private *dp83640,
@@ -676,14 +665,14 @@ static int decode_evnt(struct dp83640_private *dp83640,
 		parsed = words + 2;
 	} else {
 		parsed = words + 1;
-		i = ((ests >> EVNT_NUM_SHIFT) & EVNT_NUM_MASK) - EXT_EVENT;
+		i = ((ests >> EVNT_NUM_SHIFT) & EVNT_NUM_MASK);
 		ext_status = exts_chan_to_edata(i);
 	}
 
 	event.type = PTP_CLOCK_EXTTS;
 	event.timestamp = phy2txts(&dp83640->edata);
 
-	for (i = 0; i < N_EXT_TS; i++) {
+	for (i = 0; i < n_ext_ts; i++) {
 		if (ext_status & exts_chan_to_edata(i)) {
 			event.index = i;
 			ptp_clock_event(dp83640->clock->ptp_clock, &event);
@@ -889,8 +878,8 @@ static void dp83640_clock_init(struct dp83640_clock *clock, struct mii_bus *bus)
 	sprintf(clock->caps.name, "dp83640 timer");
 	clock->caps.max_adj	= 1953124;
 	clock->caps.n_alarm	= 0;
-	clock->caps.n_ext_ts	= N_EXT_TS;
-	clock->caps.n_per_out	= 1;
+	clock->caps.n_ext_ts	= n_ext_ts;
+	clock->caps.n_per_out	= n_per_out;
 	clock->caps.pps		= 0;
 	clock->caps.adjfreq	= ptp_dp83640_adjfreq;
 	clock->caps.adjtime	= ptp_dp83640_adjtime;
-- 
1.8.5.3

^ permalink raw reply related

* [PATCH v2 2/2] dp83640: Get pin and master/slave configuration from DT
From: Stefan Sørensen @ 2014-02-11 15:29 UTC (permalink / raw)
  To: richardcochran, grant.likely, robh+dt, mark.rutland, netdev,
	linux-kernel, devicetree
  Cc: Stefan Sørensen
In-Reply-To: <1392132562-23644-1-git-send-email-stefan.sorensen@spectralink.com>

This patch adds configuration of the periodic output and external timestamp
pins available on the dp83640 family of PHYs. It also configures the
master/slave relationship in a group of PHYs on the same MDIO bus and the pins
used for clock calibration in the group.

The configuration is retrieved from DT through the properties
    dp83640,slave
    dp83640,calibrate-pin
    dp83640,perout-pins
    dp83640,extts-pins
The configuration module parameters are retained as fallback for the non-DT
case.

Since the pin configuration is now stored for each clock device, groups of
devices on different mdio busses can now have different pin configurations.

Signed-off-by: Stefan Sørensen <stefan.sorensen@spectralink.com>
---
 Documentation/devicetree/bindings/net/dp83640.txt |  29 +++++
 drivers/net/phy/dp83640.c                         | 152 ++++++++++++++++++----
 2 files changed, 154 insertions(+), 27 deletions(-)
 create mode 100644 Documentation/devicetree/bindings/net/dp83640.txt

diff --git a/Documentation/devicetree/bindings/net/dp83640.txt b/Documentation/devicetree/bindings/net/dp83640.txt
new file mode 100644
index 0000000..b9a57c0
--- /dev/null
+++ b/Documentation/devicetree/bindings/net/dp83640.txt
@@ -0,0 +1,29 @@
+Required properties for the National DP83640 ethernet phy:
+
+- compatible : Must contain "national,dp83640"
+
+Optional properties:
+
+- dp83640,slave: If present, this phy will be slave to another dp83640
+  on the same mdio bus.
+- dp83640,perout-pins : List of the pin pins used for periodic output
+  triggers.
+- dp83640,extts-pins : List of the pin pins used for external event
+  timestamping.
+- dp83640,calibrate-pin : The pin used for master/slave calibration.
+
+Example:
+
+	ethernet-phy@1 {
+		compatible = "national,dp83640", "ethernet-phy-ieee802.3-c22";
+		reg = <1>;
+		dp83640,perout-pins = <2>;
+		dp83640,extts-pins = <3 4 8 9 10 11>;
+		dp83640,calibrate-pin = <1>;
+	};
+
+	ethernet-phy@2 {
+		compatible = "national,dp83640", "ethernet-phy-ieee802.3-c22";
+		reg = <2>;
+		dp83640,slave;
+	};
diff --git a/drivers/net/phy/dp83640.c b/drivers/net/phy/dp83640.c
index d4fe95d..a9bd553 100644
--- a/drivers/net/phy/dp83640.c
+++ b/drivers/net/phy/dp83640.c
@@ -30,6 +30,7 @@
 #include <linux/phy.h>
 #include <linux/ptp_classify.h>
 #include <linux/ptp_clock_kernel.h>
+#include <linux/of_device.h>
 
 #include "dp83640_reg.h"
 
@@ -119,6 +120,8 @@ struct dp83640_private {
 	/* queues of incoming and outgoing packets */
 	struct sk_buff_head rx_queue;
 	struct sk_buff_head tx_queue;
+	/* is this phyter a slave */
+	bool slave;
 };
 
 struct dp83640_clock {
@@ -140,6 +143,7 @@ struct dp83640_clock {
 	struct list_head phylist;
 	/* reference to our PTP hardware clock */
 	struct ptp_clock *ptp_clock;
+	u32 perout_pins[N_EXT], extts_pins[N_EXT], calibrate_pin;
 };
 
 
@@ -263,8 +267,8 @@ static void periodic_output(struct dp83640_clock *clock,
 	u32 sec, nsec, period;
 	u16 gpio, ptp_trig, trigger, val;
 
-	gpio = on ? perout_pins[index] : 0;
-	trigger = n_ext_ts + index;
+	gpio = on ? clock->perout_pins[index] : 0;
+	trigger = clock->caps.n_ext_ts + index;
 
 	ptp_trig = TRIG_WR |
 		(trigger & TRIG_CSEL_MASK) << TRIG_CSEL_SHIFT |
@@ -419,12 +423,12 @@ static int ptp_dp83640_enable(struct ptp_clock_info *ptp,
 	switch (rq->type) {
 	case PTP_CLK_REQ_EXTTS:
 		index = rq->extts.index;
-		if (index < 0 || index >= n_ext_ts)
+		if (index < 0 || index >= clock->caps.n_ext_ts)
 			return -EINVAL;
 		event_num = index;
 		evnt = EVNT_WR | (event_num & EVNT_SEL_MASK) << EVNT_SEL_SHIFT;
 		if (on) {
-			gpio_num = extts_pins[index];
+			gpio_num = clock->extts_pins[index];
 			evnt |= (gpio_num & EVNT_GPIO_MASK) << EVNT_GPIO_SHIFT;
 			evnt |= EVNT_RISE;
 		}
@@ -433,7 +437,7 @@ static int ptp_dp83640_enable(struct ptp_clock_info *ptp,
 
 	case PTP_CLK_REQ_PEROUT:
 		index = rq->perout.index;
-		if (index < 0 || index >= n_per_out)
+		if (index < 0 || index >= clock->caps.n_per_out)
 			return -EINVAL;
 		periodic_output(clock, rq, index, on);
 		return 0;
@@ -530,7 +534,7 @@ static void recalibrate(struct dp83640_clock *clock)
 	struct phy_device *master = clock->chosen->phydev;
 	u16 cfg0, evnt, ptp_trig, trigger, val;
 
-	trigger = n_ext_ts + n_per_out;
+	trigger = clock->caps.n_ext_ts + clock->caps.n_per_out;
 
 	mutex_lock(&clock->extreg_lock);
 
@@ -554,7 +558,7 @@ static void recalibrate(struct dp83640_clock *clock)
 	 */
 	evnt = EVNT_WR | EVNT_RISE | EVNT_SINGLE;
 	evnt |= (trigger & EVNT_SEL_MASK) << EVNT_SEL_SHIFT;
-	evnt |= (calibrate_pin & EVNT_GPIO_MASK) << EVNT_GPIO_SHIFT;
+	evnt |= (clock->calibrate_pin & EVNT_GPIO_MASK) << EVNT_GPIO_SHIFT;
 
 	list_for_each(this, &clock->phylist) {
 		tmp = list_entry(this, struct dp83640_private, list);
@@ -567,7 +571,7 @@ static void recalibrate(struct dp83640_clock *clock)
 	 */
 	ptp_trig = TRIG_WR | TRIG_IF_LATE | TRIG_PULSE;
 	ptp_trig |= (trigger  & TRIG_CSEL_MASK) << TRIG_CSEL_SHIFT;
-	ptp_trig |= (calibrate_pin & TRIG_GPIO_MASK) << TRIG_GPIO_SHIFT;
+	ptp_trig |= (clock->calibrate_pin & TRIG_GPIO_MASK) << TRIG_GPIO_SHIFT;
 	ext_write(0, master, PAGE5, PTP_TRIG, ptp_trig);
 
 	/* load trigger */
@@ -672,7 +676,7 @@ static int decode_evnt(struct dp83640_private *dp83640,
 	event.type = PTP_CLOCK_EXTTS;
 	event.timestamp = phy2txts(&dp83640->edata);
 
-	for (i = 0; i < n_ext_ts; i++) {
+	for (i = 0; i < dp83640->clock->caps.n_ext_ts; i++) {
 		if (ext_status & exts_chan_to_edata(i)) {
 			event.index = i;
 			ptp_clock_event(dp83640->clock->ptp_clock, &event);
@@ -874,12 +878,11 @@ static void dp83640_clock_init(struct dp83640_clock *clock, struct mii_bus *bus)
 	mutex_init(&clock->extreg_lock);
 	mutex_init(&clock->clock_lock);
 	INIT_LIST_HEAD(&clock->phylist);
+	clock->calibrate_pin = -1;
 	clock->caps.owner = THIS_MODULE;
 	sprintf(clock->caps.name, "dp83640 timer");
 	clock->caps.max_adj	= 1953124;
 	clock->caps.n_alarm	= 0;
-	clock->caps.n_ext_ts	= n_ext_ts;
-	clock->caps.n_per_out	= n_per_out;
 	clock->caps.pps		= 0;
 	clock->caps.adjfreq	= ptp_dp83640_adjfreq;
 	clock->caps.adjtime	= ptp_dp83640_adjtime;
@@ -892,18 +895,6 @@ static void dp83640_clock_init(struct dp83640_clock *clock, struct mii_bus *bus)
 	get_device(&bus->dev);
 }
 
-static int choose_this_phy(struct dp83640_clock *clock,
-			   struct phy_device *phydev)
-{
-	if (chosen_phy == -1 && !clock->chosen)
-		return 1;
-
-	if (chosen_phy == phydev->addr)
-		return 1;
-
-	return 0;
-}
-
 static struct dp83640_clock *dp83640_clock_get(struct dp83640_clock *clock)
 {
 	if (clock)
@@ -949,6 +940,95 @@ static void dp83640_clock_put(struct dp83640_clock *clock)
 	mutex_unlock(&clock->clock_lock);
 }
 
+#ifdef CONFIG_OF
+static int dp83640_probe_dt(struct device_node *node,
+			    struct dp83640_private *dp83640)
+{
+	struct dp83640_clock *clock = dp83640->clock;
+	struct property *prop;
+	int err, proplen;
+
+	dp83640->slave = of_property_read_bool(node, "dp83640,slave");
+	if (!dp83640->slave && clock->chosen) {
+		pr_err("dp83640,slave must be set if more than one device on the same bus");
+		return -EINVAL;
+	}
+
+	prop = of_find_property(node, "dp83640,perout-pins", &proplen);
+	if (prop) {
+		if (dp83640->slave) {
+			pr_err("dp83640,perout-pins property can not be set together with dp83640,slave");
+			return -EINVAL;
+		}
+
+		clock->caps.n_per_out = proplen / sizeof(u32);
+		if (clock->caps.n_per_out > N_EXT) {
+			pr_err("dp83640,perout-pins may not have more than %d entries",
+			       N_EXT);
+			return -EINVAL;
+		}
+		err = of_property_read_u32_array(node, "dp83640,perout-pins",
+						 clock->perout_pins,
+						 clock->caps.n_per_out);
+		if (err < 0)
+			return err;
+	}
+
+	prop = of_find_property(node, "dp83640,extts-pins", &proplen);
+	if (prop) {
+		if (dp83640->slave) {
+			pr_err("dp83640,extts-pins property can not be set together with dp83640,slave");
+			return -EINVAL;
+		}
+
+		clock->caps.n_ext_ts = proplen / sizeof(u32);
+		if (clock->caps.n_ext_ts > N_EXT) {
+			pr_err("dp83640,extts-pins may not have more than %d entries",
+			       N_EXT);
+			return -EINVAL;
+		}
+		err = of_property_read_u32_array(node, "dp83640,extts-pins",
+						 clock->extts_pins,
+						 clock->caps.n_ext_ts);
+		if (err < 0)
+			return err;
+	}
+
+	prop = of_find_property(node, "dp83640,calibrate-pin", &proplen);
+	if (prop) {
+		if (dp83640->slave) {
+			pr_err("dp83640,calibrate-pin property can not be set together with dp83640,slave");
+			return -EINVAL;
+		}
+		of_property_read_u32(node, "dp83640,calibrate-pin",
+				     &clock->calibrate_pin);
+	}
+
+	if (!dp83640->slave) {
+		if (clock->caps.n_per_out + clock->caps.n_ext_ts +
+		    (clock->calibrate_pin != -1 ? 1 : 0) > N_EXT) {
+			pr_err("Too many pins configured");
+			return -EINVAL;
+		}
+	}
+
+	return 0;
+}
+
+static const struct of_device_id dp83640_of_match_table[] = {
+	{ .compatible = "national,dp83640", },
+	{ },
+};
+MODULE_DEVICE_TABLE(of, dp83640_of_match_table);
+#else
+
+static inline int dp83640_probe_dt(struct device_node *node,
+				   struct dp83640_private *dp83640)
+{
+	return 0;
+}
+#endif
+
 static int dp83640_probe(struct phy_device *phydev)
 {
 	struct dp83640_clock *clock;
@@ -966,7 +1046,24 @@ static int dp83640_probe(struct phy_device *phydev)
 	if (!dp83640)
 		goto no_memory;
 
+	dp83640->clock = clock;
 	dp83640->phydev = phydev;
+
+	if (phydev->dev.of_node) {
+		err = dp83640_probe_dt(phydev->dev.of_node, dp83640);
+		if (err)
+			return err;
+	} else {
+		clock->calibrate_pin = calibrate_pin;
+		memcpy(clock->perout_pins, perout_pins,
+		       sizeof(clock->perout_pins));
+		memcpy(clock->extts_pins, extts_pins,
+		       sizeof(clock->extts_pins));
+		if (clock->chosen ||
+		    (chosen_phy != -1 && phydev->addr != chosen_phy))
+			dp83640->slave = true;
+	}
+
 	INIT_WORK(&dp83640->ts_work, rx_timestamp_work);
 
 	INIT_LIST_HEAD(&dp83640->rxts);
@@ -980,9 +1077,7 @@ static int dp83640_probe(struct phy_device *phydev)
 	skb_queue_head_init(&dp83640->rx_queue);
 	skb_queue_head_init(&dp83640->tx_queue);
 
-	dp83640->clock = clock;
-
-	if (choose_this_phy(clock, phydev)) {
+	if (!dp83640->slave) {
 		clock->chosen = dp83640;
 		clock->ptp_clock = ptp_clock_register(&clock->caps, &phydev->dev);
 		if (IS_ERR(clock->ptp_clock)) {
@@ -1327,7 +1422,10 @@ static struct phy_driver dp83640_driver = {
 	.hwtstamp	= dp83640_hwtstamp,
 	.rxtstamp	= dp83640_rxtstamp,
 	.txtstamp	= dp83640_txtstamp,
-	.driver		= {.owner = THIS_MODULE,}
+	.driver		= {
+		.owner = THIS_MODULE,
+		.of_match_table = of_match_ptr(dp83640_of_match_table),
+	}
 };
 
 static int __init dp83640_init(void)
-- 
1.8.5.3

^ permalink raw reply related

* Re: [PATCH] tun: use netif_receive_skb instead of netif_rx_ni
From: Michael S. Tsirkin @ 2014-02-11 15:47 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Qin Chuanyu, davem, jasowang, Anthony Liguori, KVM list, netdev
In-Reply-To: <1392132244.6615.83.camel@edumazet-glaptop2.roam.corp.google.com>

On Tue, Feb 11, 2014 at 07:24:04AM -0800, Eric Dumazet wrote:
> On Tue, 2014-02-11 at 22:25 +0800, Qin Chuanyu wrote:
> > we could xmit directly instead of going through softirq to gain 
> > throughput and lantency improved.
> > test model: VM-Host-Host just do transmit. with vhost thread and nic
> > interrupt bind cpu1. netperf do throuhput test and qperf do lantency test.
> > Host OS: suse11sp3, Guest OS: suse11sp3
> > 
> > latency result(us):
> > packet_len 64 256 512 1460
> > old(UDP)   44  47  48   66
> > new(UDP)   38  41  42   66
> > 
> > old(TCP)   52  55  70  117
> > new(TCP)   45  48  61  114
> > 
> > throughput result(Gbit/s):
> > packet_len   64   512   1024   1460
> > old(UDP)   0.42  2.02   3.75   4.68
> > new(UDP)   0.45  2.14   3.77   5.06
> > 
> > TCP due to the latency, client couldn't send packet big enough
> > to get benefit from TSO of nic, so the result show it will send
> > more packet per sencond but get lower throughput.
> > 
> > Eric mentioned that it would has problem with cgroup, but the patch
> > had been sent by Herbert Xu.
> > patch_id f845172531fb7410c7fb7780b1a6e51ee6df7d52
> > 
> > Signed-off-by: Chuanyu Qin <qinchuanyu@huawei.com>
> > ---
> >   drivers/net/tun.c |    4 +++-
> >   1 files changed, 3 insertions(+), 1 deletions(-)
> > 
> > diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> > index 44c4db8..90b4e58 100644
> > --- a/drivers/net/tun.c
> > +++ b/drivers/net/tun.c
> > @@ -1184,7 +1184,9 @@ static ssize_t tun_get_user(struct tun_struct 
> > *tun, struct tun_file *tfile,
> >   	skb_probe_transport_header(skb, 0);
> > 
> >   	rxhash = skb_get_hash(skb);
> > -	netif_rx_ni(skb);
> > +	rcu_read_lock_bh();
> > +	netif_receive_skb(skb);
> > +	rcu_read_unlock_bh();
> > 
> >   	tun->dev->stats.rx_packets++;
> >   	tun->dev->stats.rx_bytes += len;
> 
> I already said this patch is not good :
> 
> rcu_read_lock_bh() makes no sense here.
> 
> What is really needed is local_bh_disable();
> 
> Herbert patch ( http://patchwork.ozlabs.org/patch/52963/ ) had a much
> cleaner form.
> 
> Just use it, CC him, credit him, please ?
> 

But not before making sure (testing) that patch does not break the
cgroups classifier please.

-- 
MST

^ permalink raw reply

* Re: [PATCH v2 2/2] sctp: optimize the sctp_sysctl_net_register
From: Sergei Shtylyov @ 2014-02-11 16:47 UTC (permalink / raw)
  To: Wang Weidong, nhorman, davem, vyasevich; +Cc: dborkman, netdev
In-Reply-To: <1392083346-9420-3-git-send-email-wangweidong1@huawei.com>

Hello.

On 02/11/2014 04:49 AM, Wang Weidong wrote:

> Here, when the net is init_net, we needn't to kmemdup the ctl_table
> again. So add a check for net. Also we can save some memory.

> Signed-off-by: Wang Weidong <wangweidong1@huawei.com>
> ---
>   net/sctp/sysctl.c | 16 +++++++++-------
>   1 file changed, 9 insertions(+), 7 deletions(-)

> diff --git a/net/sctp/sysctl.c b/net/sctp/sysctl.c
> index 2ddb401..b65396b 100644
> --- a/net/sctp/sysctl.c
> +++ b/net/sctp/sysctl.c
> @@ -403,15 +403,17 @@ static int proc_sctp_do_rto_max(struct ctl_table *ctl, int write,
>
>   int sctp_sysctl_net_register(struct net *net)
>   {
> -	struct ctl_table *table;
> -	int i;
> +	struct ctl_table *table = sctp_net_table;
>
> -	table = kmemdup(sctp_net_table, sizeof(sctp_net_table), GFP_KERNEL);
> -	if (!table)
> -		return -ENOMEM;
> +	if (!net_eq(net, &init_net)) {
> +		int i;

    Empty line after declaration wouldn't hurt, just like above.

> +		table = kmemdup(sctp_net_table, sizeof(sctp_net_table), GFP_KERNEL);
> +		if (!table)
> +			return -ENOMEM;

WBR, Sergei

^ permalink raw reply

* RE: [PATCH] net: rtnetlink: fix correct size given to memset
From: David Laight @ 2014-02-11 15:54 UTC (permalink / raw)
  To: 'Eric Dumazet', Francois-Xavier Le Bail; +Cc: NETDEV, David Miller
In-Reply-To: <1392132522.6615.85.camel@edumazet-glaptop2.roam.corp.google.com>

From: Eric Dumazet
> On Tue, 2014-02-11 at 15:06 +0100, Francois-Xavier Le Bail wrote:
> > Find by cppcheck:
> > [net/core/rtnetlink.c:1842]: (warning) Using size of pointer linkinfo instead of size of its data.
> >
> > Signed-off-by: Francois-Xavier Le Bail <fx.lebail@yahoo.com>
> > ---
> >
> > The diagnosis of cppcheck seems relevant.
> >
> >  net/core/rtnetlink.c |    2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
> > index 048dc8d..9a5dbf1 100644
> > --- a/net/core/rtnetlink.c
> > +++ b/net/core/rtnetlink.c
> > @@ -1839,7 +1839,7 @@ replay:
> >  		if (err < 0)
> >  			return err;
> >  	} else
> > -		memset(linkinfo, 0, sizeof(linkinfo));
> > +		memset(linkinfo, 0, sizeof(*linkinfo));
> >
> >  	if (linkinfo[IFLA_INFO_KIND]) {
> >  		nla_strlcpy(kind, linkinfo[IFLA_INFO_KIND], sizeof(kind));
> 
> 
> This is absolutely wrong.
> 
> 	struct nlattr *linkinfo[IFLA_INFO_MAX+1];
> 
> sizeof(linkinfo) is totally appropriate
> 
> sizeof(*linkinfo) is equal to sizeof(void *), and this is not what we
> need here.

I guess that changing it to:
	memset(&linkinfo, 0, sizeof(linkinfo));

might make the tool happy, but it really ought to be taught about arrays.

	David


^ permalink raw reply

* Re: [PATCH] net: rtnetlink: fix correct size given to memset
From: François-Xavier Le Bail @ 2014-02-11 16:01 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: NETDEV, David Miller
In-Reply-To: <1392132522.6615.85.camel@edumazet-glaptop2.roam.corp.google.com>

> From: Eric Dumazet <eric.dumazet@gmail.com>

> To: Francois-Xavier Le Bail <fx.lebail@yahoo.com>
>>  Find by cppcheck:
>>  [net/core/rtnetlink.c:1842]: (warning) Using size of pointer linkinfo 
> instead of size of its data.
>> 
>>  Signed-off-by: Francois-Xavier Le Bail <fx.lebail@yahoo.com>
>>  ---
>> 
>>  The diagnosis of cppcheck seems relevant.
>> 
>>   net/core/rtnetlink.c |    2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>> 
> This is absolutely wrong.
> 
>     struct nlattr *linkinfo[IFLA_INFO_MAX+1];
> 
> sizeof(linkinfo) is totally appropriate 
> 
> sizeof(*linkinfo) is equal to sizeof(void *), and this is not what we
> need here.

Oops, sorry for the noise.

^ permalink raw reply

* Re: [PATCH] netfilter: nf_nat_snmp_basic: fix duplicates in if/else branches
From: Patrick McHardy @ 2014-02-11 16:11 UTC (permalink / raw)
  To: Francois-Xavier Le Bail, NETDEV, David Miller, netfilter-devel,
	netfilter, coreteam
In-Reply-To: <1392130165-4313-1-git-send-email-fx.lebail@yahoo.com>

On 11. Februar 2014 14:49:25 GMT+00:00, Francois-Xavier Le Bail <fx.lebail@yahoo.com> wrote:
>The solution was found by Patrick in 2.4 kernel sources.
>
>Cc: Patrick McHardy <kaber@trash.net>

Acked-by: Patrick McHardy <kaber@trash.net>

>Signed-off-by: Francois-Xavier Le Bail <fx.lebail@yahoo.com>
>---
> net/ipv4/netfilter/nf_nat_snmp_basic.c |    4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
>diff --git a/net/ipv4/netfilter/nf_nat_snmp_basic.c
>b/net/ipv4/netfilter/nf_nat_snmp_basic.c
>index d551e31..7c67667 100644
>--- a/net/ipv4/netfilter/nf_nat_snmp_basic.c
>+++ b/net/ipv4/netfilter/nf_nat_snmp_basic.c
>@@ -1198,8 +1198,8 @@ static int snmp_translate(struct nf_conn *ct,
> 		map.to = NOCT1(&ct->tuplehash[!dir].tuple.dst.u3.ip);
> 	} else {
> 		/* DNAT replies */
>-		map.from = NOCT1(&ct->tuplehash[dir].tuple.src.u3.ip);
>-		map.to = NOCT1(&ct->tuplehash[!dir].tuple.dst.u3.ip);
>+		map.from = NOCT1(&ct->tuplehash[!dir].tuple.src.u3.ip);
>+		map.to = NOCT1(&ct->tuplehash[dir].tuple.dst.u3.ip);
> 	}
> 
> 	if (map.from == map.to)
>--
>To unsubscribe from this list: send the line "unsubscribe
>netfilter-devel" in
>the body of a message to majordomo@vger.kernel.org
>More majordomo info at  http://vger.kernel.org/majordomo-info.html



^ permalink raw reply

* Re: RFC: bridge get fdb by bridge device
From: Jamal Hadi Salim @ 2014-02-11 17:03 UTC (permalink / raw)
  To: John Fastabend
  Cc: netdev@vger.kernel.org, vyasevic, Stephen Hemminger,
	Scott Feldman, John Fastabend
In-Reply-To: <52F7D7EE.7010302@gmail.com>

On 02/09/14 14:33, John Fastabend wrote:
> On 02/09/2014 07:06 AM, Jamal Hadi Salim wrote:

>
>> +    if (ndm->ndm_ifindex) {
>> +        dev = __dev_get_by_index(net, ndm->ndm_ifindex);
>> +        if (dev == NULL) {
>> +            pr_info("PF_BRIDGE: RTM_GETNEIGH with unknown ifindex\n");
>> +            return -ENODEV;
>> +        }
>> +
>> +        if (!(dev->priv_flags & IFF_EBRIDGE)) {
>
> Can we drop this 'if case' and just use the 'if (ops->ndo_fdb_dump)'
> below? IFF_EBRIDGE is specific to ./net/bridge so it will fail for
> macvlans and I think the command is useful in both cases.
>

My only worry is:
The target ndm_ifindex is to a bridge device i.e something
that has an "fdb" (user space says "br X" then we use X to
mean we are talking about a bridge device. This is what
brctl showmacs <X> means.
I know this doesnt seem right relative to current point of
view where the target is the bridge port ifindex.
I'd be more than happy to make the change if it makes sense,
but i am struggling with that thought.

>
>> +            pr_info("PF_BRIDGE: RTM_GETNEIGH %s not a bridge device\n",
>> +                dev->name);
>> +            return -EINVAL;
>>          }
>> +        ops = dev->netdev_ops;
>> +        if (ops->ndo_fdb_dump) {
>> +            idx = ops->ndo_fdb_dump(skb, cb, dev, idx);
>> +        } else {
>
> Is there any problem with using the ndo_dflt_fdb_dump() in the else
> here?
>

If the assumption is the target ifindex is to a bridge device, I would
expect it MUST have a ops->ndo_fdb_dump(), no?
the default dumper deals with bridge ports (which may be bridges, but
if i wanted their databases i would address them)

> Userspace should be able to easily learn which ports are ebridge ports
> so I don't think that should be an issue. Anyways with the above
> IFF_EBRIDGE check you should never hit this else case although I think
> its safe to drop the above check as noted.
>

Right - so if i can find out which is an ebridge i can send it the
same request.

cheers,
jamal

^ permalink raw reply

* Re: RFC: bridge get fdb by bridge device
From: Jamal Hadi Salim @ 2014-02-11 17:07 UTC (permalink / raw)
  To: vyasevic, netdev@vger.kernel.org
  Cc: Stephen Hemminger, Scott Feldman, John Fastabend
In-Reply-To: <52F8FEF1.60407@redhat.com>

On 02/10/14 11:31, Vlad Yasevich wrote:
> On 02/09/2014 10:06 AM, Jamal Hadi Salim wrote:


>> +	ndm = nlmsg_data(cb->nlh);
>> +	if (ndm->ndm_ifindex) {
>
> We get really lucky here that ndm_ifindex and ifi_index happen to map to
> the same location.
>

Didnt follow - but I have a feeling you are looking at the reference
point of a bridge port.
Note as per my response to John: The target is a bridge device, not
a bridge port.



>
> I agree with both of Johns commens fro the above code.
> I think you can use ndo_dflt_fdb_dump() here and remove the first check
> for IFF_EBRIDGE.
>

Same comment i made to John. The goal is to emulate
brctl showmacs <bridge>
ndo_dflt_fdb_dump() gives me in theory all the bridge ports
unicast and multicast MAC addresses. There is a posibility that
the bridgeport is a bridge - in which case I can find out from
user space and safely request for it directly instead of via
its parent.

> The only odd thing is that it would permit syntax like:
>   # bridge fbd show br eth0
> or
>   # bridge fdb show br macvlan0
>
> but I think that's ok.

Ok, since both you and John point to macvlan - is that
considered as something with an fdb? It doesnt forward
packets between two devices.



>> diff --git a/bridge/fdb.c b/bridge/fdb.c
>> index e2e53f1..f3073d6 100644
>> --- a/bridge/fdb.c
>> +++ b/bridge/fdb.c
>> @@ -33,7 +33,7 @@ static void usage(void)
>>   	fprintf(stderr, "Usage: bridge fdb { add | append | del | replace }
> ADDR dev DEV {self|master} [ temp ]\n"
>>   		        "              [router] [ dst IPADDR] [ vlan VID ]\n"
>>   		        "              [ port PORT] [ vni VNI ] [via DEV]\n");
>> -	fprintf(stderr, "       bridge fdb {show} [ dev DEV ]\n");
>> +	fprintf(stderr, "       bridge fdb {show} [ br BRDEV ] [ dev DEV ]\n");
>
> 'port' option is now allowed in the show operation
>

Thanks - it is already taken seems by vxlan using the same interface.


cheers,
jamal

^ permalink raw reply

* Re: [PATCH 13/13] net: Mark functions as static in net/sunrpc/svc_xprt.c
From: J. Bruce Fields @ 2014-02-11 17:12 UTC (permalink / raw)
  To: Rashika Kheria
  Cc: linux-kernel, Trond Myklebust, David S. Miller, linux-nfs, netdev,
	josh
In-Reply-To: <259f3dc38d7115eccf3aaa2aa7de2414a47f90a2.1391888654.git.rashika.kheria@gmail.com>

On Sun, Feb 09, 2014 at 01:40:54AM +0530, Rashika Kheria wrote:
> Mark functions as static in net/sunrpc/svc_xprt.c because they are not
> used outside this file.
> 
> This eliminates the following warning in net/sunrpc/svc_xprt.c:
> net/sunrpc/svc_xprt.c:574:5: warning: no previous prototype for ‘svc_alloc_arg’ [-Wmissing-prototypes]
> net/sunrpc/svc_xprt.c:615:18: warning: no previous prototype for ‘svc_get_next_xprt’ [-Wmissing-prototypes]
> net/sunrpc/svc_xprt.c:694:6: warning: no previous prototype for ‘svc_add_new_temp_xprt’ [-Wmissing-prototypes]

Thanks, applying.--b.

> 
> Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
> ---
>  net/sunrpc/svc_xprt.c |    6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
> index 80a6640..06c6ff0 100644
> --- a/net/sunrpc/svc_xprt.c
> +++ b/net/sunrpc/svc_xprt.c
> @@ -571,7 +571,7 @@ static void svc_check_conn_limits(struct svc_serv *serv)
>  	}
>  }
>  
> -int svc_alloc_arg(struct svc_rqst *rqstp)
> +static int svc_alloc_arg(struct svc_rqst *rqstp)
>  {
>  	struct svc_serv *serv = rqstp->rq_server;
>  	struct xdr_buf *arg;
> @@ -612,7 +612,7 @@ int svc_alloc_arg(struct svc_rqst *rqstp)
>  	return 0;
>  }
>  
> -struct svc_xprt *svc_get_next_xprt(struct svc_rqst *rqstp, long timeout)
> +static struct svc_xprt *svc_get_next_xprt(struct svc_rqst *rqstp, long timeout)
>  {
>  	struct svc_xprt *xprt;
>  	struct svc_pool		*pool = rqstp->rq_pool;
> @@ -691,7 +691,7 @@ struct svc_xprt *svc_get_next_xprt(struct svc_rqst *rqstp, long timeout)
>  	return xprt;
>  }
>  
> -void svc_add_new_temp_xprt(struct svc_serv *serv, struct svc_xprt *newxpt)
> +static void svc_add_new_temp_xprt(struct svc_serv *serv, struct svc_xprt *newxpt)
>  {
>  	spin_lock_bh(&serv->sv_lock);
>  	set_bit(XPT_TEMP, &newxpt->xpt_flags);
> -- 
> 1.7.9.5
> 

^ permalink raw reply

* RE: Bonding
From: Gustavo Pimentel @ 2014-02-11 17:15 UTC (permalink / raw)
  To: Veaceslav Falico; +Cc: netdev@vger.kernel.org
In-Reply-To: <20140211141549.GB29570@redhat.com>

Hi Veaceslav,

It's quite different from broadcast mode, each frame sent through the slaves has attached a Redundancy Control Trailer also known as RCT (this trailer is compose by a LAN identifier, sequence number, a LSDU size and a PRP suffix).
Also the equipment with PRP capability has to send periodically a supervision frame to both similar LANs. Each device on the network has to keep track of receive sequence numbers received, if the received a sequence number for instance from LAN A of specific device and it doesn't exist on internal table, the device should accept the frame and update the internal table. When receiving the same sequence number from the LAN B, the device should discard it, providing zero downtime redundancy.

I can supply you information about this redundancy protocol, if you like. This type of network redundancy is now being large deployed on electrical power stations (like thermal and hydro) and transmission power stations instead of teaming / bonding that depends on RSTP for redundancy.

> -----Original Message-----
> From: Veaceslav Falico [mailto:vfalico@redhat.com]
> Sent: terça-feira, 11 de Fevereiro de 2014 14:16
> To: Gustavo Pimentel
> Cc: netdev@vger.kernel.org
> Subject: Re: Bonding
> 
> On Tue, Feb 11, 2014 at 01:53:32PM +0000, Gustavo Pimentel wrote:
> >Hi,
> 
> Hi Gustavo,
> 
> >
> >I'm writing you because because I'm have implemented a new mode (PRP Parallel
> Redundancy Protocol) for bonding kernel driver. This new mode is quite simple, I
> don't know if you have heard about PRP, but it's a new standard that allows to
> overcome any single network failure without affecting the data transmission. The
> general idea resides on having two separate LAN (A & B) very similar and
> transmitting the almost the same frame through both LANs and the end device
> should accept one frame and discard the other according to a known mechanism.
> 
> Isn't that the current 'broadcast' mode, where every packet is transmitted over all
> the slaves? After quick googling/reading I don't see any difference there, though I
> might have missed something.
> 
> >
> >I have implemented this new mode on bonding driver, but I have some
> difficulties:
> >. Writing linux driver is quite new for me. I don't' know if exists guide lines for
> driver coding.
> 
> You can find everything under Documentation/, but without the code I can't tell you
> exact documents. CodingStyle and SubmittingPatches might be the first ones.
> 
> Also, try CC-ing relevant people for more feedback, specifically bonding
> maintainers.
> 
> >. I don't know how to submit the code to be include on kernel repository.
> >. Maybe another pair of eyes could find help to improve the writing code for this
> mode.
> 
> Try sending an RFC when net-next opens.
> 
> >
> >I think my driver code is 99% complete. I'm currently testing with 3 equipments (1
> pc + 1 embedded device running both my modify bonding driver) and a third party
> equipment called RedBox.
> >
> >Would you be interested in participating / helping this project?
> >
> >With my best regards,
> >
> >Gustavo Gama da Rocha Pimentel
> >Power Systems Automation / Innovation & Development Efacec Engenharia e
> >Sistemas, S.A.
> >Phone: +351229403391
> >Disclaimer
> >
> >
> >--
> >To unsubscribe from this list: send the line "unsubscribe netdev" in
> >the body of a message to majordomo@vger.kernel.org More majordomo info
> >at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: linux 3.13: problems with isatap tunnel device and UFO
From: Wolfgang Walter @ 2014-02-11 17:42 UTC (permalink / raw)
  To: netdev; +Cc: Hannes Frederic Sowa
In-Reply-To: <20140211024403.GE11150@order.stressinduktion.org>

Am Dienstag, 11. Februar 2014, 03:44:03 schrieb Hannes Frederic Sowa:
> On Sun, Feb 09, 2014 at 12:17:15AM +0100, Wolfgang Walter wrote:
> > host A (which shows the problem with kernel 3.13):
> > 
> > $ ip addr ls eth0
> > 4: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state
> > UP group default qlen 1000
> > 
> >     link/ether 11:22:33:44:55:66 brd ff:ff:ff:ff:ff:ff
> >     inet 192.168.1.1/24 brd 192.168.1.255 scope global eth0
> >        valid_lft forever preferred_lft forever
> >     inet6 2001:1111:2222:aaaa:0:5efe:c0a8:101/120 scope global
> >        valid_lft forever preferred_lft forever
> >     inet6 fe80::1322:33ff:fe44:5566/64 scope link
> >        valid_lft forever preferred_lft forever
> 
> What driver does this interface use?
> 
> ethtool -i eth0
> 

driver: r8169
version: 2.3LK-NAPI
firmware-version: rtl_nic/rtl8168e-1.fw
bus-info: 0000:05:00.0
supports-statistics: yes
supports-test: no
supports-eeprom-access: no
supports-register-dump: yes
supports-priv-flags: no

I see this also with another machine, here it is

driver: forcedeth
version: 0.64
firmware-version: 
bus-info: 0000:00:0a.0
supports-statistics: yes                                                        
supports-test: yes
supports-eeprom-access: no
supports-register-dump: yes
supports-priv-flags: no


Regards,
-- 
Wolfgang Walter
Studentenwerk München
Anstalt des öffentlichen Rechts

^ permalink raw reply

* [PATCH 0/3] net: Generalizations for GRE,GSO,GRO
From: Tom Herbert @ 2014-02-11 17:43 UTC (permalink / raw)
  To: davem, netdev; +Cc: ogerlitz

This patch set contains some preliminary patches for Generic UDP
Encapsulation support. These generalize some uses of GRE, GSO, and
GRO.

^ permalink raw reply

* [PATCH 2/3] net: UDP gro_receive accept csum=0
From: Tom Herbert @ 2014-02-11 17:43 UTC (permalink / raw)
  To: davem, netdev; +Cc: ogerlitz

The code to validate checksum in UDP gro_receive explictly checks
against driver having set CHECKSUM_COMPLETE. This does not perform
GRO on UDP packets with a checksum of zero (no checksum needed).
This patch adds the condition to allow UDP checksum to be zero.

Signed-off-by: Tom Herbert <therbert@google.com>
---
 net/ipv4/udp_offload.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
index 25f5cee..4db7796 100644
--- a/net/ipv4/udp_offload.c
+++ b/net/ipv4/udp_offload.c
@@ -156,13 +156,9 @@ static struct sk_buff **udp_gro_receive(struct sk_buff **head, struct sk_buff *s
 	unsigned int hlen, off;
 	int flush = 1;
 
-	if (NAPI_GRO_CB(skb)->udp_mark ||
-	    (!skb->encapsulation && skb->ip_summed != CHECKSUM_COMPLETE))
+	if (NAPI_GRO_CB(skb)->udp_mark)
 		goto out;
 
-	/* mark that this skb passed once through the udp gro layer */
-	NAPI_GRO_CB(skb)->udp_mark = 1;
-
 	off  = skb_gro_offset(skb);
 	hlen = off + sizeof(*uh);
 	uh   = skb_gro_header_fast(skb, off);
@@ -172,6 +168,13 @@ static struct sk_buff **udp_gro_receive(struct sk_buff **head, struct sk_buff *s
 			goto out;
 	}
 
+	if (!skb->encapsulation &&
+	    skb->ip_summed != CHECKSUM_COMPLETE && uh->check != 0)
+		goto out;
+
+	/* mark that this skb passed once through the udp gro layer */
+	NAPI_GRO_CB(skb)->udp_mark = 1;
+
 	rcu_read_lock();
 	uo_priv = rcu_dereference(udp_offload_base);
 	for (; uo_priv != NULL; uo_priv = rcu_dereference(uo_priv->next)) {
-- 
1.9.0.rc1.175.g0b1dcb5

^ permalink raw reply related

* Re: [PATCH V3] net/dt: Add support for overriding phy configuration from device tree
From: Florian Fainelli @ 2014-02-11 17:43 UTC (permalink / raw)
  To: Gerlando Falauto
  Cc: Matthew Garrett, netdev,
	devicetree-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Kishon Vijay Abraham I
In-Reply-To: <52F9E8E6.1090006-SkAbAL50j+5BDgjK7y7TUQ@public.gmane.org>

Hi Gerlando,

2014-02-11 1:09 GMT-08:00 Gerlando Falauto <gerlando.falauto-SkAbAL50j+6IwRZHo2/mJg@public.gmane.orgm>:
> Hi Florian,
>
> first of all, thank you for your answer.
>
>
> On 02/10/2014 06:09 PM, Florian Fainelli wrote:
>>
>> Hi Gerlando,
>>
>> Le lundi 10 février 2014, 17:14:59 Gerlando Falauto a écrit :
>>>
>>> Hi,
>>>
>>> I'm currently trying to fix an issue for which this patch provides a
>>> partial solution, so apologies in advance for jumping into the
>>> discussion for my own purposes...
>>>
>>> On 02/04/2014 09:39 PM, Florian Fainelli wrote:> 2014-01-17 Matthew
>>>
>>> Garrett <matthew.garrett-05XSO3Yj/JvQT0dZR+AlfA@public.gmane.org>:
>>>   >> Some hardware may be broken in interesting and board-specific ways,
>>> such
>>>   >> that various bits of functionality don't work. This patch provides a
>>>   >> mechanism for overriding mii registers during init based on the
>>>
>>> contents of
>>>
>>>   >> the device tree data, allowing board-specific fixups without having
>>> to
>>>   >> pollute generic code.
>>>   >
>>>   > It would be good to explain exactly how your hardware is broken
>>>   > exactly. I really do not think that such a fine-grained setting where
>>>   > you could disable, e.g: 100BaseT_Full, but allow 100BaseT_Half to
>>>   > remain usable makes that much sense. In general, Gigabit might be
>>>   > badly broken, but 100 and 10Mbits/sec should work fine. How about the
>>>   > MASTER-SLAVE bit, is overriding it really required?
>>>   >
>>>   > Is not a PHY fixup registered for a specific OUI the solution you are
>>>   > looking for? I am also concerned that this creates PHY
>>> troubleshooting
>>>   > issues much harder to debug than before as we may have no idea about
>>>   > how much information has been put in Device Tree to override that.
>>>   >
>>>   > Finally, how about making this more general just like the BCM87xx PHY
>>>   > driver, which is supplied value/reg pairs directly? There are 16
>>>   > common MII registers, and 16 others for vendor specific registers,
>>>   > this is just covering for about 2% of the possible changes.
>>>
>>> Good point. That would easily help me with my current issue, which
>>> requires autoneg to be disabled to begin with (by clearing BMCR_ANENABLE
>>> from register 0).
>>
>>
>> Is there a point in time (e.g: after some specific initial configuration
>> has
>> been made) where BMCR_ANENABLE can be used?
>
>
> What do you mean? In my case, for some HW-related reason (due to the PHY
> counterpart I guess) autoneg needs to be disabled.
> This is currently done by the bootloader code (which clears the bit).
> What I'm looking for is some way for the kernel to either reinforce this
> setting, or just take that into account and skip autoneg.
> On top of that, there's a HW errata about that particular PHY, which
> requires certain operations to be performed on the PHY as a workaround *WHEN
> AUTONEG IS DISABLED*. That I'd implement on a PHY-specif driver.

Ok.

>
>
>>> This would not however fix it entirely (I tried a quick hardwired
>>> implementation), as the whole PHY machinery would not take that into
>>> account and would re-enable autoneg anyway.
>>> I also tried changing the patch so that phydev->support gets updated
>>
>>
>> There are multiple things that you could try doing here:
>>
>> - override the PHY state machine in your read_status callback to make sure
>> that you always set phydev->autoneg set to AUTONEG_ENABLE
>
>
> [you mean AUTONEG_DISABLE, right?]

Right, I fat fingered here.

> Uhm, but I don't want to implement a driver for that PHY that always
> disables autoneg. I only want to disable autoneg for that particular board.
> I figure I might register a fixup for that board, but that kindof makes
> everything more complicated and less clear. Plus, what should be the
> criterion to determine whether we're running on that particular hardware?

of_machine_is_compatible() plus reading the specific PHY OUI should
provide you with with an unique machine + PHY tuple. If your machine
name is too generic.

>
>
>> - clear the SUPPORTED_Autoneg bits from phydev->supported right after PHY
>> registration and before the call to phy_start()
>
>
> I actually tried clearing it by tweaking the patch on this thread, but the
> end result is that it does not produce any effect (see further comments
> below). Only thing that seems to play a role here is explictly setting
> phydev->autoneg = AUTONEG_DISABLE.
>
>
>> - set the PHY_HAS_MAGICANEG bit in your PHY driver flag
>
>
> Again, this seems to play no role whatsoever here:
>
>                         } else if (0 == phydev->link_timeout--) {
>                                 needs_aneg = 1;
>                                 /* If we have the magic_aneg bit,
>                                  * we try again */
>                                 if (phydev->drv->flags & PHY_HAS_MAGICANEG)
>                                         break;
>                         }
>                         break;
>                 case PHY_NOLINK:
>
> This code might have made sense when it was written in 2006 -- back then,
> the break statement was skipping some fallback code. But now it seems to do
> nothing.
>
>
>>
>>>
>>> (instead of phydev->advertising):
>>>   >> +               if (!of_property_read_u32(np, override->prop, &tmp))
>>> {
>>>   >> +                       if (tmp) {
>>>   >> +                               *val |= override->value;
>>>   >> +                               phydev->advertising |=
>>>
>>> override->supported;
>>>
>>>   >> +                       } else {
>>>   >> +                               phydev->advertising &=
>>>
>>> ~(override->supported);
>>>
>>>   >> +                       }
>>>   >> +
>>>   >> +                       *mask |= override->value;
>>>
>>> What I find weird is that the only way phydev->autoneg could ever be set
>>> to disabled is from here (phy.c):
>>>
>>> static void phy_sanitize_settings(struct phy_device *phydev)
>>> {
>>>         u32 features = phydev->supported;
>>>         int idx;
>>>
>>>         /* Sanitize settings based on PHY capabilities */
>>>         if ((features & SUPPORTED_Autoneg) == 0)
>>>                 phydev->autoneg = AUTONEG_DISABLE;
>>>
>>> which is in turn only called when phydev->autoneg is set to
>>> AUTONEG_DISABLE to begin with:
>>>
>>> int phy_start_aneg(struct phy_device *phydev)
>>> {
>>>         int err;
>>>
>>>         mutex_lock(&phydev->lock);
>>>
>>>         if (AUTONEG_DISABLE == phydev->autoneg)
>>>                 phy_sanitize_settings(phydev);
>>>
>>> So could someone please help me figure out what I'm missing here?
>>
>>
>> At first glance it looks like the PHY driver should be reading the phydev-
>>>
>>> autoneg value when the PHY driver config_aneg() callback is called to be
>>
>> allowed to set the forced speed and settings.
>>
>> The way phy_sanitize_settings() is coded does not make it return a mask of
>> features, but only the forced supported speed and duplex. Then when the
>> link
>> is forced but we are having some issues getting a link status, libphy
>> tries
>> lower speeds and this function is used again to provide the next
>> speed/duplex
>> pair to try.
>>
>
> What I was trying to say is that phy_sanitize_settings() is only called when
> phydev->autoneg == AUTONEG_DISABLE, and in turn it's the only generic
> function setting phydev->autoneg = AUTONEG_DISABLE.
> So perhaps the condition should read:
>
> -       if (AUTONEG_DISABLE == phydev->autoneg)
> +       if ((features & SUPPORTED_Autoneg) == 0)
>                 phy_sanitize_settings(phydev);
>
> Or else, some other parts of the generic code should take care of setting it
> to AUTONEG_DISABLE, depending on whether the feature is supported or not.
> What I found weird is explicitly setting a value (phydev->autoneg =
> AUTONEG_DISABLE), from a static function which is only called when that
> condition is already true.

I do not think that this change is correct either, let me cook a patch
for you to allow disabling autoneg from the start.

>
> BTW, I feel like disabling autoneg from the start has never been a use case
> before, am I right?

Not really no, and that is because most hardware does not need quirks
to work correctly.

>
> Thanks!
> Gerlando



-- 
Florian
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* [PATCH 3/3] net: GSO encapsulation for IP packets
From: Tom Herbert @ 2014-02-11 17:43 UTC (permalink / raw)
  To: davem, netdev; +Cc: ogerlitz

The UDP GSO code assume that only encapsulated packets are Ethernet
frames. This patch fixes that so that we can support IP protocol
encpasulation (GUE, GRE/UDP, etc.)

We overload the inner_protocol field in the skb to store either the
Ethertype or the IP protocol (latter is indicated by ip_encapsulation
bit). As far as I can tell this should not adversely affect preexiting
uses for inner_protocol.

Signed-off-by: Tom Herbert <therbert@google.com>
---
 include/linux/skbuff.h |  8 ++++++--
 net/core/skbuff.c      |  1 +
 net/ipv4/udp.c         | 12 +++++++++++-
 3 files changed, 18 insertions(+), 3 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 1f689e6..757ed39 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -512,7 +512,11 @@ struct sk_buff {
 	 * headers if needed
 	 */
 	__u8			encapsulation:1;
-	/* 6/8 bit hole (depending on ndisc_nodetype presence) */
+	/* skbuf encpasulates an IP packet, inner_protocol should be
+	 * interpreted as an IP protocol, encapsulation bit is also set
+	 */
+	__u8			ip_encapsulation:1;
+	/* 5/7 bit hole (depending on ndisc_nodetype presence) */
 	kmemcheck_bitfield_end(flags2);
 
 #if defined CONFIG_NET_DMA || defined CONFIG_NET_RX_BUSY_POLL
@@ -530,7 +534,7 @@ struct sk_buff {
 		__u32		reserved_tailroom;
 	};
 
-	__be16			inner_protocol;
+	__u16			inner_protocol;
 	__u16			inner_transport_header;
 	__u16			inner_network_header;
 	__u16			inner_mac_header;
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 8f519db..64c6190 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -687,6 +687,7 @@ static void __copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
 	new->ooo_okay		= old->ooo_okay;
 	new->no_fcs		= old->no_fcs;
 	new->encapsulation	= old->encapsulation;
+	new->ip_encapsulation	= old->ip_encapsulation;
 #ifdef CONFIG_XFRM
 	new->sp			= secpath_get(old->sp);
 #endif
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 77bd16f..48d8cb2 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -2497,7 +2497,17 @@ struct sk_buff *skb_udp_tunnel_segment(struct sk_buff *skb,
 
 	/* segment inner packet. */
 	enc_features = skb->dev->hw_enc_features & netif_skb_features(skb);
-	segs = skb_mac_gso_segment(skb, enc_features);
+
+	if (skb->ip_encapsulation) {
+		const struct net_offload *ops;
+		ops = rcu_dereference(inet_offloads[skb->inner_protocol]);
+		if (likely(ops && ops->callbacks.gso_segment))
+			segs = ops->callbacks.gso_segment(skb, enc_features);
+	} else {
+		skb->protocol = htons(ETH_P_TEB);
+		segs = skb_mac_gso_segment(skb, enc_features);
+	}
+
 	if (!segs || IS_ERR(segs)) {
 		skb_gso_error_unwind(skb, protocol, tnl_hlen, mac_offset,
 				     mac_len);
-- 
1.9.0.rc1.175.g0b1dcb5

^ permalink raw reply related

* [PATCH 1/3] net: Fix GRE RX to use skb_transport_header for GRE
From: Tom Herbert @ 2014-02-11 17:43 UTC (permalink / raw)
  To: davem, netdev; +Cc: ogerlitz

GRE assumes that the GRE header is at skb_network_header +
ip_hrdlen(skb). It is more general to use skb_transport_header
and this allows the possbility of inserting additional header
between IP and GRE (which is what we will done in Generic UDP
Encapsulation for GRE).

Signed-off-by: Tom Herbert <therbert@google.com>
---
 net/ipv4/gre_demux.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/gre_demux.c b/net/ipv4/gre_demux.c
index 1863422f..6162269 100644
--- a/net/ipv4/gre_demux.c
+++ b/net/ipv4/gre_demux.c
@@ -118,7 +118,6 @@ static __sum16 check_checksum(struct sk_buff *skb)
 static int parse_gre_header(struct sk_buff *skb, struct tnl_ptk_info *tpi,
 			    bool *csum_err)
 {
-	unsigned int ip_hlen = ip_hdrlen(skb);
 	const struct gre_base_hdr *greh;
 	__be32 *options;
 	int hdr_len;
@@ -126,7 +125,7 @@ static int parse_gre_header(struct sk_buff *skb, struct tnl_ptk_info *tpi,
 	if (unlikely(!pskb_may_pull(skb, sizeof(struct gre_base_hdr))))
 		return -EINVAL;
 
-	greh = (struct gre_base_hdr *)(skb_network_header(skb) + ip_hlen);
+	greh = (struct gre_base_hdr *)skb_transport_header(skb);
 	if (unlikely(greh->flags & (GRE_VERSION | GRE_ROUTING)))
 		return -EINVAL;
 
@@ -136,7 +135,7 @@ static int parse_gre_header(struct sk_buff *skb, struct tnl_ptk_info *tpi,
 	if (!pskb_may_pull(skb, hdr_len))
 		return -EINVAL;
 
-	greh = (struct gre_base_hdr *)(skb_network_header(skb) + ip_hlen);
+	greh = (struct gre_base_hdr *)skb_transport_header(skb);
 	tpi->proto = greh->protocol;
 
 	options = (__be32 *)(greh + 1);
-- 
1.9.0.rc1.175.g0b1dcb5

^ permalink raw reply related

* Re: Bonding
From: Jay Vosburgh @ 2014-02-11 17:55 UTC (permalink / raw)
  To: Gustavo Pimentel; +Cc: Veaceslav Falico, netdev@vger.kernel.org
In-Reply-To: <8532C3BD1ECBC64BA47CE43AFED6D38EC37E5B04@S103.efacec.pt>

Gustavo Pimentel <gustavo.pimentel@efacec.com> wrote:

>Hi Veaceslav,
>
>It's quite different from broadcast mode, each frame sent through the slaves has attached a Redundancy Control Trailer also known as RCT (this trailer is compose by a LAN identifier, sequence number, a LSDU size and a PRP suffix).
>Also the equipment with PRP capability has to send periodically a supervision frame to both similar LANs. Each device on the network has to keep track of receive sequence numbers received, if the received a sequence number for instance from LAN A of specific device and it doesn't exist on internal table, the device should accept the frame and update the internal table. When receiving the same sequence number from the LAN B, the device should discard it, providing zero downtime redundancy.
>
>I can supply you information about this redundancy protocol, if you like. This type of network redundancy is now being large deployed on electrical power stations (like thermal and hydro) and transmission power stations instead of teaming / bonding that depends on RSTP for redundancy.

	Are you aware that there is already an implementation of HSR
(High-availability Seamless Redundancy) in the linux kernel?  I believe
HSR and PRP are defined by the same standard (IEC 62439-3), and are
similar enough to interoperate to some degree.  Perhaps PRP would be
better implemented as a variant within the existing net/hsr/ framework.

	-J


>> -----Original Message-----
>> From: Veaceslav Falico [mailto:vfalico@redhat.com]
>> Sent: terça-feira, 11 de Fevereiro de 2014 14:16
>> To: Gustavo Pimentel
>> Cc: netdev@vger.kernel.org
>> Subject: Re: Bonding
>> 
>> On Tue, Feb 11, 2014 at 01:53:32PM +0000, Gustavo Pimentel wrote:
>> >Hi,
>> 
>> Hi Gustavo,
>> 
>> >
>> >I'm writing you because because I'm have implemented a new mode (PRP Parallel
>> Redundancy Protocol) for bonding kernel driver. This new mode is quite simple, I
>> don't know if you have heard about PRP, but it's a new standard that allows to
>> overcome any single network failure without affecting the data transmission. The
>> general idea resides on having two separate LAN (A & B) very similar and
>> transmitting the almost the same frame through both LANs and the end device
>> should accept one frame and discard the other according to a known mechanism.
>> 
>> Isn't that the current 'broadcast' mode, where every packet is transmitted over all
>> the slaves? After quick googling/reading I don't see any difference there, though I
>> might have missed something.
>> 
>> >
>> >I have implemented this new mode on bonding driver, but I have some
>> difficulties:
>> >. Writing linux driver is quite new for me. I don't' know if exists guide lines for
>> driver coding.
>> 
>> You can find everything under Documentation/, but without the code I can't tell you
>> exact documents. CodingStyle and SubmittingPatches might be the first ones.
>> 
>> Also, try CC-ing relevant people for more feedback, specifically bonding
>> maintainers.
>> 
>> >. I don't know how to submit the code to be include on kernel repository.
>> >. Maybe another pair of eyes could find help to improve the writing code for this
>> mode.
>> 
>> Try sending an RFC when net-next opens.
>> 
>> >
>> >I think my driver code is 99% complete. I'm currently testing with 3 equipments (1
>> pc + 1 embedded device running both my modify bonding driver) and a third party
>> equipment called RedBox.
>> >
>> >Would you be interested in participating / helping this project?
>> >
>> >With my best regards,
>> >
>> >Gustavo Gama da Rocha Pimentel
>> >Power Systems Automation / Innovation & Development Efacec Engenharia e
>> >Sistemas, S.A.
>> >Phone: +351229403391
>> >Disclaimer

---
	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox