Netdev List

Netdev List
 help / color / mirror / Atom feed

* Fwd: Linux bridge for route
From: tingwei liu @ 2014-01-17  9:14 UTC (permalink / raw)
  To: netfilter-devel, netfilter, netdev; +Cc: tingwei liu
In-Reply-To: <CA+qZnSRGKiFh8FiKMpVQku7CiWCMPq47FpPwrawvTC6KBPwz7A@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 821 bytes --]

Dear all,

       There is  a question has puzzled me for a long time.
       You can find the topology from attachment.

       Normal traffic is:

               PC(192.168.1.8)--->Bridge(eth0)--->Bridget(eth1)--->NAT
server-->switch--->Server(192.168.5.3)

       Now I want the ssh traffic like this:
              PC(182.168.1.8)--->Bridge(eth0)--->eth2--->NAT
server--->switch--->Server(192.168.5.3)


      What I have done on LINUX Server:
              #net.bridge.bridge-nf-call-iptables = 1
              #iptables -t nat -A POSTROUTING -s 192.168.1.8 -p tcp
--dport 22 -j SNAT --to-source 192.168.5.2

      I have find the rule matched through command "iptables -t nat
-nvL", but the packets doesn't sent to 192.168.5.3.
      and "tcpdump -i eth2 tcp port 22" can not capture any packet!

Thanks very much!

[-- Attachment #2: topology.jpg --]
[-- Type: image/jpeg, Size: 84648 bytes --]

^ permalink raw reply

* [net-next 0/3] Intel Wired LAN Driver Updates
From: Aaron Brown @ 2014-01-17  9:21 UTC (permalink / raw)
  To: davem; +Cc: Aaron Brown, netdev, gospo, sassmann

This series contains an updates to ixgbe and ixgbevf.

Jacob add braces around some ixgbe_qv_lock_* calls lto better adhere 
to Kernel style guidelines.  Don bumps the versions on ixgbe and
ixgbevf to match internal driver functionality better.

Don Skidmore (2):
  ixgbe: bump version number
  ixgbevf: bump version

Jacob Keller (1):
  ixgbe: add braces around else condition in ixgbe_qv_lock_* calls

 drivers/net/ethernet/intel/ixgbe/ixgbe.h          | 6 ++++--
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c     | 2 +-
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 2 +-
 3 files changed, 6 insertions(+), 4 deletions(-)

-- 
1.8.5.GIT

^ permalink raw reply

* [net-next 2/3] ixgbe: bump version number
From: Aaron Brown @ 2014-01-17  9:21 UTC (permalink / raw)
  To: davem; +Cc: Don Skidmore, netdev, gospo, sassmann, Aaron Brown
In-Reply-To: <1389950498-8820-1-git-send-email-aaron.f.brown@intel.com>

From: Don Skidmore <donald.c.skidmore@intel.com>

Bump the version number to better match functionality provided with out
of tree driver of the same version.

Signed-off-by: Don Skidmore <donald.c.skidmore@intel.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: Aaron Brown <aaron.f.brown@intel.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index b445ad1..72e2035 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -64,7 +64,7 @@ char ixgbe_default_device_descr[] =
 static char ixgbe_default_device_descr[] =
 			      "Intel(R) 10 Gigabit Network Connection";
 #endif
-#define DRV_VERSION "3.15.1-k"
+#define DRV_VERSION "3.19.1-k"
 const char ixgbe_driver_version[] = DRV_VERSION;
 static const char ixgbe_copyright[] =
 				"Copyright (c) 1999-2013 Intel Corporation.";
-- 
1.8.5.GIT

^ permalink raw reply related

* [net-next 1/3] ixgbe: add braces around else condition in ixgbe_qv_lock_* calls
From: Aaron Brown @ 2014-01-17  9:21 UTC (permalink / raw)
  To: davem; +Cc: Jacob Keller, netdev, gospo, sassmann, Aaron Brown
In-Reply-To: <1389950498-8820-1-git-send-email-aaron.f.brown@intel.com>

From: Jacob Keller <jacob.e.keller@intel.com>

This patch adds braces around the ixgbe_qv_lock_* calls which previously only
had braces around the if portion. Kernel style guidelines for this require
parenthesis around all conditions if they are required around one. In addition
the comment while not illegal C syntax makes the code look wrong at a cursory
glance. This patch corrects the style and adds braces so that the full if-else
block is uniform.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: Aaron Brown <aaron.f.brown@intel.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe.h | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
index 3a4373f..0186ea2 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
@@ -424,9 +424,10 @@ static inline bool ixgbe_qv_lock_napi(struct ixgbe_q_vector *q_vector)
 #ifdef BP_EXTENDED_STATS
 		q_vector->tx.ring->stats.yields++;
 #endif
-	} else
+	} else {
 		/* we don't care if someone yielded */
 		q_vector->state = IXGBE_QV_STATE_NAPI;
+	}
 	spin_unlock_bh(&q_vector->lock);
 	return rc;
 }
@@ -458,9 +459,10 @@ static inline bool ixgbe_qv_lock_poll(struct ixgbe_q_vector *q_vector)
 #ifdef BP_EXTENDED_STATS
 		q_vector->rx.ring->stats.yields++;
 #endif
-	} else
+	} else {
 		/* preserve yield marks */
 		q_vector->state |= IXGBE_QV_STATE_POLL;
+	}
 	spin_unlock_bh(&q_vector->lock);
 	return rc;
 }
-- 
1.8.5.GIT

^ permalink raw reply related

* [net-next 3/3] ixgbevf: bump version
From: Aaron Brown @ 2014-01-17  9:21 UTC (permalink / raw)
  To: davem; +Cc: Don Skidmore, netdev, gospo, sassmann, Aaron Brown
In-Reply-To: <1389950498-8820-1-git-send-email-aaron.f.brown@intel.com>

From: Don Skidmore <donald.c.skidmore@intel.com>

Bump the version number to better match functionality provided with out of
tree driver of the same version.

Signed-off-by: Don Skidmore <donald.c.skidmore@intel.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: Aaron Brown <aaron.f.brown@intel.com>
---
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index 6cf4120..9c92918 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -58,7 +58,7 @@ const char ixgbevf_driver_name[] = "ixgbevf";
 static const char ixgbevf_driver_string[] =
 	"Intel(R) 10 Gigabit PCI Express Virtual Function Network Driver";
 
-#define DRV_VERSION "2.11.3-k"
+#define DRV_VERSION "2.12.1-k"
 const char ixgbevf_driver_version[] = DRV_VERSION;
 static char ixgbevf_copyright[] =
 	"Copyright (c) 2009 - 2012 Intel Corporation.";
-- 
1.8.5.GIT

^ permalink raw reply related

* [PATCH net-next] virtio-net: fix build error when CONFIG_AVERAGE is not enabled
From: Michael Dalton @ 2014-01-17  9:27 UTC (permalink / raw)
  To: David S. Miller
  Cc: Michael Dalton, Michael S. Tsirkin, netdev, virtualization,
	Eric Dumazet

Commit ab7db91705e9 ("virtio-net: auto-tune mergeable rx buffer size for
improved performance") introduced a virtio-net dependency on EWMA.
The inclusion of EWMA is controlled by CONFIG_AVERAGE. Fix build error
when CONFIG_AVERAGE is not enabled by adding select AVERAGE to
virtio-net's Kconfig entry.

Build failure reported using config make ARCH=s390 defconfig.

Signed-off-by: Michael Dalton <mwdalton@google.com>
---
 drivers/net/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index b45b240..f342278 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -236,6 +236,7 @@ config VETH
 config VIRTIO_NET
 	tristate "Virtio network driver"
 	depends on VIRTIO
+	select AVERAGE
 	---help---
 	  This is the virtual network driver for virtio.  It can be used with
 	  lguest or QEMU based VMMs (like KVM or Xen).  Say Y or M.
-- 
1.8.5.2

^ permalink raw reply related

* RE: [PATCH 11/13] net: mvneta: implement rx_copybreak
From: David Laight @ 2014-01-17  9:28 UTC (permalink / raw)
  To: 'Willy Tarreau', David Miller
  Cc: netdev@vger.kernel.org, thomas.petazzoni@free-electrons.com,
	gregory.clement@free-electrons.com
In-Reply-To: <20140116200757.GA19969@1wt.eu>

From: Willy Tarreau 
..
> Ah that's an interesting trick! We don't have an IOMMU on this platform
> so the call is cheap. However, it relies on an I/O barrier to wait for
> a cache snoop completion. From what I read, it's not needed when going
> to the device. But when going to the CPU for the Rx case, we're calling
> it for every packet which is counter productive because I'd like to do
> it only once when entering the rx loop and avoid any other call during
> the loop. I measured that I could save 300 ns per packet by doing this
> (thus slightly modifying the DMA API to add a new dma_io_barrier function).
> 
> I think it could be interesting (at least for this platform, I don't know
> if other platforms work with cache coherent DMA which only require to
> verify end of snooping), to have two distinct set of DMA calls, something
> like this :
> 
>     rx_loop(napi, budget)
>     {
>          dma_wait_for_snoop_completion(dev);
>          ...
>          for_each_rx_packet(pkt) {
>              dma_sync_single_range_for_cpu_unless_completed(dev, pkt->addr);
>              /* use packet */
>          }
>     }

You'd need to scan the rx ring for completed entries before the waiting
for the snoop to complete - and then only process those entries.
(Or read a block of rx ring entries...)

Otherwise you might try to process a rx that finished after you
waited for the snoop to complete.

	David

^ permalink raw reply

* RE: [PATCH 11/13] net: mvneta: implement rx_copybreak
From: David Laight @ 2014-01-17  9:32 UTC (permalink / raw)
  To: 'David Miller', w@1wt.eu
  Cc: netdev@vger.kernel.org, thomas.petazzoni@free-electrons.com,
	gregory.clement@free-electrons.com
In-Reply-To: <20140116.114905.1582218754259744747.davem@davemloft.net>

From: David Miller
> I don't know if this is relevant for your hardware, but there is a trick
> with IOMMUs that I at least at one point was doing on sparc64.
> 
> I never flushed the IOMMU mappings explicitly on a dma_unmap call.
> 
> Instead I always allocate by moving the allocation cursor to higher
> addresses in the IOMMU range looking for free space.
> 
> Then when I hit the end of the range and had to wrap the allocation
> cursor back to the beginning, I flush the entire IOMMU.
> 
> So for %99 of IOMMU mapping creation and teardown calls, I didn't even
> touch the hardware.

Another option that ought to work is splitting the allocation of
the io address space and the mapping of the addresses to physical
memory.
So for rx you'd leave the rx ring pointers constant and change the
iommu to refer to the correct buffer.

	David

^ permalink raw reply

* [PATCH net] net: core: orphan frags before queuing to slow qdisc
From: Jason Wang @ 2014-01-17  9:42 UTC (permalink / raw)
  To: davem, netdev, linux-kernel; +Cc: Jason Wang, Michael S. Tsirkin

Many qdiscs can queue a packet for a long time, this will lead an issue
with zerocopy skb. It means the frags will not be orphaned in an expected
short time, this breaks the assumption that virtio-net will transmit the
packet in time.

So if guest packets were queued through such kind of qdisc and hit the
limitation of the max pending packets for virtio/vhost. All packets that
go to another destination from guest will also be blocked.

A case for reproducing the issue:

- Boot two VMs and connect them to the same bridge kvmbr.
- Setup tbf with a very low rate/burst on eth0 which is a port of kvmbr.
- Let VM1 send lots of packets thorugh eth0
- After a while, VM1 is unable to send any packets out since the number of
  pending packets (queued to tbf) were exceeds the limitation of vhost/virito

Solve this issue by orphaning the frags before queuing it to a slow qdisc (the
one without TCQ_F_CAN_BYPASS).

Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 net/core/dev.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/net/core/dev.c b/net/core/dev.c
index 0ce469e..1209774 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2700,6 +2700,12 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 	contended = qdisc_is_running(q);
 	if (unlikely(contended))
 		spin_lock(&q->busylock);
+	if (!(q->flags & TCQ_F_CAN_BYPASS) &&
+	    unlikely(skb_orphan_frags(skb, GFP_ATOMIC))) {
+		kfree_skb(skb);
+		rc = NET_XMIT_DROP;
+		goto out;
+	}

 	spin_lock(root_lock);
 	if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
@@ -2739,6 +2745,7 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 		}
 	}
 	spin_unlock(root_lock);
+out:
 	if (unlikely(contended))
 		spin_unlock(&q->busylock);
 	return rc;
-- 
1.8.3.2

^ permalink raw reply related

* Re: [PATCH 11/13] net: mvneta: implement rx_copybreak
From: Willy Tarreau @ 2014-01-17  9:48 UTC (permalink / raw)
  To: David Laight
  Cc: David Miller, netdev@vger.kernel.org,
	thomas.petazzoni@free-electrons.com,
	gregory.clement@free-electrons.com
In-Reply-To: <063D6719AE5E284EB5DD2968C1650D6D45EA2B@AcuExch.aculab.com>

On Fri, Jan 17, 2014 at 09:28:13AM +0000, David Laight wrote:
> From: Willy Tarreau 
> ..
> > Ah that's an interesting trick! We don't have an IOMMU on this platform
> > so the call is cheap. However, it relies on an I/O barrier to wait for
> > a cache snoop completion. From what I read, it's not needed when going
> > to the device. But when going to the CPU for the Rx case, we're calling
> > it for every packet which is counter productive because I'd like to do
> > it only once when entering the rx loop and avoid any other call during
> > the loop. I measured that I could save 300 ns per packet by doing this
> > (thus slightly modifying the DMA API to add a new dma_io_barrier function).
> > 
> > I think it could be interesting (at least for this platform, I don't know
> > if other platforms work with cache coherent DMA which only require to
> > verify end of snooping), to have two distinct set of DMA calls, something
> > like this :
> > 
> >     rx_loop(napi, budget)
> >     {
> >          dma_wait_for_snoop_completion(dev);
> >          ...
> >          for_each_rx_packet(pkt) {
> >              dma_sync_single_range_for_cpu_unless_completed(dev, pkt->addr);
> >              /* use packet */
> >          }
> >     }
> 
> You'd need to scan the rx ring for completed entries before the waiting
> for the snoop to complete - and then only process those entries.
> (Or read a block of rx ring entries...)
> 
> Otherwise you might try to process a rx that finished after you
> waited for the snoop to complete.

That's already the case, the number of descs to use is computed first
thing when entering the function. That's why it's already safe.

Willy

^ permalink raw reply

* RE: [PATCH RFC 1/6] net: rfkill: gpio: fix gpio name buffer size off by 1
From: David Laight @ 2014-01-17  9:46 UTC (permalink / raw)
  To: 'Chen-Yu Tsai', Johannes Berg, David S. Miller
  Cc: netdev@vger.kernel.org, devicetree@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-sunxi@googlegroups.com, Maxime Ripard,
	linux-wireless@vger.kernel.org
In-Reply-To: <1389941251-32692-2-git-send-email-wens@csie.org>

From: Chen-Yu Tsai
> snprintf should be passed the complete size of the buffer, including
> the space for '\0'. The previous code resulted in the *_reset and
> *_shutdown strings being truncated.
...
> diff --git a/net/rfkill/rfkill-gpio.c b/net/rfkill/rfkill-gpio.c
...
> -	snprintf(rfkill->reset_name, len + 6 , "%s_reset", rfkill->name);
> -	snprintf(rfkill->shutdown_name, len + 9, "%s_shutdown", rfkill->name);
> +	snprintf(rfkill->reset_name, len + 7 , "%s_reset", rfkill->name);
> +	snprintf(rfkill->shutdown_name, len + 10, "%s_shutdown", rfkill->name);

I can't find the context for the above, but they look very dubious.
I'd expect: snprintf(foo, sizeof foo, ...).
If you are trying to truncate rfkill->name you need to use %.*s.

	David

^ permalink raw reply

* Re: [PATCH net-next] virtio-net: fix build error when CONFIG_AVERAGE is not enabled
From: Jason Wang @ 2014-01-17  9:51 UTC (permalink / raw)
  To: Michael Dalton, David S. Miller
  Cc: netdev, Eric Dumazet, virtualization, Michael S. Tsirkin
In-Reply-To: <1389950828-24039-1-git-send-email-mwdalton@google.com>

On 01/17/2014 05:27 PM, Michael Dalton wrote:
> Commit ab7db91705e9 ("virtio-net: auto-tune mergeable rx buffer size for
> improved performance") introduced a virtio-net dependency on EWMA.
> The inclusion of EWMA is controlled by CONFIG_AVERAGE. Fix build error
> when CONFIG_AVERAGE is not enabled by adding select AVERAGE to
> virtio-net's Kconfig entry.
>
> Build failure reported using config make ARCH=s390 defconfig.
>
> Signed-off-by: Michael Dalton <mwdalton@google.com>
> ---
>  drivers/net/Kconfig | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
> index b45b240..f342278 100644
> --- a/drivers/net/Kconfig
> +++ b/drivers/net/Kconfig
> @@ -236,6 +236,7 @@ config VETH
>  config VIRTIO_NET
>  	tristate "Virtio network driver"
>  	depends on VIRTIO
> +	select AVERAGE
>  	---help---
>  	  This is the virtual network driver for virtio.  It can be used with
>  	  lguest or QEMU based VMMs (like KVM or Xen).  Say Y or M.

Acked-by: Jason Wang <jasowang@redhat.com>

^ permalink raw reply

* RE: [PATCH net-next v2 2/3] net: add trim helper and convert users
From: David Laight @ 2014-01-17  9:52 UTC (permalink / raw)
  To: 'Hannes Frederic Sowa', netdev@vger.kernel.org
  Cc: Daniel Borkmann, Jakub Zawadzki, Eric Dumazet,
	linux-kernel@vger.kernel.org
In-Reply-To: <1389918519-23779-3-git-send-email-hannes@stressinduktion.org>

From: Hannes Frederic Sowa
...
> +/**
> + * trim - perform a reciprocal multiplication in order to "clamp" a
> + *        value into range [0, ep_ro), where the upper interval
> + *        endpoint is right-open. This is useful, e.g. for accessing
> + *        a index of an array containing ep_ro elements, for example.
> + *        Think of it as sort of modulus, only that the result isn't
> + *        that of modulo. ;)
> + *        More: http://homepage.cs.uiowa.edu/~jones/bcd/divide.html

It isn't appropriate to put urls into code comments (or commit messages).
They are very likely to get out of date.

	David

^ permalink raw reply

* Re: [PATCH net-next] bonding: trivial: rename slave->jiffies to ->last_link_up
From: Veaceslav Falico @ 2014-01-17  9:53 UTC (permalink / raw)
  To: netdev; +Cc: Jay Vosburgh, Andy Gospodarek
In-Reply-To: <1389838837-10283-1-git-send-email-vfalico@redhat.com>

On Thu, Jan 16, 2014 at 03:20:37AM +0100, Veaceslav Falico wrote:
>slave->jiffies is updated every time the slave becomes active, which, for
>bonding, means that its link is 'up'.
>
>CC: Jay Vosburgh <fubar@us.ibm.com>
>CC: Andy Gospodarek <andy@greyhouse.net>
>Signed-off-by: Veaceslav Falico <vfalico@redhat.com>

Self-NAK, as the first patchset was dropped and will be sent as v2, I'll
just slip this patch in there.

Sorry for the noise.

>---
>
>Notes:
>    On top of
>
>    [PATCH net-next 0/6] bonding: only rely on arp packets if arp monitor is used
>
>    Sorry for the trivial patch, but I'm always loosing some time
>    trying to remember what that actually means.
>
> drivers/net/bonding/bond_main.c | 20 ++++++++++----------
> drivers/net/bonding/bonding.h   |  3 ++-
> 2 files changed, 12 insertions(+), 11 deletions(-)
>
>diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>index f9e5512..0f613ae 100644
>--- a/drivers/net/bonding/bond_main.c
>+++ b/drivers/net/bonding/bond_main.c
>@@ -846,7 +846,7 @@ void bond_change_active_slave(struct bonding *bond, struct slave *new_active)
> 		return;
>
> 	if (new_active) {
>-		new_active->jiffies = jiffies;
>+		new_active->last_link_up = jiffies;
>
> 		if (new_active->link == BOND_LINK_BACK) {
> 			if (USES_PRIMARY(bond->params.mode)) {
>@@ -1488,7 +1488,7 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev)
> 	}
>
> 	if (new_slave->link != BOND_LINK_DOWN)
>-		new_slave->jiffies = jiffies;
>+		new_slave->last_link_up = jiffies;
> 	pr_debug("Initial state of slave_dev is BOND_LINK_%s\n",
> 		new_slave->link == BOND_LINK_DOWN ? "DOWN" :
> 			(new_slave->link == BOND_LINK_UP ? "UP" : "BACK"));
>@@ -1923,7 +1923,7 @@ static int bond_miimon_inspect(struct bonding *bond)
> 				 * recovered before downdelay expired
> 				 */
> 				slave->link = BOND_LINK_UP;
>-				slave->jiffies = jiffies;
>+				slave->last_link_up = jiffies;
> 				pr_info("%s: link status up again after %d ms for interface %s.\n",
> 					bond->dev->name,
> 					(bond->params.downdelay - slave->delay) *
>@@ -1998,7 +1998,7 @@ static void bond_miimon_commit(struct bonding *bond)
>
> 		case BOND_LINK_UP:
> 			slave->link = BOND_LINK_UP;
>-			slave->jiffies = jiffies;
>+			slave->last_link_up = jiffies;
>
> 			if (bond->params.mode == BOND_MODE_8023AD) {
> 				/* prevent it from being the active one */
>@@ -2345,7 +2345,7 @@ int bond_arp_rcv(const struct sk_buff *skb, struct bonding *bond,
> 		bond_validate_arp(bond, slave, sip, tip);
> 	else if (bond->curr_active_slave &&
> 		 time_after(slave_last_arp_rx(bond, bond->curr_active_slave),
>-			    bond->curr_active_slave->jiffies))
>+			    bond->curr_active_slave->last_link_up))
> 		bond_validate_arp(bond, slave, tip, sip);
>
> out_unlock:
>@@ -2392,9 +2392,9 @@ static void bond_loadbalance_arp_mon(struct work_struct *work)
> 	oldcurrent = ACCESS_ONCE(bond->curr_active_slave);
> 	/* see if any of the previous devices are up now (i.e. they have
> 	 * xmt and rcv traffic). the curr_active_slave does not come into
>-	 * the picture unless it is null. also, slave->jiffies is not needed
>-	 * here because we send an arp on each slave and give a slave as
>-	 * long as it needs to get the tx/rx within the delta.
>+	 * the picture unless it is null. also, slave->last_link_up is not
>+	 * needed here because we send an arp on each slave and give a slave
>+	 * as long as it needs to get the tx/rx within the delta.
> 	 * TODO: what about up/down delay in arp mode? it wasn't here before
> 	 *       so it can wait
> 	 */
>@@ -2516,7 +2516,7 @@ static int bond_ab_arp_inspect(struct bonding *bond)
> 		 * active.  This avoids bouncing, as the last receive
> 		 * times need a full ARP monitor cycle to be updated.
> 		 */
>-		if (bond_time_in_interval(bond, slave->jiffies, 2))
>+		if (bond_time_in_interval(bond, slave->last_link_up, 2))
> 			continue;
>
> 		/*
>@@ -2709,7 +2709,7 @@ static void bond_ab_arp_probe(struct bonding *bond)
> 	new_slave->link = BOND_LINK_BACK;
> 	bond_set_slave_active_flags(new_slave);
> 	bond_arp_send_all(bond, new_slave);
>-	new_slave->jiffies = jiffies;
>+	new_slave->last_link_up = jiffies;
> 	rcu_assign_pointer(bond->current_arp_slave, new_slave);
> }
>
>diff --git a/drivers/net/bonding/bonding.h b/drivers/net/bonding/bonding.h
>index 99126b2..9f07af1 100644
>--- a/drivers/net/bonding/bonding.h
>+++ b/drivers/net/bonding/bonding.h
>@@ -184,7 +184,8 @@ struct slave {
> 	struct net_device *dev; /* first - useful for panic debug */
> 	struct bonding *bond; /* our master */
> 	int    delay;
>-	unsigned long jiffies;
>+	/* all three in jiffies */
>+	unsigned long last_link_up;
> 	unsigned long last_arp_rx;
> 	unsigned long target_last_arp_rx[BOND_MAX_ARP_TARGETS];
> 	s8     link;    /* one of BOND_LINK_XXXX */
>-- 
>1.8.4
>

^ permalink raw reply

* Re: [PATCH RFC 1/6] net: rfkill: gpio: fix gpio name buffer size off by 1
From: Chen-Yu Tsai @ 2014-01-17  9:59 UTC (permalink / raw)
  To: David Laight
  Cc: Johannes Berg, David S. Miller,
	netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	devicetree-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r@public.gmane.org,
	linux-sunxi-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org,
	Maxime Ripard,
	linux-wireless-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <063D6719AE5E284EB5DD2968C1650D6D45EA9D-VkEWCZq2GCInGFn1LkZF6NBPR1lH4CV8@public.gmane.org>

On Fri, Jan 17, 2014 at 5:46 PM, David Laight <David.Laight-JxhZ9S5GRejQT0dZR+AlfA@public.gmane.org> wrote:
> From: Chen-Yu Tsai
>> snprintf should be passed the complete size of the buffer, including
>> the space for '\0'. The previous code resulted in the *_reset and
>> *_shutdown strings being truncated.
> ...
>> diff --git a/net/rfkill/rfkill-gpio.c b/net/rfkill/rfkill-gpio.c
> ...
>> -     snprintf(rfkill->reset_name, len + 6 , "%s_reset", rfkill->name);
>> -     snprintf(rfkill->shutdown_name, len + 9, "%s_shutdown", rfkill->name);
>> +     snprintf(rfkill->reset_name, len + 7 , "%s_reset", rfkill->name);
>> +     snprintf(rfkill->shutdown_name, len + 10, "%s_shutdown", rfkill->name);
>
> I can't find the context for the above, but they look very dubious.
> I'd expect: snprintf(foo, sizeof foo, ...).
> If you are trying to truncate rfkill->name you need to use %.*s.

The driver allocates these buffers on the fly, a few lines above:

        len = strlen(rfkill->name);
        rfkill->reset_name = devm_kzalloc(&pdev->dev, len + 7, GFP_KERNEL);
        rfkill->shutdown_name = devm_kzalloc(&pdev->dev, len + 10, GFP_KERNEL);

I am not trying to truncate rfkill->name. Rather, the buffer length passed
to snprintf was wrong, so the resulting name was truncated by one character.


Thanks,
ChenYu

^ permalink raw reply

* Re: [PATCH net-next v2 2/3] net: add trim helper and convert users
From: Daniel Borkmann @ 2014-01-17 10:06 UTC (permalink / raw)
  To: David Laight
  Cc: 'Hannes Frederic Sowa', netdev@vger.kernel.org,
	Jakub Zawadzki, Eric Dumazet, linux-kernel@vger.kernel.org
In-Reply-To: <063D6719AE5E284EB5DD2968C1650D6D45EAC0@AcuExch.aculab.com>

On 01/17/2014 10:52 AM, David Laight wrote:
> From: Hannes Frederic Sowa
> ...
>> +/**
>> + * trim - perform a reciprocal multiplication in order to "clamp" a
>> + *        value into range [0, ep_ro), where the upper interval
>> + *        endpoint is right-open. This is useful, e.g. for accessing
>> + *        a index of an array containing ep_ro elements, for example.
>> + *        Think of it as sort of modulus, only that the result isn't
>> + *        that of modulo. ;)
>> + *        More: http://homepage.cs.uiowa.edu/~jones/bcd/divide.html
>
> It isn't appropriate to put urls into code comments (or commit messages).
> They are very likely to get out of date.

Then please grep the git log for some references/urls, you'll find plenty.
This link is just a pointer for people interested to read some more
background; there's no need to overly elaborate in the kernel doc right
here.

Thanks !

> 	David
>
>
>

^ permalink raw reply

* RE: [PATCH net-next 2/2] reciprocal_divide: correction/update of the algorithm
From: David Laight @ 2014-01-17 10:05 UTC (permalink / raw)
  To: 'Hannes Frederic Sowa', Christoph Lameter
  Cc: Daniel Borkmann, davem@davemloft.net, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, Eric Dumazet, Austin S Hemmelgarn,
	Jesse Gross, Jamal Hadi Salim, Stephen Hemminger, Matt Mackall,
	Pekka Enberg, Andy Gospodarek, Veaceslav Falico, Jay Vosburgh,
	Jakub Zawadzki
In-Reply-To: <20140116172406.GA17529@order.stressinduktion.org>

From: Hannes Frederic Sowa
> On Thu, Jan 16, 2014 at 10:37:37AM -0600, Christoph Lameter wrote:
> > On Thu, 16 Jan 2014, Daniel Borkmann wrote:
> >
> > > - * or else the performance is slower than a normal divide.
> > > - */
> > > -extern u32 reciprocal_value(u32 B);
> > > +struct reciprocal_value {
> > > +	u32 m;
> > > +	u8 sh1, sh2;
> > > +};
> > >
> > > +#define RECIPROCAL_VALUE_RESULT_TO_ZERO ((struct reciprocal_value){.sh1 = 32})
> > >
> > > -static inline u32 reciprocal_divide(u32 A, u32 R)
> > > +struct reciprocal_value reciprocal_value(u32 d);
> >
> > A function that returns a struct? That works? Which gcc versions support
> > it?
> 
> Sure, that works and I actually like it. This is supported by the c standard,
> but please don't ask me since which one. ;)

I think it has always been supported.
Originally by adding an extra hidden argument that points to an on-stack
structure.
More recent ABI tend to allow 'small' structures be returned in registers.
(Typically ones that would fit in the register(s) used for an integral result).

This doesn't mean that it is a good idea!

OTOH if the function is 'static inline' (and inlined) it probably doesn't matter.

	David


^ permalink raw reply

* Re: [PATCH 4/11]  use ether_addr_equal_64bits
From: Dan Carpenter @ 2014-01-17 10:18 UTC (permalink / raw)
  To: Johannes Berg
  Cc: Henrique de Moraes Holschuh, Julia Lawall,
	kernel-janitors-u79uwXL29TY76Z2rM5mHXA, Emmanuel Grumbach,
	Intel Linux Wireless, John W. Linville,
	linux-wireless-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20140106104802.GN30234@mwanda>

We're worried about reading beyond the end of the array and it's a heap
allocation and the last char of the eth addr is the last byte of the
page.  This causes an oops.

It's almost impossible to hit that bug.

1) You would have to have the eth addr at the end of the array.
2) It would have to be a packed struct.
3) The struct size would have to be a multiple of 4 because otherwise we
   can't put it at the end of the page.
4) It would need to be allocated on the heap.

You add all those up which is pretty rare so I wasn't able to find
anything like that.  Then you have to get extremely unlucky.

The closest thing I could find were a couple places like like:

static struct mac_addr null_mac_addr = { { 0, 0, 0, 0, 0, 0 } };

It meets criteria 1 and 2 but not 3 and 4.

regards,
dan carpenter

--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* [PATCH v2] sch_htb: let skb->priority refer to non-leaf class
From: Harry Mason @ 2014-01-17 10:19 UTC (permalink / raw)
  To: Eric Dumazet, Jamal Hadi Salim; +Cc: linux-netdev
In-Reply-To: <1389889520.31367.403.camel@edumazet-glaptop2.roam.corp.google.com>

If the class in skb->priority is not a leaf, apply filters from the
selected class, not the qdisc. This lets netfilter or user space
partially classify the packet.

Signed-off-by: Harry Mason <harry.mason@smoothwall.net>
---

On Thu, 2014-01-16 at 08:25 -0800, Eric Dumazet wrote:
> On Thu, 2014-01-16 at 14:45 +0000, Harry Mason wrote:
> 
>> + /* Start with inner filter chain if a non-leaf class is selected */
>> + if (cl)
>> +     tcf = cl->filter_list;
>> + else
>> +     tcf = q->filter_list;
> 
> Could this break some existing htb setups ?

I think it is unlikely. Setting skb->priority to a non-leaf class would
be equivalent to setting it to the base qdisc. In theory an application
might rely on this if it expects the classes to be dynamic, but adding
a filter could restore the old behaviour.

To me this is intuitively how it should behave, and reproduces what would
happen if a tc filter instead of netfilter had first assigned the
non-leaf class.

> Also we test cl being NULL at line 222, it would be nice to not
> test it again...

Updated below.

 net/sched/sch_htb.c |   10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/net/sched/sch_htb.c b/net/sched/sch_htb.c
index 717b210..8073d92 100644
--- a/net/sched/sch_htb.c
+++ b/net/sched/sch_htb.c
@@ -219,11 +219,15 @@ static struct htb_class *htb_classify(struct sk_buff *skb, struct Qdisc *sch,
 	if (skb->priority == sch->handle)
 		return HTB_DIRECT;	/* X:0 (direct flow) selected */
 	cl = htb_find(skb->priority, sch);
-	if (cl && cl->level == 0)
-		return cl;
+	if (cl) {
+		if (cl->level == 0)
+			return cl;
+		/* Start with inner filter chain if a non-leaf class is selected */
+		tcf = cl->filter_list;
+	} else
+		tcf = q->filter_list;
 
 	*qerr = NET_XMIT_SUCCESS | __NET_XMIT_BYPASS;
-	tcf = q->filter_list;
 	while (tcf && (result = tc_classify(skb, tcf, &res)) >= 0) {
 #ifdef CONFIG_NET_CLS_ACT
 		switch (result) {
-- 
1.7.10.4

^ permalink raw reply related

* RE: PROBLEM: usbnet / ax88179_178a: Panic in usb_hcd_map_urb_for_dma
From: David Laight @ 2014-01-17 10:38 UTC (permalink / raw)
  To: 'Thomas Kear', netdev@vger.kernel.org
In-Reply-To: <CAHNSM4Kv5gcJn=H4k8QjPggWzjF0zRm_t_icAYXjCTqXx4xL2w@mail.gmail.com>

From: Thomas Kear
> USB3 gigabit ethernet adapters with the ASIX AX88179 chipset (LevelOne
> USB0401-V3, Plugable USB3-E1000, SIIG JU-NE0211-S1 and others) are
> experiencing kernel panics in usb_hcd_map_urb_for_dma since 3.12...

I've worked out why I didn't hit that in my ax8179_18a testing.
I was running with a patch to xhci-ring.c that sends the zero length
packet, so the usbnet code wasn't adding the extra zero byte.

That patch is somewhere in Sarah's 'pending' queue.

	David

^ permalink raw reply

* [PATCH net-next] net: vxlan: do not use vxlan_net before checking event type
From: Daniel Borkmann @ 2014-01-17 11:55 UTC (permalink / raw)
  To: davem; +Cc: netdev, Eric W. Biederman, Jesse Brandeburg, Cong Wang

Jesse Brandeburg reported that commit acaf4e70997f caused a panic
when adding a network namespace while vxlan module was present in
the system:

[<ffffffff814d0865>] vxlan_lowerdev_event+0xf5/0x100
[<ffffffff816e9e5d>] notifier_call_chain+0x4d/0x70
[<ffffffff810912be>] __raw_notifier_call_chain+0xe/0x10
[<ffffffff810912d6>] raw_notifier_call_chain+0x16/0x20
[<ffffffff815d9610>] call_netdevice_notifiers_info+0x40/0x70
[<ffffffff815d9656>] call_netdevice_notifiers+0x16/0x20
[<ffffffff815e1bce>] register_netdevice+0x1be/0x3a0
[<ffffffff815e1dce>] register_netdev+0x1e/0x30
[<ffffffff814cb94a>] loopback_net_init+0x4a/0xb0
[<ffffffffa016ed6e>] ? lockd_init_net+0x6e/0xb0 [lockd]
[<ffffffff815d6bac>] ops_init+0x4c/0x150
[<ffffffff815d6d23>] setup_net+0x73/0x110
[<ffffffff815d725b>] copy_net_ns+0x7b/0x100
[<ffffffff81090e11>] create_new_namespaces+0x101/0x1b0
[<ffffffff81090f45>] copy_namespaces+0x85/0xb0
[<ffffffff810693d5>] copy_process.part.26+0x935/0x1500
[<ffffffff811d5186>] ? mntput+0x26/0x40
[<ffffffff8106a15c>] do_fork+0xbc/0x2e0
[<ffffffff811b7f2e>] ? ____fput+0xe/0x10
[<ffffffff81089c5c>] ? task_work_run+0xac/0xe0
[<ffffffff8106a406>] SyS_clone+0x16/0x20
[<ffffffff816ee689>] stub_clone+0x69/0x90
[<ffffffff816ee329>] ? system_call_fastpath+0x16/0x1b

Apparently loopback device is being registered first and thus we
receive an event notification when vxlan_net is not ready. Hence,
when we call net_generic() and request vxlan_net_id, we seem to
access garbage at that point in time. In setup_net() where we set
up a newly allocated network namespace, we traverse the list of
pernet ops ...

list_for_each_entry(ops, &pernet_list, list) {
	error = ops_init(ops, net);
	if (error < 0)
		goto out_undo;
}

... and loopback_net_init() is invoked first here, so in the middle
of setup_net() we get this notification in vxlan. As currently we
only care about devices that unregister, move access through
net_generic() there. Fix is based on Cong Wang's proposal, but
only changes what is needed here. It sucks a bit as we only work
around the actual cure: right now it seems the only way to check if
a netns actually finished traversing all init ops would be to check
if it's part of net_namespace_list. But that I find quite expensive
each time we go through a notifier callback. Anyway, did a couple
of tests and it seems good for now.

Fixes: acaf4e70997f ("net: vxlan: when lower dev unregisters remove vxlan dev as well")
Reported-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
---
 drivers/net/vxlan.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index a2dee80..d6ec71f 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -2681,10 +2681,12 @@ static int vxlan_lowerdev_event(struct notifier_block *unused,
 				unsigned long event, void *ptr)
 {
 	struct net_device *dev = netdev_notifier_info_to_dev(ptr);
-	struct vxlan_net *vn = net_generic(dev_net(dev), vxlan_net_id);
+	struct vxlan_net *vn;

-	if (event == NETDEV_UNREGISTER)
+	if (event == NETDEV_UNREGISTER) {
+		vn = net_generic(dev_net(dev), vxlan_net_id);
 		vxlan_handle_lowerdev_unregister(vn, dev);
+	}

 	return NOTIFY_DONE;
 }
-- 
1.7.11.7

^ permalink raw reply related

* Re: [Xen-devel] [PATCH net-next v2] xen-netfront: clean up code in xennet_release_rx_bufs
From: Wei Liu @ 2014-01-17 12:08 UTC (permalink / raw)
  To: annie li
  Cc: David Vrabel, wei.liu2, ian.campbell, netdev, xen-devel,
	andrew.bennieston, davem
In-Reply-To: <52D8CCE4.9010804@oracle.com>

On Fri, Jan 17, 2014 at 02:25:40PM +0800, annie li wrote:
> 
> On 2014/1/16 19:10, David Vrabel wrote:
> >On 15/01/14 23:57, Annie Li wrote:
> >>This patch implements two things:
> >>
> >>* release grant reference and skb for rx path, this fixex resource leaking.
> >>* clean up grant transfer code kept from old netfront(2.6.18) which grants
> >>pages for access/map and transfer. But grant transfer is deprecated in current
> >>netfront, so remove corresponding release code for transfer.
> >>
> >>gnttab_end_foreign_access_ref may fail when the grant entry is currently used
> >>for reading or writing. But this patch does not cover this and improvement for
> >>this failure may be implemented in a separate patch.
> >I don't think replacing a resource leak with a security bug is a good idea.
> >
> >If you would prefer not to fix the gnttab_end_foreign_access() call, I
> >think you can fix this in netfront by taking a reference to the page
> >before calling gnttab_end_foreign_access().  This will ensure the page
> >isn't freed until the subsequent kfree_skb(), or the gref is released by
> >the foreign domain (whichever is later).
> 
> Taking a reference to the page before calling
> gnttab_end_foreign_access() delays the free work until kfree_skb().
> Simply adding put_page before kfree_skb() does not make things
> different from gnttab_end_foreign_access_ref(), and the pages will
> be freed by kfree_skb(), problem will be hit in
> gnttab_handle_deferred() when freeing pages which already be freed.
> 

I think David's idea is:

	get_page
	gnttab_end_foreign_access
	kfree_skb

The get_page is to offset put_page in gnttab_end_foreign_access. You
don't need to put page before kfree_skb.

Wei.

> So put_page is required in gnttab_end_foreign_access(), this will
> ensure either free is taken by kfree_skb or gnttab_handle_deferred.
> This involves changes in blkfront/pcifront/tpmfront(just like your
> patch), this way ensure page is released when ref is end.
> 
> Another solution I am thinking is calling
> gnttab_end_foreign_access() with page parameter as NULL, then
> gnttab_end_foreign_access will only do ending grant reference work
> and releasing page work is done by kfree_skb().
> 
> Thanks
> Annie

^ permalink raw reply

* Re: ipv6: default route for link local address is not added while assigning a address
From: Hannes Frederic Sowa @ 2014-01-17 12:16 UTC (permalink / raw)
  To: sohny thomas; +Cc: netdev, linux-kernel, yoshfuji, davem, kumuda
In-Reply-To: <52D8EB12.6070901@gmail.com>

Hi!

On Fri, Jan 17, 2014 at 02:04:26PM +0530, sohny thomas wrote:
> Any updates on my reply, Any more info is required.
> Can this be pulled into the kernel tree?

Your patch was posted to the wrong mailing list. Please repost to
netdev@vger.kernel.org so <http://patchwork.ozlabs.org/project/netdev/list/>
picks it up.

Thanks,

  Hannes

^ permalink raw reply

* Re: [PATCH V3 net-next 1/3] ipv6: add the IPV6_FL_F_REFLECT flag to IPV6_FL_A_GET
From: Florent Fourcot @ 2014-01-17 12:16 UTC (permalink / raw)
  To: netdev, Hannes Frederic Sowa
In-Reply-To: <20140117012722.GA24345@order.stressinduktion.org>

Le 17/01/2014 02:27, Hannes Frederic Sowa a écrit :
> On Thu, Jan 16, 2014 at 05:19:16PM +0100, Florent Fourcot wrote:
>> @@ -1000,6 +1002,8 @@ static int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
>>  	ireq = inet_rsk(req);
>>  	ireq->ir_v6_rmt_addr = ipv6_hdr(skb)->saddr;
>>  	ireq->ir_v6_loc_addr = ipv6_hdr(skb)->daddr;
>> +	if (np->repflow)
>> +		np->flow_label = ip6_flowlabel(ipv6_hdr(skb));
>>  	if (!want_cookie || tmp_opt.tstamp_ok)
>>  		TCP_ECN_create_request(req, skb, sock_net(sk));
>>  
> 
> I am not sure here, do you write the flow_label on the listening socket?
> 

Hum, yes. You are right, this is bad.

One alternative is not so simple. Perhaps could we store the flowlabel
in inet_request_sock? It will be available in the tcp_v6_send_synack
function, even in case of retransmission.

^ permalink raw reply

* Re: [PATCH] net: sk == 0xffffffff fix - not for commit
From: Andrzej Pietrasiewicz @ 2014-01-17 12:18 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: linux-kernel, linux-usb, Kyungmin Park, Felipe Balbi,
	Greg Kroah-Hartman, Marek Szyprowski, Michal Nazarewicz,
	David S. Miller, Alexey Kuznetsov, James Morris,
	Hideaki YOSHIFUJI, Patrick McHardy, netdev
In-Reply-To: <1389889754.31367.406.camel@edumazet-glaptop2.roam.corp.google.com>

W dniu 16.01.2014 17:29, Eric Dumazet pisze:
> On Thu, 2014-01-16 at 16:21 +0100, Andrzej Pietrasiewicz wrote:
>> W dniu 10.12.2013 15:25, Eric Dumazet pisze:
>>> On Tue, 2013-12-10 at 07:55 +0100, Andrzej Pietrasiewicz wrote:
>>>> W dniu 09.12.2013 16:31, Eric Dumazet pisze:
>>>>> On Mon, 2013-12-09 at 12:47 +0100, Andrzej Pietrasiewicz wrote:
>>>>>> NOT FOR COMMITTING TO MAINLINE.
>>>>>>
>>>>>> With g_ether loaded the sk occasionally becomes 0xffffffff.
>>>>>> It happens usually after transferring few hundreds of kilobytes to few
>>>>>> tens of megabytes. If sk is 0xffffffff then dereferencing it causes
>>>>>> kernel panic.
>>>>>>
>>>>>> This is a *workaround*. I don't know enough net code to understand the core
>>>>>> of the problem. However, with this patch applied the problems are gone,
>>>>>> or at least pushed farther away.
>>>>>
>>>>> Is it happening on SMP or UP ?
>>>>
>>>> UP build, S5PC110
>>>
>>> OK
>>>
>>> I believe you need additional debugging to track the exact moment
>>> 0xffffffff is fed to 'sk'
>>>
>>> It looks like a very strange bug, involving a problem in some assembly
>>> helper, register save/restore, compiler bug or stack corruption or
>>> something.
>>>
>>
>> I started with adding WARN_ON(sk == 0xffffffff); just before return in
>> __inet_lookup_established(), and the problem was gone. So this looks
>> very strange, like a toolchain problem.
>
> Or a timing issue. Adding a WARN_ON() adds extra instructions and might
> really change the assembly output.
>
>>
>> I used gcc-linaro-arm-linux-gnueabihf-4.8-2013.05.
>>
>> If I change the toolchain to
>>
>> gcc-linaro-arm-linux-gnueabihf-4.7-2013.04-20130415
>>
>> the problem seems to have gone away.
>
> Its totally possible some barrier was not properly handled by the
> compiler. You could disassemble the function on both toolchains and
> try to spot the issue.
>

So I gave it a try.

Below is a part of assembly code (ARM) which corresponds to the last
lines of the __inet_lookup_established():

C source:
=========
found:
	rcu_read_unlock();
	return sk;
}

assembly for toolchain 4.7:
===========================
c0333bb8:	ebf4bb6e 	bl	c0062978 <__rcu_read_unlock>
c0333bbc:	e51b0030 	ldr	r0, [fp, #-48]	; 0x30
c0333bc0:	e24bd028 	sub	sp, fp, #40	; 0x28
c0333bc4:	e89daff0 	ldm	sp, {r4, r5, r6, r7, r8, r9, sl, fp, sp, pc}
c0333bc8:	e5132018 	ldr	r2, [r3, #-24]


assembly for toolchain 4.8:
===========================
c033ff5c:	ebf4927e 	bl	c006495c <__rcu_read_unlock>
c033ff60:	e24bd028 	sub	sp, fp, #40	; 0x28
c033ff64:	e51b0030 	ldr	r0, [fp, #-48]	; 0x30
c033ff68:	e89daff0 	ldm	sp, {r4, r5, r6, r7, r8, r9, sl, fp, sp, pc}
c033ff6c:	e5113018 	ldr	r3, [r1, #-24]

What can be seen is that the usage of registers is slightly different,
and, what is more important, the _order_ of ldr/sub is different.
Now, if I swap the instructions at offsets c033ff60 and c033ff64
in the 4.8-generated vmlinux, the problem seems gone! Well, at least
the binary behaves the same way as the 4.7-generated one.

Here is a _hypothesis_ of what _might_ be happening:

The function in question puts its return value in the register r0.
In both cases the return value is fetched from a memory location
relative #-48 to what the frame pointer points to. However,
in the 4.7-generated binary the ldr executes in the branch delay slot,
whereas in the 4.8-generated binary it is the sub which executes
in the branch delay slot. That way, in the 4.7-generated binary the return
value is fetched before __rcu_read_unlock begins, but in the
4.8-generated binary it is fetched some time later. Which might be
enough for someone else to schedule in and break the data to be
copied to r0 and returned from the function.

As I said, this is just a hypothesis.

AP

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox