Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH] pkt_sched: act_xt support new Xtables interface
From: Jan Engelhardt @ 2012-12-21 14:35 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Yury Stankevich, Hasan Chowdhury, Stephen Hemminger,
	netdev@vger.kernel.org, pablo, netfilter-devel
In-Reply-To: <50D46928.9070809@mojatatu.com>


On Friday 2012-12-21 14:50, Jamal Hadi Salim wrote:
> On 12-12-21 08:13 AM, Yury Stankevich wrote:

>> i use it ingress,
>> so, i need to use tc xt action
>> to get mark on the packet, before filter on ifb will run.
>> prerouting rule, in turn, used to test if mark was actually restored.
>
> No experience with connmark, but - in order to restore something has
> to store it, correct?

The bigger problem here, if I see __netif_receive_skb right, is that
when ingress rules run, skb->nfct is still unset, thereby the
CONNMARK action is a no-op.

^ permalink raw reply

* Re: perm_addr get
From: Ben Hutchings @ 2012-12-21 14:40 UTC (permalink / raw)
  To: Jiri Pirko; +Cc: netdev
In-Reply-To: <20121221132500.GC2040@minipsycho.orion>

On Fri, 2012-12-21 at 14:25 +0100, Jiri Pirko wrote:
> Fri, Dec 21, 2012 at 02:09:55PM CET, bhutchings@solarflare.com wrote:
> >On Fri, 2012-12-21 at 14:01 +0100, Jiri Pirko wrote:
> >> Hi all.
> >> 
> >> From what I understand dev->perm_addr is set only in case the hw has
> >> permanent hw address somewhere written (for example EPROM).
> >> 
> >> So when I query device which does not have perm_addr set I get:
> >> 
> >> testt1:~$ ethtool -P team0
> >> Permanent address: 00:00:00:00:00:00
> >> 
> >> Is this the correct behaviour? Wouldn't it be more correct if
> >> ethtool_get_perm_addr() fails with -ENOENT for something like that?
> >
> >I don't think we should change the implementation now, as someone might
> >depend on it.  It's trivial to distinguish this not-a-permanent-address
> >case.  However the ethtool command output could be improved.
> 
> Well, not change it even if it is not correct? And by "trivial to distinguish"
> you mean 00:00:00:00:00:00 ~ device has no permanent address?
> But in some cases (like vxge, mac80211) it's possible to see 00:00:00:00:00:00
> by ethtool -P even though the device has permanent address (set later on, after
> register_netdev call).

Oh well, those are just driver bugs that should be fixed.

> I think that ethtool_get_perm_addr should return:
> -ENOENT if dev has no perm addr
> -EAGAIN if dev perm addr hasn't been obtained yet
> 0 (and addr) in other cases
> 	
> And how exactly should be the ethtool output improved?

Print something like 'No permanent address assigned' when the address is
all-zeroes.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

* Re: [PATCH] pkt_sched: act_xt support new Xtables interface
From: Eric Dumazet @ 2012-12-21 15:45 UTC (permalink / raw)
  To: Jan Engelhardt
  Cc: Jamal Hadi Salim, Yury Stankevich, Hasan Chowdhury,
	Stephen Hemminger, netdev@vger.kernel.org, pablo, netfilter-devel
In-Reply-To: <alpine.LNX.2.01.1212211533360.957@nerf07.vanv.qr>

On Fri, 2012-12-21 at 15:35 +0100, Jan Engelhardt wrote:

> 
> The bigger problem here, if I see __netif_receive_skb right, is that
> when ingress rules run, skb->nfct is still unset, thereby the
> CONNMARK action is a no-op.

Right, ingress is performed before IP/netfilter stack.

This reminds me this might be the reason we have
skb_reset_transport_header(skb);
in __netif_receive_skb(), while its not very logical.

(Yes, sorry for being off topic, but I am referring to 
http://www.spinics.net/lists/netdev/msg214662.html )

^ permalink raw reply

* Re: TUN problems (regression?)
From: Paul Moore @ 2012-12-21 16:27 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Jason Wang, netdev
In-Reply-To: <1356046697.21834.3606.camel@edumazet-glaptop>

On Thursday, December 20, 2012 03:38:17 PM Eric Dumazet wrote:
> On Thu, 2012-12-20 at 18:16 -0500, Paul Moore wrote:
> > [CC'ing netdev in case this is a known problem I just missed ...]
> > 
> > Hi Jason,
> > 
> > I started doing some more testing with the multiqueue TUN changes and I
> > ran
> > into a problem when running tunctl: running it once w/o arguments works as
> > expected, but running it a second time results in failure and a
> > kmem_cache_sanity_check() failure.  The problem appears to be very
> > repeatable on my test VM and happens independent of the LSM/SELinux fixup
> > patches.
> > 
> > Have you seen this before?
> 
> Obviously code in tun_flow_init() is wrong...
> 
> static int tun_flow_init(struct tun_struct *tun)
> {
>         int i;
> 
>         tun->flow_cache = kmem_cache_create("tun_flow_cache",
>                                             sizeof(struct tun_flow_entry),
> 0, 0, NULL);
>         if (!tun->flow_cache)
>                 return -ENOMEM;
> ...
> }
> 
> 
> I have no idea why we would need a kmem_cache per tun_struct,
> and why we even need a kmem_cache.
> 
> 
> I would try following patch :
> 
>  drivers/net/tun.c |   24 +++---------------------
>  1 file changed, 3 insertions(+), 21 deletions(-)

Thanks, that solved my problem.  Also, in case you were still curious, I was 
using SLUB.

-- 
paul moore
security and virtualization @ redhat

^ permalink raw reply

* IPv6 over Firewire
From: Stephan Gatzka @ 2012-12-21 17:03 UTC (permalink / raw)
  To: netdev, linux1394-devel

Hi!

I'm currently implementing IPv6 over firewire.

To make the address mapping to the firewire link layer RFC3146 mandates 
to use neighbor discovery with a certain format of the source/target 
link-layer address option.

While it is not too complicated to build that up, I'm wondering how I 
can reserve enough memory in the corresponding skb.

One possibility is to introduce some option padding in 
ndisc_addr_option_pad() but I think that's somehow weird.

The second option I see is to set needed_tailroom in struct netdevice 
for firewire net devices. That's the way I would go for.

Because I'm not really familiar with the whole network infrastructure in 
Linux, a confirmation for the way to go would be nice.

Regards,

Stephan

^ permalink raw reply

* [PATCH] tuntap: dont use a private kmem_cache
From: Eric Dumazet @ 2012-12-21 17:17 UTC (permalink / raw)
  To: Paul Moore, David Miller; +Cc: Jason Wang, netdev, Stephen Hemminger
In-Reply-To: <2090364.S7KStA6R4d@sifl>

From: Eric Dumazet <edumazet@google.com>

Commit 96442e42429 (tuntap: choose the txq based on rxq)
added a per tun_struct kmem_cache.

As soon as several tun_struct are used, we get an error
because two caches cannot have same name.

Use the default kmalloc()/kfree_rcu(), as it reduce code
size and doesn't have performance impact here.

Reported-by: Paul Moore <pmoore@redhat.com>
Tested-by: Paul Moore <pmoore@redhat.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Acked-by: Jason Wang <jasowang@redhat.com>
---
 drivers/net/tun.c |   24 +++---------------------
 1 file changed, 3 insertions(+), 21 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 504f7f1..fbd106e 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -180,7 +180,6 @@ struct tun_struct {
 	int debug;
 #endif
 	spinlock_t lock;
-	struct kmem_cache *flow_cache;
 	struct hlist_head flows[TUN_NUM_FLOW_ENTRIES];
 	struct timer_list flow_gc_timer;
 	unsigned long ageing_time;
@@ -209,8 +208,8 @@ static struct tun_flow_entry *tun_flow_create(struct tun_struct *tun,
 					      struct hlist_head *head,
 					      u32 rxhash, u16 queue_index)
 {
-	struct tun_flow_entry *e = kmem_cache_alloc(tun->flow_cache,
-						    GFP_ATOMIC);
+	struct tun_flow_entry *e = kmalloc(sizeof(*e), GFP_ATOMIC);
+
 	if (e) {
 		tun_debug(KERN_INFO, tun, "create flow: hash %u index %u\n",
 			  rxhash, queue_index);
@@ -223,19 +222,12 @@ static struct tun_flow_entry *tun_flow_create(struct tun_struct *tun,
 	return e;
 }
 
-static void tun_flow_free(struct rcu_head *head)
-{
-	struct tun_flow_entry *e
-		= container_of(head, struct tun_flow_entry, rcu);
-	kmem_cache_free(e->tun->flow_cache, e);
-}
-
 static void tun_flow_delete(struct tun_struct *tun, struct tun_flow_entry *e)
 {
 	tun_debug(KERN_INFO, tun, "delete flow: hash %u index %u\n",
 		  e->rxhash, e->queue_index);
 	hlist_del_rcu(&e->hash_link);
-	call_rcu(&e->rcu, tun_flow_free);
+	kfree_rcu(e, rcu);
 }
 
 static void tun_flow_flush(struct tun_struct *tun)
@@ -833,12 +825,6 @@ static int tun_flow_init(struct tun_struct *tun)
 {
 	int i;
 
-	tun->flow_cache = kmem_cache_create("tun_flow_cache",
-					    sizeof(struct tun_flow_entry), 0, 0,
-					    NULL);
-	if (!tun->flow_cache)
-		return -ENOMEM;
-
 	for (i = 0; i < TUN_NUM_FLOW_ENTRIES; i++)
 		INIT_HLIST_HEAD(&tun->flows[i]);
 
@@ -854,10 +840,6 @@ static void tun_flow_uninit(struct tun_struct *tun)
 {
 	del_timer_sync(&tun->flow_gc_timer);
 	tun_flow_flush(tun);
-
-	/* Wait for completion of call_rcu()'s */
-	rcu_barrier();
-	kmem_cache_destroy(tun->flow_cache);
 }
 
 /* Initialize net device. */

^ permalink raw reply related

* [PATCH] ipv4: arp: fix a lockdep splat in arp_solicit()
From: Eric Dumazet @ 2012-12-21 17:32 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Yan Burman, Stephen Hemminger

From: Eric Dumazet <edumazet@google.com>

Yan Burman reported following lockdep warning :

=============================================
[ INFO: possible recursive locking detected ]
3.7.0+ #24 Not tainted
---------------------------------------------
swapper/1/0 is trying to acquire lock:
  (&n->lock){++--..}, at: [<ffffffff8139f56e>] __neigh_event_send
+0x2e/0x2f0

but task is already holding lock:
  (&n->lock){++--..}, at: [<ffffffff813f63f4>] arp_solicit+0x1d4/0x280

other info that might help us debug this:
  Possible unsafe locking scenario:

        CPU0
        ----
   lock(&n->lock);
   lock(&n->lock);

  *** DEADLOCK ***

  May be due to missing lock nesting notation

4 locks held by swapper/1/0:
  #0:  (((&n->timer))){+.-...}, at: [<ffffffff8104b350>] 
call_timer_fn+0x0/0x1c0
  #1:  (&n->lock){++--..}, at: [<ffffffff813f63f4>] arp_solicit
+0x1d4/0x280
  #2:  (rcu_read_lock_bh){.+....}, at: [<ffffffff81395400>] 
dev_queue_xmit+0x0/0x5d0
  #3:  (rcu_read_lock_bh){.+....}, at: [<ffffffff813cb41e>] 
ip_finish_output+0x13e/0x640

stack backtrace:
Pid: 0, comm: swapper/1 Not tainted 3.7.0+ #24
Call Trace:
  <IRQ>  [<ffffffff8108c7ac>] validate_chain+0xdcc/0x11f0
  [<ffffffff8108d570>] ? __lock_acquire+0x440/0xc30
  [<ffffffff81120565>] ? kmem_cache_free+0xe5/0x1c0
  [<ffffffff8108d570>] __lock_acquire+0x440/0xc30
  [<ffffffff813c3570>] ? inet_getpeer+0x40/0x600
  [<ffffffff8108d570>] ? __lock_acquire+0x440/0xc30
  [<ffffffff8139f56e>] ? __neigh_event_send+0x2e/0x2f0
  [<ffffffff8108ddf5>] lock_acquire+0x95/0x140
  [<ffffffff8139f56e>] ? __neigh_event_send+0x2e/0x2f0
  [<ffffffff8108d570>] ? __lock_acquire+0x440/0xc30
  [<ffffffff81448d4b>] _raw_write_lock_bh+0x3b/0x50
  [<ffffffff8139f56e>] ? __neigh_event_send+0x2e/0x2f0
  [<ffffffff8139f56e>] __neigh_event_send+0x2e/0x2f0
  [<ffffffff8139f99b>] neigh_resolve_output+0x16b/0x270
  [<ffffffff813cb62d>] ip_finish_output+0x34d/0x640
  [<ffffffff813cb41e>] ? ip_finish_output+0x13e/0x640
  [<ffffffffa046f146>] ? vxlan_xmit+0x556/0xbec [vxlan]
  [<ffffffff813cb9a0>] ip_output+0x80/0xf0
  [<ffffffff813ca368>] ip_local_out+0x28/0x80
  [<ffffffffa046f25a>] vxlan_xmit+0x66a/0xbec [vxlan]
  [<ffffffffa046f146>] ? vxlan_xmit+0x556/0xbec [vxlan]
  [<ffffffff81394a50>] ? skb_gso_segment+0x2b0/0x2b0
  [<ffffffff81449355>] ? _raw_spin_unlock_irqrestore+0x65/0x80
  [<ffffffff81394c57>] ? dev_queue_xmit_nit+0x207/0x270
  [<ffffffff813950c8>] dev_hard_start_xmit+0x298/0x5d0
  [<ffffffff813956f3>] dev_queue_xmit+0x2f3/0x5d0
  [<ffffffff81395400>] ? dev_hard_start_xmit+0x5d0/0x5d0
  [<ffffffff813f5788>] arp_xmit+0x58/0x60
  [<ffffffff813f59db>] arp_send+0x3b/0x40
  [<ffffffff813f6424>] arp_solicit+0x204/0x280
  [<ffffffff813a1a70>] ? neigh_add+0x310/0x310
  [<ffffffff8139f515>] neigh_probe+0x45/0x70
  [<ffffffff813a1c10>] neigh_timer_handler+0x1a0/0x2a0
  [<ffffffff8104b3cf>] call_timer_fn+0x7f/0x1c0
  [<ffffffff8104b350>] ? detach_if_pending+0x120/0x120
  [<ffffffff8104b748>] run_timer_softirq+0x238/0x2b0
  [<ffffffff813a1a70>] ? neigh_add+0x310/0x310
  [<ffffffff81043e51>] __do_softirq+0x101/0x280
  [<ffffffff814518cc>] call_softirq+0x1c/0x30
  [<ffffffff81003b65>] do_softirq+0x85/0xc0
  [<ffffffff81043a7e>] irq_exit+0x9e/0xc0
  [<ffffffff810264f8>] smp_apic_timer_interrupt+0x68/0xa0
  [<ffffffff8145122f>] apic_timer_interrupt+0x6f/0x80
  <EOI>  [<ffffffff8100a054>] ? mwait_idle+0xa4/0x1c0
  [<ffffffff8100a04b>] ? mwait_idle+0x9b/0x1c0
  [<ffffffff8100a6a9>] cpu_idle+0x89/0xe0
  [<ffffffff81441127>] start_secondary+0x1b2/0x1b6


Bug is from arp_solicit(), releasing the neigh lock after arp_send()
In case of vxlan, we eventually need to write lock a neigh lock later.

Its a false positive, but we can get rid of it without lockdep
annotations.

We can instead use neigh_ha_snapshot() helper.

Reported-by: Yan Burman <yanb@mellanox.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
---
 net/ipv4/arp.c |    8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/net/ipv4/arp.c b/net/ipv4/arp.c
index ce6fbdf..1169ed4 100644
--- a/net/ipv4/arp.c
+++ b/net/ipv4/arp.c
@@ -321,7 +321,7 @@ static void arp_error_report(struct neighbour *neigh, struct sk_buff *skb)
 static void arp_solicit(struct neighbour *neigh, struct sk_buff *skb)
 {
 	__be32 saddr = 0;
-	u8  *dst_ha = NULL;
+	u8 dst_ha[MAX_ADDR_LEN];
 	struct net_device *dev = neigh->dev;
 	__be32 target = *(__be32 *)neigh->primary_key;
 	int probes = atomic_read(&neigh->probes);
@@ -363,9 +363,9 @@ static void arp_solicit(struct neighbour *neigh, struct sk_buff *skb)
 	if (probes < 0) {
 		if (!(neigh->nud_state & NUD_VALID))
 			pr_debug("trying to ucast probe in NUD_INVALID\n");
-		dst_ha = neigh->ha;
-		read_lock_bh(&neigh->lock);
+		neigh_ha_snapshot(dst_ha, neigh, dev);
 	} else {
+		memset(dst_ha, 0, dev->addr_len);
 		probes -= neigh->parms->app_probes;
 		if (probes < 0) {
 #ifdef CONFIG_ARPD
@@ -377,8 +377,6 @@ static void arp_solicit(struct neighbour *neigh, struct sk_buff *skb)
 
 	arp_send(ARPOP_REQUEST, ETH_P_ARP, target, dev, saddr,
 		 dst_ha, dev->dev_addr, NULL);
-	if (dst_ha)
-		read_unlock_bh(&neigh->lock);
 }
 
 static int arp_ignore(struct in_device *in_dev, __be32 sip, __be32 tip)

^ permalink raw reply related

* Re: IPv6 over Firewire
From: YOSHIFUJI Hideaki @ 2012-12-21 17:53 UTC (permalink / raw)
  To: stephan.gatzka; +Cc: netdev, linux1394-devel, YOSHIFUJI Hideaki
In-Reply-To: <50D49659.1000101@gmail.com>

(2012年12月22日 02:03), Stephan Gatzka wrote:

> To make the address mapping to the firewire link layer RFC3146 mandates to use neighbor discovery with a certain format of the source/target link-layer address option.
> 
> While it is not too complicated to build that up, I'm wondering how I can reserve enough memory in the corresponding skb.
> 
> One possibility is to introduce some option padding in ndisc_addr_option_pad() but I think that's somehow weird.
> 
> The second option I see is to set needed_tailroom in struct netdevice for firewire net devices. That's the way I would go for.
> 
> Because I'm not really familiar with the whole network infrastructure in Linux, a confirmation for the way to go would be nice.

If you are talking about how to build NS/NA/RS/Redirect messages, you
can just use ndisc_opt_addr_space() and ndisc_fill_addr_option() here.

--yoshfuji

^ permalink raw reply

* [PATCH 0/2] Re-submit RDS patches
From: Mike Marciniszyn @ 2012-12-21 18:01 UTC (permalink / raw)
  To: venkat.x.venkatsubra-QHcLZuEGTsvQT0dZR+AlfA
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A,
	rds-devel-N0ozoZBvEnrZJqsBc5GL+g, davem-fT/PcQaiUtIeIZ0/mPfg9Q,
	netdev-u79uwXL29TY76Z2rM5mHXA

These fixes were originally submitted in
https://oss.oracle.com/pipermail/rds-devel/2012-April/thread.html.

The first patch fixes a DOA issue with RDS on qib and should
be a stable patch as well.

The second suppresses a warning message that is misleading when 
a version has been correctly determined.

These two patches were originally acked by the upstream maintainer
and never merged.

---

Mike Marciniszyn (2):
      IB/rds: Correct ib_api use with gs_dma_address/sg_dma_len
      IB/rds: suppress incompatible protocol when version is known

 net/rds/ib_cm.c   |   11 +++++------
 net/rds/ib_recv.c |    9 ++++++---
 2 files changed, 11 insertions(+), 9 deletions(-)

-- 
Thanks!
Mike Marciniszyn
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* [PATCH 1/2] IB/rds: Correct ib_api use with gs_dma_address/sg_dma_len
From: Mike Marciniszyn @ 2012-12-21 18:01 UTC (permalink / raw)
  To: venkat.x.venkatsubra; +Cc: linux-rdma, roland, rds-devel, davem, netdev
In-Reply-To: <20121221175857.23716.46975.stgit@phlsvslse11.ph.intel.com>

0b088e00 ("RDS: Use page_remainder_alloc() for recv bufs")
added uses of sg_dma_len() and sg_dma_address(). This makes
RDS DOA with the qib driver.

IB ulps should use ib_sg_dma_len() and ib_sg_dma_address
respectively since some HCAs overload ib_sg_dma* operations.

Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
---
 net/rds/ib_recv.c |    9 ++++++---
 1 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/net/rds/ib_recv.c b/net/rds/ib_recv.c
index 8c5bc85..8eb9501 100644
--- a/net/rds/ib_recv.c
+++ b/net/rds/ib_recv.c
@@ -339,8 +339,8 @@ static int rds_ib_recv_refill_one(struct rds_connection *conn,
 	sge->length = sizeof(struct rds_header);
 
 	sge = &recv->r_sge[1];
-	sge->addr = sg_dma_address(&recv->r_frag->f_sg);
-	sge->length = sg_dma_len(&recv->r_frag->f_sg);
+	sge->addr = ib_sg_dma_address(ic->i_cm_id->device, &recv->r_frag->f_sg);
+	sge->length = ib_sg_dma_len(ic->i_cm_id->device, &recv->r_frag->f_sg);
 
 	ret = 0;
 out:
@@ -381,7 +381,10 @@ void rds_ib_recv_refill(struct rds_connection *conn, int prefill)
 		ret = ib_post_recv(ic->i_cm_id->qp, &recv->r_wr, &failed_wr);
 		rdsdebug("recv %p ibinc %p page %p addr %lu ret %d\n", recv,
 			 recv->r_ibinc, sg_page(&recv->r_frag->f_sg),
-			 (long) sg_dma_address(&recv->r_frag->f_sg), ret);
+			 (long) ib_sg_dma_address(
+				ic->i_cm_id->device,
+				&recv->r_frag->f_sg),
+			ret);
 		if (ret) {
 			rds_ib_conn_error(conn, "recv post on "
 			       "%pI4 returned %d, disconnecting and "

^ permalink raw reply related

* [PATCH 2/2] IB/rds: suppress incompatible protocol when version is known
From: Mike Marciniszyn @ 2012-12-21 18:01 UTC (permalink / raw)
  To: venkat.x.venkatsubra; +Cc: linux-rdma, roland, rds-devel, davem, netdev
In-Reply-To: <20121221175857.23716.46975.stgit@phlsvslse11.ph.intel.com>

Add an else to only print the incompatible protocol message
when version hasn't been established.

Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
---
 net/rds/ib_cm.c |   11 +++++------
 1 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
index a1e1162..31b74f5 100644
--- a/net/rds/ib_cm.c
+++ b/net/rds/ib_cm.c
@@ -434,12 +434,11 @@ static u32 rds_ib_protocol_compatible(struct rdma_cm_event *event)
 		version = RDS_PROTOCOL_3_0;
 		while ((common >>= 1) != 0)
 			version++;
-	}
-	printk_ratelimited(KERN_NOTICE "RDS: Connection from %pI4 using "
-			"incompatible protocol version %u.%u\n",
-			&dp->dp_saddr,
-			dp->dp_protocol_major,
-			dp->dp_protocol_minor);
+	} else
+		printk_ratelimited(KERN_NOTICE "RDS: Connection from %pI4 using incompatible protocol version %u.%u\n",
+				&dp->dp_saddr,
+				dp->dp_protocol_major,
+				dp->dp_protocol_minor);
 	return version;
 }
 

^ permalink raw reply related

* TCP sequence number inference attack on Linux
From: Zhiyun Qian @ 2012-12-21 17:58 UTC (permalink / raw)
  To: netdev

Dear sir or madam,

My name is Zhiyun Qian, a recent PhD graduate from University of
Michigan. As our recent research effort, along with my colleagues, we
identified a vulnerability related to Linux. Details described in our
paper published at this year's ACM Conference on Computer and
Communications Security (CCS): Collaborative TCP Sequence Number
Inference Attack available
http://web.eecs.umich.edu/~zhiyunq/pub/ccs12_TCP_sequence_number_inference.pdf

Keywords: TCP, sequence number inference, DelayedAckLost counter,
privilege-escalation attack

The vulnerability would allow an local malicious program to gain write
access to TCP connections of other applications. An example attack
scenario (on android) would be "an attacker uploads a seemingly benign
app to the google play, when run at the background, it can inject
malicious HTML payload into a webpage open by the browser".

The problem is caused by the common TCP stats counters (the specific
counter I found is DelayedACKLost) maintained by the kernel (but
exposed to user space). By reading and reporting such counters to an
external attacker (colluded), the aforementioned attack can be
accomplished.

It is essentially a side-channel attack (using TCP stats counters to
infer TCP sequence numbers), but it is so real and easy to carry out
that I believe it should be considered a vulnerability. From a
difference perspective, this attack can be considered as a privilege
escalation attack since it allows a local non-privileged program to
gain write access to TCP connections made by other processes.

The lines of code in kernel impacted by the attack is:
net/ipv4/tcp_input.c at line 4166 (as in the latest kernel v3.7.1)

4160 static void tcp_send_dupack(struct sock *sk, const struct sk_buff *skb)
4161 {
4162        struct tcp_sock *tp = tcp_sk(sk);
4163
4164        if (TCP_SKB_CB(skb)->end_seq != TCP_SKB_CB(skb)->seq &&
4165            before(TCP_SKB_CB(skb)->seq, tp->rcv_nxt)) {
4166                NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_DELAYEDACKLOST);
4167                tcp_enter_quickack_mode(sk);
4168
4169                if (tcp_is_sack(tp) && sysctl_tcp_dsack) {
4170                        u32 end_seq = TCP_SKB_CB(skb)->end_seq;
4171
4172                        if (after(TCP_SKB_CB(skb)->end_seq, tp->rcv_nxt))
4173                                end_seq = tp->rcv_nxt;
4174                        tcp_dsack_set(sk, TCP_SKB_CB(skb)->seq, end_seq);
4175                }
4176        }
4177
4178        tcp_send_ack(sk);
4179 }

IMHO, an easy fix to the problem is to disallow unprivileged access to
counters such as DelayedAckLost. However, this solution may not be
ideal. A better way is to always enforce both acknowledgement number
and sequence number check on each incoming TCP packet instead of
checking one at a time. Currently, Linux TCP stack first only
validates the SEQ number and then subsequently the ACK number. If the
sequence number is invalid, the delayedAckLost counter will be
incremented (information about sequence number leaked already
regardless of the ACK number). To make attacker's job much harder (we
should require the attacker to guess both the valid sequence number
and ACK number at the same time). For instance, following RFC 5961:
(SND.UNA - MAX.SND.WND) <= SEG.ACK <= SND.NXT, this would require the
attacker to send many times (on the order of 10000) more packets to
conduct the same attack.

Please feel free to ask me any questions regarding this vulnerability.

Best,
-Zhiyun

^ permalink raw reply

* Re: [RFC] IP_MAX_MTU value
From: Rick Jones @ 2012-12-21 18:19 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev
In-Reply-To: <1356072468.21834.4805.camel@edumazet-glaptop>

On 12/20/2012 10:47 PM, Eric Dumazet wrote:
> Hi David
>
> We have the following definition in net/ipv4/route.c
>
> #define IP_MAX_MTU   0xFFF0
>
> This means that "netperf -t UDP_STREAM", using UDP messages of 65507
> bytes, are fragmented on loopback interface (while its MTU is now 65536
> and should allow those UDP messages being sent without fragments)
>
> I guess Rick chose 65507 bytes in netperf because it was related to the
> max IPv4 datagram length :
>
> 65507 + 28 = 65535

That is correct.  From src/nettest_opmni.c:

/* choosing the default send size is a trifle more complicated than it
    used to be as we have to account for different protocol limits */

#define UDP_LENGTH_MAX (0xFFFF - 28)

static int
choose_send_size(int lss, int protocol) {

   int send_size;

   if (lss > 0) {
     send_size = lss_size;

     /* we will assume that everyone has IPPROTO_UDP and thus avoid an
        issue with Windows using an enum */
     if ((protocol == IPPROTO_UDP) && (send_size > UDP_LENGTH_MAX))
       send_size = UDP_LENGTH_MAX;

   }
   else {
     send_size = 4096;
   }
   return send_size;
}

And I figured that while IPv6 allows even larger sizes, the likelihood 
of it mattering in the then near/medium term was minimal.

> Changing IP_MAX_MTU from 0xFFF0 to 0x10000 seems safe [1], but I might
> miss something really obvious ?

If you go beyond the protocol limit of an IPv4 datagram, won't it be 
necessary to  start being a bit more conditional on IPv4 vs IPv6?

> It might be because in old days we reserved 16 bytes for the ethernet
> header, and we wanted to avoid kmalloc() round-up to kmalloc-131072
> slab ?
>
> If so, we certainly can limit skb->head to 32 or 64 KB and complete with
> page fragments the remaining space.
>
> Thanks
>
> [1] performance increase is ~50%

99 times out of 10 I will assert that faster is better, but do we need 
another 50% for UDP over loopback with that large a message size?

happy benchmarking,

rick jones

^ permalink raw reply

* Re: TCP sequence number inference attack on Linux
From: Eric Dumazet @ 2012-12-21 18:31 UTC (permalink / raw)
  To: Zhiyun Qian; +Cc: netdev
In-Reply-To: <CALvgte86miv450KnOcFRR-oEm_f=qRXarDfQkyU7T3OLqq816A@mail.gmail.com>

On Fri, 2012-12-21 at 12:58 -0500, Zhiyun Qian wrote:
> Dear sir or madam,
> 
> My name is Zhiyun Qian, a recent PhD graduate from University of
> Michigan. As our recent research effort, along with my colleagues, we
> identified a vulnerability related to Linux. Details described in our
> paper published at this year's ACM Conference on Computer and
> Communications Security (CCS): Collaborative TCP Sequence Number
> Inference Attack available
> http://web.eecs.umich.edu/~zhiyunq/pub/ccs12_TCP_sequence_number_inference.pdf
> 
> Keywords: TCP, sequence number inference, DelayedAckLost counter,
> privilege-escalation attack
> 
> The vulnerability would allow an local malicious program to gain write
> access to TCP connections of other applications. An example attack
> scenario (on android) would be "an attacker uploads a seemingly benign
> app to the google play, when run at the background, it can inject
> malicious HTML payload into a webpage open by the browser".
> 
> The problem is caused by the common TCP stats counters (the specific
> counter I found is DelayedACKLost) maintained by the kernel (but
> exposed to user space). By reading and reporting such counters to an
> external attacker (colluded), the aforementioned attack can be
> accomplished.
> 
> It is essentially a side-channel attack (using TCP stats counters to
> infer TCP sequence numbers), but it is so real and easy to carry out
> that I believe it should be considered a vulnerability. From a
> difference perspective, this attack can be considered as a privilege
> escalation attack since it allows a local non-privileged program to
> gain write access to TCP connections made by other processes.
> 
> The lines of code in kernel impacted by the attack is:
> net/ipv4/tcp_input.c at line 4166 (as in the latest kernel v3.7.1)
> 
> 4160 static void tcp_send_dupack(struct sock *sk, const struct sk_buff *skb)
> 4161 {
> 4162        struct tcp_sock *tp = tcp_sk(sk);
> 4163
> 4164        if (TCP_SKB_CB(skb)->end_seq != TCP_SKB_CB(skb)->seq &&
> 4165            before(TCP_SKB_CB(skb)->seq, tp->rcv_nxt)) {
> 4166                NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_DELAYEDACKLOST);
> 4167                tcp_enter_quickack_mode(sk);
> 4168
> 4169                if (tcp_is_sack(tp) && sysctl_tcp_dsack) {
> 4170                        u32 end_seq = TCP_SKB_CB(skb)->end_seq;
> 4171
> 4172                        if (after(TCP_SKB_CB(skb)->end_seq, tp->rcv_nxt))
> 4173                                end_seq = tp->rcv_nxt;
> 4174                        tcp_dsack_set(sk, TCP_SKB_CB(skb)->seq, end_seq);
> 4175                }
> 4176        }
> 4177
> 4178        tcp_send_ack(sk);
> 4179 }
> 
> 
> IMHO, an easy fix to the problem is to disallow unprivileged access to
> counters such as DelayedAckLost. However, this solution may not be
> ideal. A better way is to always enforce both acknowledgement number
> and sequence number check on each incoming TCP packet instead of
> checking one at a time. Currently, Linux TCP stack first only
> validates the SEQ number and then subsequently the ACK number. If the
> sequence number is invalid, the delayedAckLost counter will be
> incremented (information about sequence number leaked already
> regardless of the ACK number). To make attacker's job much harder (we
> should require the attacker to guess both the valid sequence number
> and ACK number at the same time). For instance, following RFC 5961:
> (SND.UNA - MAX.SND.WND) <= SEG.ACK <= SND.NXT, this would require the
> attacker to send many times (on the order of 10000) more packets to
> conduct the same attack.
> 
> Please feel free to ask me any questions regarding this vulnerability.


I believe RFC 5961 was implemented in recent linux versions.

Is the described vulnerability still present ?

^ permalink raw reply

* Re: [PATCH] xen/netfront: improve truesize tracking
From: Rick Jones @ 2012-12-21 18:33 UTC (permalink / raw)
  To: Sander Eikelenboom
  Cc: Eric Dumazet, Ian Campbell, netdev@vger.kernel.org,
	Konrad Rzeszutek Wilk, annie li, xen-devel@lists.xensource.com
In-Reply-To: <1609010645.20121221122100@eikelenboom.it>

I'm guessing that trusize checks matter more on the "inbound" path than 
the outbound path?  If that is indeed the case, then instead of, or in 
addition to using the -s option to set the local (netperf side) socket 
buffer size, you should use a -S option to set the remote (netserver 
side) socket buffer size.

happy benchmarking,

rick jones

^ permalink raw reply

* Re: [RFC] IP_MAX_MTU value
From: Eric Dumazet @ 2012-12-21 18:34 UTC (permalink / raw)
  To: Rick Jones; +Cc: David Miller, netdev
In-Reply-To: <50D4A84D.1010402@hp.com>

On Fri, 2012-12-21 at 10:19 -0800, Rick Jones wrote:

> If you go beyond the protocol limit of an IPv4 datagram, won't it be 
> necessary to  start being a bit more conditional on IPv4 vs IPv6?
> 

This IP_MAX_MTU is really an IPv4 thing (static to net/ipv4/route.c)


> 
> 99 times out of 10 I will assert that faster is better, but do we need 
> another 50% for UDP over loopback with that large a message size?

Well, I only didnt understand why sending 65507 UDP messages had to use
fragments. I didnt care of performance at this point, only tried to have
an reasonable explanation.

It turns out its a strange limitation.

^ permalink raw reply

* Re: IPv6 over Firewire
From: Stephan Gatzka @ 2012-12-21 18:39 UTC (permalink / raw)
  To: YOSHIFUJI Hideaki; +Cc: netdev, linux1394-devel
In-Reply-To: <50D4A219.7080807@linux-ipv6.org>

> If you are talking about how to build NS/NA/RS/Redirect messages, you
> can just use ndisc_opt_addr_space() and ndisc_fill_addr_option() here.

Thanks, these functions are certainly helpful. But 
ndisc_opt_addr_space() calculates the required space from dev->addr_len 
and ndisc_addr_option_pad(dev->type). The latter is 0 for IEEE1394 
(firewire). So the required option space just comes from dev->addr_len, 
which is 8 for firewire, resulting in an option address space of 16 (2 
octets).

But rfc3146 requires an option address space of 3 octets. So my main 
question is if in such a situation the best is to reserve additional skb 
tail room using needed_tailroom in struct netdevice. This directly 
affects the memory allocated in ndisc_build_skb().

Regards,

Stephan

^ permalink raw reply

* Re: [RFC] IP_MAX_MTU value
From: Rick Jones @ 2012-12-21 18:50 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev
In-Reply-To: <1356114879.21834.7709.camel@edumazet-glaptop>

On 12/21/2012 10:34 AM, Eric Dumazet wrote:
> On Fri, 2012-12-21 at 10:19 -0800, Rick Jones wrote:
>
>> If you go beyond the protocol limit of an IPv4 datagram, won't it be
>> necessary to  start being a bit more conditional on IPv4 vs IPv6?
>>
>
> This IP_MAX_MTU is really an IPv4 thing (static to net/ipv4/route.c)

OK. Doesn't this:

         if (mtu > IP_MAX_MTU)
                 mtu = IP_MAX_MTU;

mean it should be OK to go to 0xFFFF but not 0x10000?  Since 65535 is 
the limit of an IPv4 datagram and so I would think would be the maximum 
MTU for an IPv4 interface.

rick

^ permalink raw reply

* Re: TCP sequence number inference attack on Linux
From: Eric Dumazet @ 2012-12-21 18:52 UTC (permalink / raw)
  To: Zhiyun Qian; +Cc: netdev
In-Reply-To: <1356114663.21834.7697.camel@edumazet-glaptop>

On Fri, 2012-12-21 at 10:31 -0800, Eric Dumazet wrote:

> I believe RFC 5961 was implemented in recent linux versions.
> 
> Is the described vulnerability still present ?
> 

By the way, I believe Chrome browser uses private network namespaces,
and statistics are per network namespace, so it should be safe.

^ permalink raw reply

* Re: [RFC] IP_MAX_MTU value
From: Eric Dumazet @ 2012-12-21 18:54 UTC (permalink / raw)
  To: Rick Jones; +Cc: David Miller, netdev
In-Reply-To: <50D4AF72.2020101@hp.com>

On Fri, 2012-12-21 at 10:50 -0800, Rick Jones wrote:
> On 12/21/2012 10:34 AM, Eric Dumazet wrote:
> > On Fri, 2012-12-21 at 10:19 -0800, Rick Jones wrote:
> >
> >> If you go beyond the protocol limit of an IPv4 datagram, won't it be
> >> necessary to  start being a bit more conditional on IPv4 vs IPv6?
> >>
> >
> > This IP_MAX_MTU is really an IPv4 thing (static to net/ipv4/route.c)
> 
> OK. Doesn't this:
> 
>          if (mtu > IP_MAX_MTU)
>                  mtu = IP_MAX_MTU;
> 
> mean it should be OK to go to 0xFFFF but not 0x10000?  Since 65535 is 
> the limit of an IPv4 datagram and so I would think would be the maximum 
> MTU for an IPv4 interface.

Sure, I meant 0xFFFF and it is the value I used in my tests.

65536 is the current MTU of loopback device, and this can be used by
other protocols.

^ permalink raw reply

* Re: TCP sequence number inference attack on Linux
From: Zhiyun Qian @ 2012-12-21 19:10 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev
In-Reply-To: <1356114663.21834.7697.camel@edumazet-glaptop>

That's good to know. However, implementing RFC 5961 alone is not
sufficient. Like I said, since DelayedAckLost counter is incremented
purely upon looking at the sequence number, regardless of the ACK
number. An attacker thus can still infer the sequence number based on
DelayedAckLost counter without knowing the right ACK number.

The next question is how can the attacker eventually know the right
ACK number in order to inject real data. It turns out that this is not
hard either. First, based on the current Linux TCP stack, it accepts
incoming packets without ACK flag. Further, if ACK flag is not set,
ACK number will not be checked at all. See code in
net/ipv4/tcp_input.c, function tcp_rcv_established()

5547        if (th->ack && tcp_ack(sk, skb, FLAG_SLOWPATH) < 0)
5548                goto discard;

Second, even if ACK number is always checked before accepting the
payload, it is still possible to infer the ACK number much like how
sequence number can be inferred. The details is described in Section
3.4 of my paper, paragraph starting with "Client-side sequence number
inference".

I'm looking at the latest kernel v3.7.1 right now. I believe the
problem do still exist in today's Linux.

-Zhiyun

On Fri, Dec 21, 2012 at 1:31 PM, Eric Dumazet <erdnetdev@gmail.com> wrote:
> On Fri, 2012-12-21 at 12:58 -0500, Zhiyun Qian wrote:
>> Dear sir or madam,
>>
>> My name is Zhiyun Qian, a recent PhD graduate from University of
>> Michigan. As our recent research effort, along with my colleagues, we
>> identified a vulnerability related to Linux. Details described in our
>> paper published at this year's ACM Conference on Computer and
>> Communications Security (CCS): Collaborative TCP Sequence Number
>> Inference Attack available
>> http://web.eecs.umich.edu/~zhiyunq/pub/ccs12_TCP_sequence_number_inference.pdf
>>
>> Keywords: TCP, sequence number inference, DelayedAckLost counter,
>> privilege-escalation attack
>>
>> The vulnerability would allow an local malicious program to gain write
>> access to TCP connections of other applications. An example attack
>> scenario (on android) would be "an attacker uploads a seemingly benign
>> app to the google play, when run at the background, it can inject
>> malicious HTML payload into a webpage open by the browser".
>>
>> The problem is caused by the common TCP stats counters (the specific
>> counter I found is DelayedACKLost) maintained by the kernel (but
>> exposed to user space). By reading and reporting such counters to an
>> external attacker (colluded), the aforementioned attack can be
>> accomplished.
>>
>> It is essentially a side-channel attack (using TCP stats counters to
>> infer TCP sequence numbers), but it is so real and easy to carry out
>> that I believe it should be considered a vulnerability. From a
>> difference perspective, this attack can be considered as a privilege
>> escalation attack since it allows a local non-privileged program to
>> gain write access to TCP connections made by other processes.
>>
>> The lines of code in kernel impacted by the attack is:
>> net/ipv4/tcp_input.c at line 4166 (as in the latest kernel v3.7.1)
>>
>> 4160 static void tcp_send_dupack(struct sock *sk, const struct sk_buff *skb)
>> 4161 {
>> 4162        struct tcp_sock *tp = tcp_sk(sk);
>> 4163
>> 4164        if (TCP_SKB_CB(skb)->end_seq != TCP_SKB_CB(skb)->seq &&
>> 4165            before(TCP_SKB_CB(skb)->seq, tp->rcv_nxt)) {
>> 4166                NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_DELAYEDACKLOST);
>> 4167                tcp_enter_quickack_mode(sk);
>> 4168
>> 4169                if (tcp_is_sack(tp) && sysctl_tcp_dsack) {
>> 4170                        u32 end_seq = TCP_SKB_CB(skb)->end_seq;
>> 4171
>> 4172                        if (after(TCP_SKB_CB(skb)->end_seq, tp->rcv_nxt))
>> 4173                                end_seq = tp->rcv_nxt;
>> 4174                        tcp_dsack_set(sk, TCP_SKB_CB(skb)->seq, end_seq);
>> 4175                }
>> 4176        }
>> 4177
>> 4178        tcp_send_ack(sk);
>> 4179 }
>>
>>
>> IMHO, an easy fix to the problem is to disallow unprivileged access to
>> counters such as DelayedAckLost. However, this solution may not be
>> ideal. A better way is to always enforce both acknowledgement number
>> and sequence number check on each incoming TCP packet instead of
>> checking one at a time. Currently, Linux TCP stack first only
>> validates the SEQ number and then subsequently the ACK number. If the
>> sequence number is invalid, the delayedAckLost counter will be
>> incremented (information about sequence number leaked already
>> regardless of the ACK number). To make attacker's job much harder (we
>> should require the attacker to guess both the valid sequence number
>> and ACK number at the same time). For instance, following RFC 5961:
>> (SND.UNA - MAX.SND.WND) <= SEG.ACK <= SND.NXT, this would require the
>> attacker to send many times (on the order of 10000) more packets to
>> conduct the same attack.
>>
>> Please feel free to ask me any questions regarding this vulnerability.
>
>
> I believe RFC 5961 was implemented in recent linux versions.
>
> Is the described vulnerability still present ?
>
>
>

^ permalink raw reply

* RE: [PATCH 1/3] net: stmmac: change GMAC control register for SGMII
From: Byungho An @ 2012-12-21 19:17 UTC (permalink / raw)
  To: 'Giuseppe CAVALLARO'
  Cc: davem, jeffrey.t.kirsher, netdev, kgene.kim, linux-kernel
In-Reply-To: <50BDEDE8.4030105@st.com>

On 12/04/2012 06:35 AM, Giuseppe CAVALLARO wrote:
>On 11/28/2012 11:57 AM, Byungho An wrote:
>> On 11/26/2012 07:31 PM, Giuseppe CABALLARO wrote:
>>> On 11/23/2012 10:04 AM, Byungho An wrote:
>>>>
>>>> This patch changes GMAC control register (TC(Transmit
>>>> Configuration) and PS(Port Selection) bit for SGMII.
>>>> In case of SGMII, TC bit is '1' and PS bit is 0.
>>>
>>> IMO this new support that should be released for net-next and further
>>> effort is actually needed.
>>>
>> OK, I see but if possible, I want to support the new features which is
>> included in this patch from v3.8
>
>ok I agree and I can support you.
>
>>
>>> The availability of the PCS registers is given by looking at the HW
>>> feature register. In fact, these are optional registers.
>>> I don't want to break the compatibility with old chips.
>>>
>> It means that old chip doesn't have this bit or this register? If that,
how
>> about using compatible in DT blob like snps,dwmac-3.70a and then in just
>> this case trying to read this bit and this register.
>
>The driver also works on mac 10/100 Databook 2.0 that has not these
>registers.
>
>>> I do not see why we have to use Kconfig macro to select ANE etc (as
>>> you do in your patches).
>> OK. I agree with you.
>
>we have to use the HW feature reg.
>
>>
>>> The driver could directly manage the phy device by itself if possible
>>> and the stmmac_init_phy should be reworked.
>>>
>> Could you explain more detail? As I understood, after set ANE bit in MAC
>> side then PHY auto-negotiation can be enabled. If I'm wrong let me know.
>> According to your mention, MAC and PHY auto-negotiation can be managed
>in
>> stmmac_init_phy?
>
>Currently the driver uses the Physical Abstraction Layer (PAL) to dialog
>with a PHY. On all the platforms supported (not only ST) we have always
>used it. Personally, I tested several phy devices with different MII
>interfaces (MII/RMII/GMII/RGMII ... ) but not TBI/RTBI/SGMII interfaces.
>
Currently I'm testing on SGMII interface and so far it's working fine.

>>> There are several things that need to be implemented. For example:
>>>
>>> The ISR (e.g. priv->hw->mac->host_irq_status) should be able to manage
>>> these new interrupts.
>> I think that there would be two additional interrupts."PCS
Auto-Negotiation
>> Complete" and "PCS Link Status Changed". These two interrupts are added
>to
>> "stmmac_interrupt". In my opinion, there are no specific processing for
>> these two irqs. What do you think about it?
>
>if the link changes this has to be logged in the driver.
>For example, depending on the link speed on some platforms we need to
>call dedicated call-back to set sysconfig registers or custom clocks.
>
>>> The code has to be able to maintain the user interface.
>>> For example if you want to enable ANE or manage Advertisement caps.
>>>
>> Does it mean that command line or other network command(e.g. ifconfig...)
>or
>> ioctol? Actually I don't understand exact user interface way. Could you
>> recommend the method for user interface?
>
>Using ethtool or mii-tool the user want to know the link status. So
>these kind of information have to be maintained.
>
As I know, user can set speed, duplex mode and auto-negotiation on/off using
ethtool.
If user wants to set auto-negotiation on, I think GMAC's ANE bit should be
ON as well.
For example, in stmmac_ethtool.c file, I try to add GMAC ANE bit enable
setting.
I think stmmac_set_pauseparam function is suitable for it.

	if (phy->autoneg) {
		if (netif_running(netdev)) {
 			ret = phy_start_aneg(phy);
+			value = readl(priv->ioaddr + GMAC_AN_CTRL);
+			value |= 0x1000;
+			writel(value, priv->ioaddr + GMAC_AN_CTRL);
+		}

What's your opinion?

>Take a look at the stmmac_adjust_link that is called by the PAL.
>
And for speed setting, if user changes speed to 1Gb using ethtool, link
status will be changed and then interrupt is occurred.
In the ISR, If interface is SGMII, I want to add SGMRAL setting and PS(port
selection)for guarantee 1Gb speed.
In the dwmac1000_core.c file.

+	if (unlikely(intr_status & pcs_link_irq)) {
+		readl(ioaddr + GMAC_AN_STATUS);
+		status |= core_irq_pcs_link;

+		if (priv->phydev->interface == PHY_INTERFACE_MODE_SGMII) {
+			value = readl(priv->ioaddr);
+			/* GMAC_CONTROL_PS : Port Selection for GMII */
+			value &= ~GMAC_CONTROL_PS;
+			writel(value, priv->ioaddr);
+			value = readl(priv->ioaddr + GMAC_AN_CTRL);
+			value = |= 0x40000;
+			writel(value, priv->ioaddr + GMAC_AN_CTRL);
+		}		
+	}

Then stmmac_interrupt in stmmac_main.c file call stmmac_adjust_link when
status is core_irq_pcs_link.
 
For auto-negotiation complete interrupt, I try to make patch like below

drivers/net/ethernet/stmicro/stmmac/common.h
@@ -184,6 +184,7 @@ enum core_specific_irq_mask {
 	core_irq_tx_path_exit_lpi_mode = 32,
 	core_irq_rx_path_in_lpi_mode = 64,
 	core_irq_rx_path_exit_lpi_mode = 128,
+	core_irq_rx_pcs_an = 256,
 };
 
 /* DMA HW capabilities */

drivers/net/ethernet/stmicro/stmmac/dwmac1000_core.c
@@ -231,6 +231,11 @@ static int dwmac1000_irq_status(void __iomem *ioaddr)
 		readl(ioaddr + GMAC_PMT);
 		status |= core_irq_receive_pmt_irq;
 	}
+	if (unlikely(intr_status & pcs_ane_irq)) {
+		CHIP_DBG(KERN_INFO "GMAC: PCS Auto-negotiation complete\n");
+		readl(ioaddr + GMAC_AN_STATUS);
+		status |= core_irq_pcs_an;
+	}
 	/* MAC trx/rx EEE LPI entry/exit interrupts */
 	if (intr_status & lpiis_irq) {
 		/* Clean LPI interrupt by reading the Reg 12 */

drivers/net/ethernet/stmicro/stmmac/stmmac.h
@@ -93,6 +93,7 @@ struct stmmac_priv {
 	int eee_enabled;
 	int eee_active;
 	int tx_lpi_timer;
+	bool core_pcs_an;
 };
 
 extern int phyaddr;

drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -1658,6 +1658,8 @@ static irqreturn_t stmmac_interrupt(int irq, void
*dev_id)
 				priv->xstats.mmc_rx_csum_offload_irq_n++;
 			if (status & core_irq_receive_pmt_irq)
 				priv->xstats.irq_receive_pmt_irq_n++;
+			if (status & core_irq_pcs_an)
+				priv->core_pcs_an = true;
 
 			/* For LPI we need to save the tx status */
 			if (status & core_irq_tx_path_in_lpi_mode) {

>>>> Signed-off-by: Byungho An <bh74.an@samsung.com>
>>>> ---
>>>
>>> [snip]
>>>
>>>> +	if (priv->phydev->interface == PHY_INTERFACE_MODE_SGMII) {
>>>> +		value = readl(priv->ioaddr);
>>>> +		/* GMAC_CONTROL_TC : transmit config in RGMII/SGMII */
>>>> +		value |= 0x1000000;
>>>> +		/* GMAC_CONTROL_PS : Port Selection for GMII */
>>>> +		value &= ~(0x8000);
>>>> +		writel(value, priv->ioaddr);
>>>> +	}
>>>> +
>>>
>>>
>>> This parts of code have to be moved in
>>> drivers/net/ethernet/stmicro/stmmac/dwmac1000_core.c
>>>
>> OK.
>>
I moved this code todwmac1000_core.c using hw feature register like below

+#include "dwmac_dma.h"
 
 static void dwmac1000_core_init(void __iomem *ioaddr)
 {
 	u32 value = readl(ioaddr + GMAC_CONTROL);
+	u32 hw_feature = readl(ioaddr + DMA_HW_FEATURE);
 	value |= GMAC_CORE_INIT;
 	writel(value, ioaddr + GMAC_CONTROL);
+	hw_feature &= DMA_HW_FEAT_ACTPHYIF;
+
+	if ((hw_feature >> 28 < 2) && (hw_feature != 0) {
+		/* transmit config in RGMII/SGMII */
+		value |= GMAC_CONTROL_TC;
+	}
+	writel(value, ioaddr + GMAC_CONTROL);

>>> Pls, do not use value |= 0x1000000 but provide the appropriate defines.
>>>
>> OK.
>>
>>>>    	/* Request the IRQ lines */
>>>>    	ret = request_irq(dev->irq, stmmac_interrupt,
>>>>    			 IRQF_SHARED, dev->name, dev);
>>>>
>> Thank you.
>
>you are welcome
>Peppe
>
>> Byungho An.
>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe netdev" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>
Thank you.
Byungho An.

 --
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org More majordomo
info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: TCP sequence number inference attack on Linux
From: Zhiyun Qian @ 2012-12-21 19:13 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev
In-Reply-To: <1356115931.21834.7745.camel@edumazet-glaptop>

That seems like a good idea. I am not sure how it is implemented
though. Is it a new feature of Linux? Would you mind sending some
pointers for this?

Thanks.
-Zhiyun

On Fri, Dec 21, 2012 at 1:52 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Fri, 2012-12-21 at 10:31 -0800, Eric Dumazet wrote:
>
>> I believe RFC 5961 was implemented in recent linux versions.
>>
>> Is the described vulnerability still present ?
>>
>
> By the way, I believe Chrome browser uses private network namespaces,
> and statistics are per network namespace, so it should be safe.
>
>
>

^ permalink raw reply

* Re: TCP sequence number inference attack on Linux
From: Eric Dumazet @ 2012-12-21 19:27 UTC (permalink / raw)
  To: Zhiyun Qian; +Cc: netdev
In-Reply-To: <CALvgte9x4LWV8rsquKzNoVcXM14xZjvyuTgpHr-ynZXfGtzzpw@mail.gmail.com>

On Fri, 2012-12-21 at 14:10 -0500, Zhiyun Qian wrote:
> That's good to know. However, implementing RFC 5961 alone is not
> sufficient. Like I said, since DelayedAckLost counter is incremented
> purely upon looking at the sequence number, regardless of the ACK
> number. An attacker thus can still infer the sequence number based on
> DelayedAckLost counter without knowing the right ACK number.
> 



> The next question is how can the attacker eventually know the right
> ACK number in order to inject real data. It turns out that this is not
> hard either. First, based on the current Linux TCP stack, it accepts
> incoming packets without ACK flag. 

I dont really think so.

We must discard frame is th->ack is not set. (Step 5, line 6142)



> Further, if ACK flag is not set,
> ACK number will not be checked at all. See code in
> net/ipv4/tcp_input.c, function tcp_rcv_established()
> 
> 5547        if (th->ack && tcp_ack(sk, skb, FLAG_SLOWPATH) < 0)
> 5548                goto discard;
> 
> Second, even if ACK number is always checked before accepting the
> payload, it is still possible to infer the ACK number much like how
> sequence number can be inferred. The details is described in Section
> 3.4 of my paper, paragraph starting with "Client-side sequence number
> inference".
> 
> I'm looking at the latest kernel v3.7.1 right now. I believe the
> problem do still exist in today's Linux.
> 

It seems you know pretty well this code, I wonder why you dont send
patches to fix the bugs...

Its not like it has to be buggy forever.

^ permalink raw reply

* Re: TCP sequence number inference attack on Linux
From: Eric Dumazet @ 2012-12-21 19:30 UTC (permalink / raw)
  To: Zhiyun Qian; +Cc: netdev
In-Reply-To: <CALvgte-=v=bRnQ2HPDxyqxBOjD6DXp_BUySW6Hm7h+81M+RYMA@mail.gmail.com>

On Fri, 2012-12-21 at 14:13 -0500, Zhiyun Qian wrote:
> That seems like a good idea. I am not sure how it is implemented
> though. Is it a new feature of Linux? Would you mind sending some
> pointers for this?

I dont have details, I only remember having to fix a memory allocation
error that was hitting Chrome users.

http://comments.gmane.org/gmane.linux.network/249535

Julien said at that time :

<quote>
It happens when users start Chrome. Chrome will create one new network
NS (for the sandbox).

This has been used for a few years now, but we had our first report in
January of this year and we've been getting a few reports very
recently at a rate that is starting to worry me (crbug.com/110756).

Thanks a lot for helping with this!

Julien
</quote>

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox