Netdev List
 help / color / mirror / Atom feed
* Re: [GIT PULL] Support for Cadence GEM in the MACB driver
From: Jean-Christophe PLAGNIOL-VILLARD @ 2011-11-22 10:00 UTC (permalink / raw)
  To: Jamie Iles; +Cc: netdev, Nicolas Ferre, linux-arm-kernel, Arnd Bergmann, davem
In-Reply-To: <20111121100020.GB7314@totoro>

On 10:00 Mon 21 Nov     , Jamie Iles wrote:
> Hi Arnd,
> 
> Please consider pulling the patches to add support for Cadence GEM to 
> the MACB driver.  These have been ready to go for a little while but got 
> held up with the rename of Ethernet drivers in the last merge window.  
> It would be great if we can get some exposure in -next for a little 
> while before the next merge window.
> 
> Thanks,
> 
> Jamie
> 
> The following changes since commit cfcfc9eca2bcbd26a8e206baeb005b055dbf8e37:
> 
>   Linux 3.2-rc2 (2011-11-15 15:02:59 -0200)
> 
> are available in the git repository at:
>   git://github.com/jamieiles/linux-2.6-ji.git macb-gem
Arnd please wait

please update his patch

"at91: provide macb clks with "pclk" and "hclk" name"

with the version I send to the ML

Best Regards,
J.
> 
> Jamie Iles (9):
>       at91: provide macb clks with "pclk" and "hclk" name
>       macb: remove conditional clk handling
>       macb: unify at91 and avr32 platform data
>       macb: convert printk to netdev_ and friends
>       macb: initial support for Cadence GEM
>       macb: support higher rate GEM MDIO clock divisors
>       macb: support statistics for GEM devices
>       macb: support DMA bus widths > 32 bits
>       macb: allow GEM to have configurable receive buffer size
> 
>  arch/arm/mach-at91/at91cap9.c               |    8 +-
>  arch/arm/mach-at91/at91cap9_devices.c       |    6 +-
>  arch/arm/mach-at91/at91rm9200_devices.c     |    6 +-
>  arch/arm/mach-at91/at91sam9260.c            |    8 +-
>  arch/arm/mach-at91/at91sam9260_devices.c    |    6 +-
>  arch/arm/mach-at91/at91sam9263.c            |    8 +-
>  arch/arm/mach-at91/at91sam9263_devices.c    |    6 +-
>  arch/arm/mach-at91/at91sam9g45.c            |    8 +-
>  arch/arm/mach-at91/at91sam9g45_devices.c    |    6 +-
>  arch/arm/mach-at91/board-1arm.c             |    2 +-
>  arch/arm/mach-at91/board-afeb-9260v1.c      |    2 +-
>  arch/arm/mach-at91/board-cam60.c            |    2 +-
>  arch/arm/mach-at91/board-cap9adk.c          |    2 +-
>  arch/arm/mach-at91/board-carmeva.c          |    2 +-
>  arch/arm/mach-at91/board-cpu9krea.c         |    2 +-
>  arch/arm/mach-at91/board-cpuat91.c          |    2 +-
>  arch/arm/mach-at91/board-csb337.c           |    2 +-
>  arch/arm/mach-at91/board-csb637.c           |    2 +-
>  arch/arm/mach-at91/board-eb9200.c           |    2 +-
>  arch/arm/mach-at91/board-ecbat91.c          |    2 +-
>  arch/arm/mach-at91/board-eco920.c           |    2 +-
>  arch/arm/mach-at91/board-foxg20.c           |    2 +-
>  arch/arm/mach-at91/board-gsia18s.c          |    2 +-
>  arch/arm/mach-at91/board-kafa.c             |    2 +-
>  arch/arm/mach-at91/board-kb9202.c           |    2 +-
>  arch/arm/mach-at91/board-neocore926.c       |    2 +-
>  arch/arm/mach-at91/board-pcontrol-g20.c     |    2 +-
>  arch/arm/mach-at91/board-picotux200.c       |    2 +-
>  arch/arm/mach-at91/board-qil-a9260.c        |    2 +-
>  arch/arm/mach-at91/board-rm9200dk.c         |    2 +-
>  arch/arm/mach-at91/board-rm9200ek.c         |    2 +-
>  arch/arm/mach-at91/board-rsi-ews.c          |    2 +-
>  arch/arm/mach-at91/board-sam9-l9260.c       |    2 +-
>  arch/arm/mach-at91/board-sam9260ek.c        |    2 +-
>  arch/arm/mach-at91/board-sam9263ek.c        |    2 +-
>  arch/arm/mach-at91/board-sam9g20ek.c        |    2 +-
>  arch/arm/mach-at91/board-sam9m10g45ek.c     |    2 +-
>  arch/arm/mach-at91/board-snapper9260.c      |    2 +-
>  arch/arm/mach-at91/board-stamp9g20.c        |    2 +-
>  arch/arm/mach-at91/board-usb-a926x.c        |    2 +-
>  arch/arm/mach-at91/board-yl-9200.c          |    2 +-
>  arch/arm/mach-at91/include/mach/board.h     |   14 +-
>  arch/avr32/boards/atngw100/setup.c          |    2 +-
>  arch/avr32/boards/atstk1000/atstk1002.c     |    2 +-
>  arch/avr32/boards/favr-32/setup.c           |    2 +-
>  arch/avr32/boards/hammerhead/setup.c        |    2 +-
>  arch/avr32/boards/merisc/setup.c            |    2 +-
>  arch/avr32/boards/mimc200/setup.c           |    2 +-
>  arch/avr32/mach-at32ap/at32ap700x.c         |    8 +-
>  arch/avr32/mach-at32ap/include/mach/board.h |    7 +-
>  drivers/net/ethernet/Makefile               |    2 +-
>  drivers/net/ethernet/cadence/Kconfig        |   16 +-
>  drivers/net/ethernet/cadence/at91_ether.c   |    3 +-
>  drivers/net/ethernet/cadence/at91_ether.h   |    4 +-
>  drivers/net/ethernet/cadence/macb.c         |  344 +++++++++++++++++----------
>  drivers/net/ethernet/cadence/macb.h         |  150 ++++++++++++-
>  include/linux/platform_data/macb.h          |   17 ++
>  57 files changed, 492 insertions(+), 211 deletions(-)
>  create mode 100644 include/linux/platform_data/macb.h

^ permalink raw reply

* Re: [GIT PULL] Support for Cadence GEM in the MACB driver
From: Jamie Iles @ 2011-11-22 10:20 UTC (permalink / raw)
  To: Jean-Christophe PLAGNIOL-VILLARD
  Cc: Jamie Iles, Arnd Bergmann, Nicolas Ferre, netdev,
	linux-arm-kernel, davem
In-Reply-To: <20111122100032.GL21480@game.jcrosoft.org>

Hi Jean-Christophe,

On Tue, Nov 22, 2011 at 11:00:32AM +0100, Jean-Christophe PLAGNIOL-VILLARD wrote:
> On 10:00 Mon 21 Nov     , Jamie Iles wrote:
> > Hi Arnd,
> > 
> > Please consider pulling the patches to add support for Cadence GEM to 
> > the MACB driver.  These have been ready to go for a little while but got 
> > held up with the rename of Ethernet drivers in the last merge window.  
> > It would be great if we can get some exposure in -next for a little 
> > while before the next merge window.
> > 
> > Thanks,
> > 
> > Jamie
> > 
> > The following changes since commit cfcfc9eca2bcbd26a8e206baeb005b055dbf8e37:
> > 
> >   Linux 3.2-rc2 (2011-11-15 15:02:59 -0200)
> > 
> > are available in the git repository at:
> >   git://github.com/jamieiles/linux-2.6-ji.git macb-gem
> Arnd please wait
> 
> please update his patch
> 
> "at91: provide macb clks with "pclk" and "hclk" name"
> 
> with the version I send to the ML

Is that a patch that you've already posted or will be posting?  Is there 
a board that I've missed?  I thought I had them all.

Jamie

^ permalink raw reply

* [PATCH] r8169: Restore MAC address after resume on buggy BIOS
From: Lee, Chun-Yi @ 2011-11-22 10:35 UTC (permalink / raw)
  To: nic_swsd; +Cc: netdev, Lee, Chun-Yi, Hayes Wang, Francois Romieu

When we test r8169 driver with RTL8111E-VL, found have buggy BIOS
write garbage MAC address (e.g. FF:FF:FF:FF:FF:FF) to extended GigaMAC
registers when S3 resume. And, we also found Realtek Windows driver
can cover this buggy BIOS's problem.

So, this patch add a MAC address check in resume callback function and
restore MAC address if buggy BIOS changed it.

Signed-off-by: Lee, Chun-Yi <jlee@suse.com>
Cc: Hayes Wang <hayeswang@realtek.com>
Cc: Realtek linux nic maintainers <nic_swsd@realtek.com>
Cc: Francois Romieu <romieu@fr.zoreil.com>
---
 drivers/net/ethernet/realtek/r8169.c |   33 +++++++++++++++++++++++++++++++++
 1 files changed, 33 insertions(+), 0 deletions(-)

diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index 6f06aa1..91d4a09 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -3335,6 +3335,35 @@ static void rtl_rar_set(struct rtl8169_private *tp, u8 *addr)
 	spin_unlock_irq(&tp->lock);
 }
 
+static bool rtl_rar_check(struct rtl8169_private *tp, u8 *addr)
+{
+	void __iomem *ioaddr = tp->mmio_addr;
+	u32 high;
+	u32 low;
+
+	low  = addr[0] | (addr[1] << 8) | (addr[2] << 16) | (addr[3] << 24);
+	high = addr[4] | (addr[5] << 8);
+
+	if (RTL_R32(MAC4) != high)
+		return false;
+	if (RTL_R32(MAC0) != low)
+		return false;
+
+	if (tp->mac_version == RTL_GIGA_MAC_VER_34) {
+		if (rtl_eri_read(ioaddr, 0xe0, ERIAR_EXGMAC) != low)
+			return false;
+		if (rtl_eri_read(ioaddr, 0xe4, ERIAR_EXGMAC) != high)
+			return false;
+		if (rtl_eri_read(ioaddr, 0xf0, ERIAR_EXGMAC) != (low << 16))
+			return false;
+		if (rtl_eri_read(ioaddr, 0xf4, ERIAR_EXGMAC) != (high << 16 |
+								 low >> 16))
+			return false;
+	}
+
+	return true;
+}
+
 static int rtl_set_mac_address(struct net_device *dev, void *p)
 {
 	struct rtl8169_private *tp = netdev_priv(dev);
@@ -6098,6 +6127,10 @@ static int rtl8169_resume(struct device *device)
 
 	rtl8169_init_phy(dev, tp);
 
+	/* Restore MAC address if buggy BIOS changed it */
+	if (!rtl_rar_check(tp, dev->dev_addr))
+		rtl_rar_set(tp, dev->dev_addr);
+
 	if (netif_running(dev))
 		__rtl8169_resume(dev);
 
-- 
1.6.0.2

^ permalink raw reply related

* Re: WARNING: at mm/slub.c:3357, kernel BUG at mm/slub.c:3413
From: Markus Trippelsdorf @ 2011-11-22 11:21 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Christoph Lameter, Christian Kujau, Benjamin Herrenschmidt,
	Alex,Shi, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	Pekka Enberg, Matt Mackall, netdev@vger.kernel.org, Tejun Heo
In-Reply-To: <1321954729.2474.4.camel@edumazet-HP-Compaq-6005-Pro-SFF-PC>

On 2011.11.22 at 10:38 +0100, Eric Dumazet wrote:
> Le mardi 22 novembre 2011 à 09:45 +0100, Markus Trippelsdorf a écrit :
> 
> > I sometimes see the following pattern. Is this a false positive?
> > 
> > 
> > =============================================================================
> > BUG anon_vma: Redzone overwritten
> > -----------------------------------------------------------------------------
> > 
> > INFO: 0xffff88020f347c80-0xffff88020f347c87. First byte 0xbb instead of 0xcc
> > INFO: Allocated in anon_vma_fork+0x51/0x140 age=1 cpu=2 pid=1826
> > 	__slab_alloc.constprop.70+0x1ac/0x1e8
> > 	kmem_cache_alloc+0x12e/0x160
> > 	anon_vma_fork+0x51/0x140
> > 	dup_mm+0x1f2/0x4a0
> > 	copy_process+0xd10/0xf70
> > 	do_fork+0x100/0x2b0
> > 	sys_clone+0x23/0x30
> > 	stub_clone+0x13/0x20
> > INFO: Freed in __put_anon_vma+0x54/0xa0 age=0 cpu=1 pid=1827
> > 	__slab_free+0x33/0x2d0
> > 	kmem_cache_free+0x10e/0x120
> > 	__put_anon_vma+0x54/0xa0
> > 	unlink_anon_vmas+0x12f/0x1c0
> > 	free_pgtables+0x83/0xe0
> > 	exit_mmap+0xee/0x140
> > 	mmput+0x43/0xf0
> > 	flush_old_exec+0x33f/0x630
> > 	load_elf_binary+0x340/0x1960
> > 	search_binary_handler+0x8f/0x180
> > 	do_execve+0x2d3/0x370
> > 	sys_execve+0x42/0x70
> > 	stub_execve+0x6c/0xc0
> > INFO: Slab 0xffffea00083cd1c0 objects=10 used=9 fp=0xffff88020f347ab8 flags=0x4000000000000081
> > INFO: Object 0xffff88020f347c40 @offset=3136 fp=0xffff88020f347ab8
> > 
> > Bytes b4 ffff88020f347c30: 39 b6 fb ff 00 00 00 00 5a 5a 5a 5a 5a 5a 5a 5a  9.......ZZZZZZZZ
> > Object ffff88020f347c40: 30 c9 9b 0d 02 88 ff ff 01 00 00 00 00 00 5a 5a  0.............ZZ
> > Object ffff88020f347c50: 50 7c 34 0f 02 88 ff ff 50 7c 34 0f 02 88 ff ff  P|4.....P|4.....
> > Object ffff88020f347c60: 00 00 00 00 00 00 00 00 00 00 00 00 5a 5a 5a 5a  ............ZZZZ
> > Object ffff88020f347c70: 70 7c 34 0f 02 88 ff ff 70 7c 34 0f 02 88 ff ff  p|4.....p|4.....
> > Redzone ffff88020f347c80: bb bb bb bb bb bb bb bb                          ........
> > Padding ffff88020f347dc0: 5a 5a 5a 5a 5a 5a 5a 5a                          ZZZZZZZZ
> > Pid: 1820, comm: slabinfo Not tainted 3.2.0-rc2-00369-gbbbc479-dirty #83
> > Call Trace:
> >  [<ffffffff81105df8>] ? print_section+0x38/0x40
> >  [<ffffffff811062f3>] print_trailer+0xe3/0x150
> >  [<ffffffff811064f0>] check_bytes_and_report+0xe0/0x100
> >  [<ffffffff81107313>] check_object+0x183/0x240
> >  [<ffffffff81107eb0>] validate_slab_slab+0x1c0/0x230
> >  [<ffffffff8110a4a6>] validate_store+0xa6/0x190
> >  [<ffffffff8110573c>] slab_attr_store+0x1c/0x30
> >  [<ffffffff81168838>] sysfs_write_file+0xc8/0x140
> >  [<ffffffff811124a3>] vfs_write+0xa3/0x160
> >  [<ffffffff81112635>] sys_write+0x45/0x90
> >  [<ffffffff814d3ffb>] system_call_fastpath+0x16/0x1b
> > FIX anon_vma: Restoring 0xffff88020f347c80-0xffff88020f347c87=0xcc
> 
> 
> Wait a minute
> 
> You trigger this using slabinfo looping or something ?
> 
> Bug is in slabinfo then, dont use it, and see if bug triggers.
> 
> Given slub is now lockless, validate_slab_slab() is probably very wrong
> these days.

OK "slabinfo -v" is useless then. But that doesn't invalidate the BUGs
that I saw during boot. They happend before I could even run slabinfo
for the first time.

-- 
Markus

^ permalink raw reply

* Re: NFS TCP race condition with SOCK_ASYNC_NOSPACE
From: Trond Myklebust @ 2011-11-22 11:38 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <4ECA94F9.4090503-Sxgqhf6Nn4DQT0dZR+AlfA@public.gmane.org>

On Mon, 2011-11-21 at 18:14 +0000, Andrew Cooper wrote: 
> Following some debugging, I believe that the attached patch fixes the
> problem.
> 
> Simply returning EAGAIN is not sufficient, as the task does not get
> requeued, and times out 13 seconds later (as per our mount options). 
> Setting the SOCK_ASYNC_NOSPACE bit causes the requeue to happen.
> 
> I realize that this is a gross hack and I should probably not be using
> SOCK_ASYNC_NOSPACE in that way.  Is there a better way to achieve the
> same solution?
> 

What you are doing will cause the request to be put to sleep with no
guarantee that it will ever be woken up. Why would we want to do that if
there is no report of a tcp window/buffer space congestion?

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org
www.netapp.com

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH 5/5] tcp: skip cwnd moderation in TCP_CA_Open in tcp_try_to_open
From: Ilpo Järvinen @ 2011-11-22 11:47 UTC (permalink / raw)
  To: Yuchung Cheng
  Cc: Neal Cardwell, David Miller, Netdev, Nandita Dukkipati,
	Tom Herbert
In-Reply-To: <CAK6E8=cNBCJbQs9dKcOwU1yZMC6czgMEuZt4Ubt=t-kBzHfFmw@mail.gmail.com>

[-- Attachment #1: Type: TEXT/PLAIN, Size: 2970 bytes --]

On Mon, 21 Nov 2011, Yuchung Cheng wrote:

> On Sun, Nov 20, 2011 at 12:38 AM, Ilpo Järvinen
> <ilpo.jarvinen@helsinki.fi> wrote:
> > On Sat, 19 Nov 2011, Neal Cardwell wrote:
> >> If your concern is receivers lying, then there are other easy ways
> >> that a misbehaving receiver can get a sender to send too fast
> >> (e.g. cumulatively ACKing the highest sequence number seen,
> >> irrespective of holes). IMHO it would be a shame to penalize the vast
> >> majority of well-behaved users to plug one potential attack vector
> >> when there are other easy holes that an attacker would use.
> >
> > I disagree... there is huge difference as the receiver then has to gamble
> > with the reliability of the flow. DSACK attack is nasty in that that it
> > allows maintaining reliability while cheating bw.
> 
> I feel the original change Neal is proposing is getting lost. This patch 
> proposes when
> 
> a. ack is dubious
> b. but it's not time to recover yet
> c. and other checks in tcp_try_keep_open affirms the network is in
> good condition, i.e, stat == Open

Presence of DSACK certainly does disagree with your condition c (which 
was the purpose of this change? Though it could be present in a later 
segment). Network was in such a bad condition that DSACKs were needed in 
the first place.

> 3. do not moderate cwnd (not 3, not 10 or any magic number).
> 
> Your concern is valid, but if that's true for majority then I argue
> all undo logic on dsack or TS are causing major threats since they all
> require some trusts of the remote end. Why even bother processing
> DSACK if we can't trust them?

Right, it seems that DSACK RFCs (RFC 3708 mainly) don't care too much 
about the cheating possibility, just states that it is possible and 
that a non-available nonce solution would help. So yes, I myself find 
DSACK quite useless, albeit fancy, mechanism (there are some other uses 
not related to undo mentioned though in the RFC but I've not spent too 
much thinking the details).

As regards TS, we don't trust them fully either (see e.g., some CUBIC
related change from Stephen Hemminger couple of years ago).

> AFAIK cwnd moderation is to lower bursty drops not to discourage
> (dsack) cheating. I believe the reason is the same in
> tcp_try_to_open(). We are in common cases, e.g., loss-recovery, this
> logic hurts performance.

Quote from the patch description: "Senders were overriding cwnd values 
picked during an undo by calling tcp_moderate_cwnd()" ...so after all it 
has to do with undo being limited. IMHO only up to orig_cwnd/2+IW is safe 
due to cheating opportunities. Also FRTO uses orig_cwnd/2 due to same 
reason (it could do the +IW too but IIRC it is only /2 currently). What 
would be the safeguard there after this one is removed? I kind of see your 
point about this particular line being related to burst mitigation but on 
the same time the end result of removal is that undo becomes potentially 
much more aggressive.


-- 
 i.

^ permalink raw reply

* Re: NFS TCP race condition with SOCK_ASYNC_NOSPACE
From: Andrew Cooper @ 2011-11-22 12:02 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-nfs@vger.kernel.org, netdev@vger.kernel.org
In-Reply-To: <1321961913.3323.67.camel@lade.trondhjem.org>

On 22/11/11 11:38, Trond Myklebust wrote:
> On Mon, 2011-11-21 at 18:14 +0000, Andrew Cooper wrote: 
>> Following some debugging, I believe that the attached patch fixes the
>> problem.
>>
>> Simply returning EAGAIN is not sufficient, as the task does not get
>> requeued, and times out 13 seconds later (as per our mount options). 
>> Setting the SOCK_ASYNC_NOSPACE bit causes the requeue to happen.
>>
>> I realize that this is a gross hack and I should probably not be using
>> SOCK_ASYNC_NOSPACE in that way.  Is there a better way to achieve the
>> same solution?
>>
> What you are doing will cause the request to be put to sleep with no
> guarantee that it will ever be woken up. Why would we want to do that if
> there is no report of a tcp window/buffer space congestion?

But the reason we get to this code is because there was a report of
space collision.  What would you suggest instead?  Changing
xs_{tcp,udp}_send_request() to retry in this case would defeat the point
of having xs_nospace().

What should happen is the request getting re-queued to run at the next
available opportunity, rather than perhaps sleeping for a certain length
of time.  At the moment, leaving SOCK_ASYNC_NOSPACE unset causes the
request to never be woken, whereas setting that bit seems to always be
re-queued at some near point in the future.

-- 
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com

^ permalink raw reply

* Re: NFS TCP race condition with SOCK_ASYNC_NOSPACE
From: Trond Myklebust @ 2011-11-22 12:10 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: linux-nfs@vger.kernel.org, netdev@vger.kernel.org
In-Reply-To: <4ECB8F47.105@citrix.com>

On Tue, 2011-11-22 at 12:02 +0000, Andrew Cooper wrote: 
> On 22/11/11 11:38, Trond Myklebust wrote:
> > On Mon, 2011-11-21 at 18:14 +0000, Andrew Cooper wrote: 
> >> Following some debugging, I believe that the attached patch fixes the
> >> problem.
> >>
> >> Simply returning EAGAIN is not sufficient, as the task does not get
> >> requeued, and times out 13 seconds later (as per our mount options). 
> >> Setting the SOCK_ASYNC_NOSPACE bit causes the requeue to happen.
> >>
> >> I realize that this is a gross hack and I should probably not be using
> >> SOCK_ASYNC_NOSPACE in that way.  Is there a better way to achieve the
> >> same solution?
> >>
> > What you are doing will cause the request to be put to sleep with no
> > guarantee that it will ever be woken up. Why would we want to do that if
> > there is no report of a tcp window/buffer space congestion?
> 
> But the reason we get to this code is because there was a report of
> space collision.  What would you suggest instead?  Changing
> xs_{tcp,udp}_send_request() to retry in this case would defeat the point
> of having xs_nospace().

I suggest doing absolutely nothing: do what you originally proposed,
which is to report the EAGAIN so that the client state machine retries
the socket write.

My point is that this is a context which is _not_ atomic with the
original report of tcp window/buffer space congestion. There are no
locks or anything else that will guarantee that the congestion still
exists, and the fact that the SOCK_ASYNC_NOSPACE flag is now clear
indicates that this is the case.
The whole purpose of xs_nospace() is to wait until a congestion
condition clears. If the congestion clears before we get here, then we
have no reason to do anything special other than retry.

Trond

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com

^ permalink raw reply

* Re: NFS TCP race condition with SOCK_ASYNC_NOSPACE
From: Andrew Cooper @ 2011-11-22 12:16 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-nfs@vger.kernel.org, netdev@vger.kernel.org
In-Reply-To: <1321963839.7645.5.camel@lade.trondhjem.org>

On 22/11/11 12:10, Trond Myklebust wrote:
> On Tue, 2011-11-22 at 12:02 +0000, Andrew Cooper wrote: 
>> On 22/11/11 11:38, Trond Myklebust wrote:
>>> On Mon, 2011-11-21 at 18:14 +0000, Andrew Cooper wrote: 
>>>> Following some debugging, I believe that the attached patch fixes the
>>>> problem.
>>>>
>>>> Simply returning EAGAIN is not sufficient, as the task does not get
>>>> requeued, and times out 13 seconds later (as per our mount options). 
>>>> Setting the SOCK_ASYNC_NOSPACE bit causes the requeue to happen.
>>>>
>>>> I realize that this is a gross hack and I should probably not be using
>>>> SOCK_ASYNC_NOSPACE in that way.  Is there a better way to achieve the
>>>> same solution?
>>>>
>>> What you are doing will cause the request to be put to sleep with no
>>> guarantee that it will ever be woken up. Why would we want to do that if
>>> there is no report of a tcp window/buffer space congestion?
>> But the reason we get to this code is because there was a report of
>> space collision.  What would you suggest instead?  Changing
>> xs_{tcp,udp}_send_request() to retry in this case would defeat the point
>> of having xs_nospace().
> I suggest doing absolutely nothing: do what you originally proposed,
> which is to report the EAGAIN so that the client state machine retries
> the socket write.
>
> My point is that this is a context which is _not_ atomic with the
> original report of tcp window/buffer space congestion. There are no
> locks or anything else that will guarantee that the congestion still
> exists, and the fact that the SOCK_ASYNC_NOSPACE flag is now clear
> indicates that this is the case.
> The whole purpose of xs_nospace() is to wait until a congestion
> condition clears. If the congestion clears before we get here, then we
> have no reason to do anything special other than retry.
>
> Trond

I am slightly confused as to what you mean now.

When you take out the if(test_bit test and always set ret to EAGAIN and
requeue the request, the next time it wakes up is when it is killed due
to timeout.  This results in substantially worse effects for the
userspace, as the NFS session is killed.

Did you mean something else when you said "always report EAGAIN"?

-- 
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com

^ permalink raw reply

* Re: 3.2-rc2+: Reported regressions from 3.0 and 3.1
From: Andrea Arcangeli @ 2011-11-22 12:22 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Srivatsa S. Bhat, Rafael J. Wysocki, Linux Kernel Mailing List,
	Maciej Rutecki, Florian Mickler, Linus Torvalds,
	Kernel Testers List, Network Development, Linux ACPI,
	Linux PM List, Linux SCSI List, Linux Wireless List, DRI
In-Reply-To: <20111121215918.febd9fd4.akpm@linux-foundation.org>

On Mon, Nov 21, 2011 at 09:59:18PM -0800, Andrew Morton wrote:
> grr, nothing in that patch's changelog indicates that it fixed a
> regression nor that it fixed an infinite blockage of suspend.

Well it's not a recent thing so I didn't flag it as a regression. It
doesn't infinite block it, suspend works fine almost all the time (or
it would have been noticed before), and if you've bad luck and it
doesn't suspend the first time around like someone experienced, if you
try again a bit later it'll work.

> I moved it to my 3.2 queue, thanks.

Thanks!

^ permalink raw reply

* Re: NFS TCP race condition with SOCK_ASYNC_NOSPACE
From: Trond Myklebust @ 2011-11-22 12:22 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: linux-nfs@vger.kernel.org, netdev@vger.kernel.org
In-Reply-To: <4ECB9294.20002@citrix.com>

On Tue, 2011-11-22 at 12:16 +0000, Andrew Cooper wrote: 
> On 22/11/11 12:10, Trond Myklebust wrote:
> > On Tue, 2011-11-22 at 12:02 +0000, Andrew Cooper wrote: 
> >> On 22/11/11 11:38, Trond Myklebust wrote:
> >>> On Mon, 2011-11-21 at 18:14 +0000, Andrew Cooper wrote: 
> >>>> Following some debugging, I believe that the attached patch fixes the
> >>>> problem.
> >>>>
> >>>> Simply returning EAGAIN is not sufficient, as the task does not get
> >>>> requeued, and times out 13 seconds later (as per our mount options). 
> >>>> Setting the SOCK_ASYNC_NOSPACE bit causes the requeue to happen.
> >>>>
> >>>> I realize that this is a gross hack and I should probably not be using
> >>>> SOCK_ASYNC_NOSPACE in that way.  Is there a better way to achieve the
> >>>> same solution?
> >>>>
> >>> What you are doing will cause the request to be put to sleep with no
> >>> guarantee that it will ever be woken up. Why would we want to do that if
> >>> there is no report of a tcp window/buffer space congestion?
> >> But the reason we get to this code is because there was a report of
> >> space collision.  What would you suggest instead?  Changing
> >> xs_{tcp,udp}_send_request() to retry in this case would defeat the point
> >> of having xs_nospace().
> > I suggest doing absolutely nothing: do what you originally proposed,
> > which is to report the EAGAIN so that the client state machine retries
> > the socket write.
> >
> > My point is that this is a context which is _not_ atomic with the
> > original report of tcp window/buffer space congestion. There are no
> > locks or anything else that will guarantee that the congestion still
> > exists, and the fact that the SOCK_ASYNC_NOSPACE flag is now clear
> > indicates that this is the case.
> > The whole purpose of xs_nospace() is to wait until a congestion
> > condition clears. If the congestion clears before we get here, then we
> > have no reason to do anything special other than retry.
> >
> > Trond
> 
> I am slightly confused as to what you mean now.
> 
> When you take out the if(test_bit test and always set ret to EAGAIN and
> requeue the request, the next time it wakes up is when it is killed due
> to timeout.  This results in substantially worse effects for the
> userspace, as the NFS session is killed.

What is putting the request to sleep? It should be awake when it enters
xs_nospace(), and nothing in or after that function should be putting it
to sleep until we've retried with call_transmit().

> Did you mean something else when you said "always report EAGAIN"?

Nope.

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com

^ permalink raw reply

* Re: NFS TCP race condition with SOCK_ASYNC_NOSPACE
From: Andrew Cooper @ 2011-11-22 12:34 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <1321964578.7645.9.camel-SyLVLa/KEI9HwK5hSS5vWB2eb7JE58TQ@public.gmane.org>



On 22/11/11 12:22, Trond Myklebust wrote:
> On Tue, 2011-11-22 at 12:16 +0000, Andrew Cooper wrote: 
>> On 22/11/11 12:10, Trond Myklebust wrote:
>>> On Tue, 2011-11-22 at 12:02 +0000, Andrew Cooper wrote: 
>>>> On 22/11/11 11:38, Trond Myklebust wrote:
>>>>> On Mon, 2011-11-21 at 18:14 +0000, Andrew Cooper wrote: 
>>>>>> Following some debugging, I believe that the attached patch fixes the
>>>>>> problem.
>>>>>>
>>>>>> Simply returning EAGAIN is not sufficient, as the task does not get
>>>>>> requeued, and times out 13 seconds later (as per our mount options). 
>>>>>> Setting the SOCK_ASYNC_NOSPACE bit causes the requeue to happen.
>>>>>>
>>>>>> I realize that this is a gross hack and I should probably not be using
>>>>>> SOCK_ASYNC_NOSPACE in that way.  Is there a better way to achieve the
>>>>>> same solution?
>>>>>>
>>>>> What you are doing will cause the request to be put to sleep with no
>>>>> guarantee that it will ever be woken up. Why would we want to do that if
>>>>> there is no report of a tcp window/buffer space congestion?
>>>> But the reason we get to this code is because there was a report of
>>>> space collision.  What would you suggest instead?  Changing
>>>> xs_{tcp,udp}_send_request() to retry in this case would defeat the point
>>>> of having xs_nospace().
>>> I suggest doing absolutely nothing: do what you originally proposed,
>>> which is to report the EAGAIN so that the client state machine retries
>>> the socket write.
>>>
>>> My point is that this is a context which is _not_ atomic with the
>>> original report of tcp window/buffer space congestion. There are no
>>> locks or anything else that will guarantee that the congestion still
>>> exists, and the fact that the SOCK_ASYNC_NOSPACE flag is now clear
>>> indicates that this is the case.
>>> The whole purpose of xs_nospace() is to wait until a congestion
>>> condition clears. If the congestion clears before we get here, then we
>>> have no reason to do anything special other than retry.
>>>
>>> Trond
>> I am slightly confused as to what you mean now.
>>
>> When you take out the if(test_bit test and always set ret to EAGAIN and
>> requeue the request, the next time it wakes up is when it is killed due
>> to timeout.  This results in substantially worse effects for the
>> userspace, as the NFS session is killed.
> What is putting the request to sleep? It should be awake when it enters
> xs_nospace(), and nothing in or after that function should be putting it
> to sleep until we've retried with call_transmit().
>

I presume it is the call to xprt_wait_for_buffer_space() which calls
rpc_sleep_on().  There is xs_tcp_write_space which appears to wake it up
based on sk->sk_write_space which is triggered on the sock gaining more
space, which has already happened in this specific case.

Sorry if I am being a bit slow here - I am still learning my way round
an unfamiliar codebase.

>> Did you mean something else when you said "always report EAGAIN"?
> Nope.
>

-- 
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: MPLS for Linux kernel
From: Jorge Boncompte [DTI2] @ 2011-11-22 12:30 UTC (permalink / raw)
  To: igorm; +Cc: netdev, davem
In-Reply-To: <CAFdo_mXzqC_v5OkC6P012gFT_OmRO_yDTA0w96VhV-Lp3Hp7Fw@mail.gmail.com>

El 22/11/2011 9:52, Igor Maravić escribió:
> What are the main reasons for You to think that mpls-linux is not
> ready to go upstream?
> As I said, I fixed a lot of bugs so code can be run with all the
> kernel hacking options enabled, without causing panics and oopses.
> Also fixed a lot of other things and added preprocessor if statements
> where code is intertwined with normal kernel code, so I'm sure that it
> want break existing kernel compilation if the kernel is compiled
> without MPLS.
> Why would it be so hard to implemented it in terms of packet scheduler?

	First, please, don't top post.

	You keep insisting in that you fixed a lot of things, but you have provided a
git tree with just one big commit and say that have taken some of my patches on
it, could you please provide patches on TOP of the sourceforge code for the
things that are not fixed there? And for the new features like the MIB stats
work that you have done? It seems to me that you have not noticed that while
fixing bugs I have reworked a lot of code to make it cleaner or simpler, simply
deleted it and fixed the style.

	What's needs to be done, and it's on my TODO list...

	The kernel code that is not commented out on the mpls-linux code when you build
the kernel it the shim layer and it's not done on purpose. This code was written
by James to be a generic feature of the networking layer. Now I am not sure that
it has any value keeping it and am for removing it.
	The other thing that probably I am going to remove is the labelspace support. I
don't see a use for it, and even Cisco doesn't implement it either that I know.
	Then we must rework the netlink interface to make it cleaner and extensible.
	Move the tunnel code to use the netlink interface and generic tunnel code.
	Check the dst's usages, there has been a lot of changes in the core kernel here
lately and I am not sure if we are using it correctly.
	Check the locking and RCU usage.
	Look for a way to remove the mpls_ptr from the net_device structure. There's
code on another branch that does a radix-tree lookup for every packet but I
would like something simpler/lighter.

	As I said I am not for implementing MPLS support on top of the openvswitch
stuff, that I don't know, like I don't think that we are going to port the
bridge, vlan, ip_gre, or even l2tp support over it, aren't we? :)

	I am committed to support this code as best as I can and try to get it merged
but I would be nice if at least I can receive I confirmation of where we are
heading on upstream.

	Regards,
		Jorge

^ permalink raw reply

* Re: [PATCH 4/5] tcp: allow undo from reordered DSACKs
From: Ilpo Järvinen @ 2011-11-22 12:44 UTC (permalink / raw)
  To: Neal Cardwell
  Cc: David Miller, Netdev, Nandita Dukkipati, Yuchung Cheng,
	Tom Herbert
In-Reply-To: <CADVnQykNffEvpr5bnLiLyNGZiuhxx=5wxqFaZjC+Dk4mxe1Okw@mail.gmail.com>

[-- Attachment #1: Type: TEXT/PLAIN, Size: 4409 bytes --]

On Thu, 17 Nov 2011, Neal Cardwell wrote:

> On Thu, Nov 17, 2011 at 12:18 AM, Ilpo Järvinen
> <ilpo.jarvinen@helsinki.fi> wrote:
> > On Wed, 16 Nov 2011, Neal Cardwell wrote:
> >
> >> Previously, SACK-enabled connections hung around in TCP_CA_Disorder
> >> state while snd_una==high_seq, just waiting to accumulate DSACKs and
> >> hopefully undo a cwnd reduction. This could and did lead to the
> >> following unfortunate scenario: if some incoming ACKs advance snd_una
> >> beyond high_seq then we were setting undo_marker to 0 and moving to
> >> TCP_CA_Open, so if (due to reordering in the ACK return path) we
> >> shortly thereafter received a DSACK then we were no longer able to
> >> undo the cwnd reduction.
> >>
> >> The change: Simplify the congestion avoidance state machine by
> >> removing the behavior where SACK-enabled connections hung around in
> >> the TCP_CA_Disorder state just waiting for DSACKs. Instead, when
> >> snd_una advances to high_seq or beyond we typically move to
> >> TCP_CA_Open immediately and allow an undo in either TCP_CA_Open or
> >> TCP_CA_Disorder if we later receive enough DSACKs.
> >>
> >> Other patches in this series will provide other changes that are
> >> necessary to fully fix this problem.
> >>
> >> Signed-off-by: Neal Cardwell <ncardwell@google.com>
> >> ---
> >>  net/ipv4/tcp_input.c |   15 ++-------------
> >>  1 files changed, 2 insertions(+), 13 deletions(-)
> >>
> >> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> >> index 751d390..a4efdd7 100644
> >> --- a/net/ipv4/tcp_input.c
> >> +++ b/net/ipv4/tcp_input.c
> >> @@ -2858,7 +2858,7 @@ static void tcp_try_keep_open(struct sock *sk)
> >>       struct tcp_sock *tp = tcp_sk(sk);
> >>       int state = TCP_CA_Open;
> >>
> >> -     if (tcp_left_out(tp) || tcp_any_retrans_done(sk) || tp->undo_marker)
> >> +     if (tcp_left_out(tp) || tcp_any_retrans_done(sk))
> >>               state = TCP_CA_Disorder;
> >>
> >>       if (inet_csk(sk)->icsk_ca_state != state) {
> >> @@ -3066,17 +3066,6 @@ static void tcp_fastretrans_alert(struct sock *sk, int pkts_acked,
> >>                       }
> >>                       break;
> >>
> >> -             case TCP_CA_Disorder:
> >> -                     tcp_try_undo_dsack(sk);
> >> -                     if (!tp->undo_marker ||
> >> -                         /* For SACK case do not Open to allow to undo
> >> -                          * catching for all duplicate ACKs. */
> >> -                         tcp_is_reno(tp) || tp->snd_una != tp->high_seq) {
> >> -                             tp->undo_marker = 0;
> >> -                             tcp_set_ca_state(sk, TCP_CA_Open);
> >> -                     }
> >> -                     break;
> >> -
> >>               case TCP_CA_Recovery:
> >>                       if (tcp_is_reno(tp))
> >>                               tcp_reset_reno_sack(tp);
> >> @@ -3117,7 +3106,7 @@ static void tcp_fastretrans_alert(struct sock *sk, int pkts_acked,
> >>                               tcp_add_reno_sack(sk);
> >>               }
> >>
> >> -             if (icsk->icsk_ca_state == TCP_CA_Disorder)
> >> +             if (icsk->icsk_ca_state <= TCP_CA_Disorder)
> >>                       tcp_try_undo_dsack(sk);
> >>
> >>               if (!tcp_time_to_recover(sk)) {
> >
> > How about extending Disorder state until second cumulative ACK that is
> > acking >= high_seq?
> 
> That would seem to add complexity but only provide a partial solution.

Right, I forgot the reordering.

> This proposed patch has the virtue of providing a general solution
> while simplifying the code a little.
>
> What are your concerns with this patch?

...I was thinking that if we go to the direction of removal of more and 
more CA_Disorder stuff we could just as well remove the whole state. But 
I've since that time convinced myself that the state itself is still 
necessary even if less action takes place in it after this patch.

Also, one (serious) word of caution! This change, by its extending of 
CA_Open state, is somewhat similar to what I made FRTO years ago, and 
managed to introduces subtle breaking to the state machine. Please check 
that the problem similar to what was fixed by 
a1c1f281b84a751fdb5ff919da3b09df7297619f does not reappear (possibly in 
some other form causing spurious undos). ...To me it seems that 
tcp_packet_delayed might be similarly confused after the given patch.

-- 
 i.

^ permalink raw reply

* Re: NFS TCP race condition with SOCK_ASYNC_NOSPACE
From: Trond Myklebust @ 2011-11-22 12:45 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <4ECB96DA.9030202-Sxgqhf6Nn4DQT0dZR+AlfA@public.gmane.org>

On Tue, 2011-11-22 at 12:34 +0000, Andrew Cooper wrote: 
> 
> On 22/11/11 12:22, Trond Myklebust wrote:
> > On Tue, 2011-11-22 at 12:16 +0000, Andrew Cooper wrote: 
> >> On 22/11/11 12:10, Trond Myklebust wrote:
> >>> On Tue, 2011-11-22 at 12:02 +0000, Andrew Cooper wrote: 
> >>>> On 22/11/11 11:38, Trond Myklebust wrote:
> >>>>> On Mon, 2011-11-21 at 18:14 +0000, Andrew Cooper wrote: 
> >>>>>> Following some debugging, I believe that the attached patch fixes the
> >>>>>> problem.
> >>>>>>
> >>>>>> Simply returning EAGAIN is not sufficient, as the task does not get
> >>>>>> requeued, and times out 13 seconds later (as per our mount options). 
> >>>>>> Setting the SOCK_ASYNC_NOSPACE bit causes the requeue to happen.
> >>>>>>
> >>>>>> I realize that this is a gross hack and I should probably not be using
> >>>>>> SOCK_ASYNC_NOSPACE in that way.  Is there a better way to achieve the
> >>>>>> same solution?
> >>>>>>
> >>>>> What you are doing will cause the request to be put to sleep with no
> >>>>> guarantee that it will ever be woken up. Why would we want to do that if
> >>>>> there is no report of a tcp window/buffer space congestion?
> >>>> But the reason we get to this code is because there was a report of
> >>>> space collision.  What would you suggest instead?  Changing
> >>>> xs_{tcp,udp}_send_request() to retry in this case would defeat the point
> >>>> of having xs_nospace().
> >>> I suggest doing absolutely nothing: do what you originally proposed,
> >>> which is to report the EAGAIN so that the client state machine retries
> >>> the socket write.
> >>>
> >>> My point is that this is a context which is _not_ atomic with the
> >>> original report of tcp window/buffer space congestion. There are no
> >>> locks or anything else that will guarantee that the congestion still
> >>> exists, and the fact that the SOCK_ASYNC_NOSPACE flag is now clear
> >>> indicates that this is the case.
> >>> The whole purpose of xs_nospace() is to wait until a congestion
> >>> condition clears. If the congestion clears before we get here, then we
> >>> have no reason to do anything special other than retry.
> >>>
> >>> Trond
> >> I am slightly confused as to what you mean now.
> >>
> >> When you take out the if(test_bit test and always set ret to EAGAIN and
> >> requeue the request, the next time it wakes up is when it is killed due
> >> to timeout.  This results in substantially worse effects for the
> >> userspace, as the NFS session is killed.
> > What is putting the request to sleep? It should be awake when it enters
> > xs_nospace(), and nothing in or after that function should be putting it
> > to sleep until we've retried with call_transmit().
> >
> 
> I presume it is the call to xprt_wait_for_buffer_space() which calls
> rpc_sleep_on().  There is xs_tcp_write_space which appears to wake it up
> based on sk->sk_write_space which is triggered on the sock gaining more
> space, which has already happened in this specific case.

That call should only happen if we have to wait for congestion (i.e. if
SOCK_ASYNC_NOSPACE is set). 

Please can you test if the following patch fixes the problem.

Cheers
  Trond

8<--------------------------------------------------------------------- 
>From 729b3861d9a2820735b648a044d558315bc9c9db Mon Sep 17 00:00:00 2001
From: Trond Myklebust <Trond.Myklebust-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org>
Date: Tue, 22 Nov 2011 14:44:28 +0200
Subject: [PATCH] SUNRPC: Ensure we return EAGAIN in xs_nospace if congestion
 is cleared

By returning '0' instead of 'EAGAIN' when the tests in xs_nospace() fail
to find evidence of socket congestion, we are making the RPC engine believe
that the message was incompletely sent, and so it disconnects the socket
instead of just retrying the send.

The bug appears to have been introduced by commit
5e3771ce2d6a69e10fcc870cdf226d121d868491 (SUNRPC: Ensure that xs_nospace
return values are propagated).

Reported-by: Andrew Cooper <andrew.cooper3-Sxgqhf6Nn4DQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Trond Myklebust <Trond.Myklebust-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org>
Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org [>= 2.6.30]
---
 net/sunrpc/xprtsock.c |    3 +--
 1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index 2d78d95..55472c4 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -496,7 +496,7 @@ static int xs_nospace(struct rpc_task *task)
 	struct rpc_rqst *req = task->tk_rqstp;
 	struct rpc_xprt *xprt = req->rq_xprt;
 	struct sock_xprt *transport = container_of(xprt, struct sock_xprt, xprt);
-	int ret = 0;
+	int ret = -EAGAIN;
 
 	dprintk("RPC: %5u xmit incomplete (%u left of %u)\n",
 			task->tk_pid, req->rq_slen - req->rq_bytes_sent,
@@ -508,7 +508,6 @@ static int xs_nospace(struct rpc_task *task)
 	/* Don't race with disconnect */
 	if (xprt_connected(xprt)) {
 		if (test_bit(SOCK_ASYNC_NOSPACE, &transport->sock->flags)) {
-			ret = -EAGAIN;
 			/*
 			 * Notify TCP that we're limited by the application
 			 * window size
-- 
1.7.7.3


-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org
www.netapp.com

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* Re: [PATCH 2/4] ipv4: Update pmtu informations on inetpeer only for output routes
From: Steffen Klassert @ 2011-11-22 13:20 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20111114.143320.773675445837769668.davem@davemloft.net>

On Mon, Nov 14, 2011 at 02:33:20PM -0500, David Miller wrote:
> From: Steffen Klassert <steffen.klassert@secunet.com>
> Date: Mon, 14 Nov 2011 11:12:44 +0100
> 
> > So for the moment I'm thinking about adding an ip_dst_mtu()
> > function that returns dst->ops->default_mtu() for input routes
> > and dst_mtu() for output routes. Then we could convert the
> > dst_mtu() users in net/ipv4/ over to this one.
> 
> We'll need something similar for ipv6 eventually...
> 
> I would suggest that we do away with dst_ops->default_mtu() and just
> have dst_ops->mtu() which gets invoked unconditionally by dst_mtu().
> 

I found another pmtu related issue. Since commit 2774c131b
(xfrm: Handle blackhole route creation via afinfo)
we create a blackhole route even on packet forwarding
if we have a xfrm policy but we don't have yet the states.

In this case, the packet is not dropped immediately
but continues to travel in the packet forwarding path.
This means that the blackhole route's dst_ops->default_mtu()
method is invoked which returns a mtu of null. So
we announce a pmtu of null to the original sender of
the packet.

The simplest fix would be to return e.g. IP_MAX_MTU
as the default mtu on blackhole routes. But actually
I don't see why we should create a blackhole route on
packet forwarding as long as we finally drop the
packet anyway. So perhaps it is better to create
blackhole routes just in the case when we have socket
context.

^ permalink raw reply

* Re: NFS TCP race condition with SOCK_ASYNC_NOSPACE
From: Andrew Cooper @ 2011-11-22 13:23 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <1321965938.7645.13.camel-SyLVLa/KEI9HwK5hSS5vWB2eb7JE58TQ@public.gmane.org>

On 22/11/11 12:45, Trond Myklebust wrote:
> On Tue, 2011-11-22 at 12:34 +0000, Andrew Cooper wrote: 
>> On 22/11/11 12:22, Trond Myklebust wrote:
>>> On Tue, 2011-11-22 at 12:16 +0000, Andrew Cooper wrote: 
>>>> On 22/11/11 12:10, Trond Myklebust wrote:
>>>>> On Tue, 2011-11-22 at 12:02 +0000, Andrew Cooper wrote: 
>>>>>> On 22/11/11 11:38, Trond Myklebust wrote:
>>>>>>> On Mon, 2011-11-21 at 18:14 +0000, Andrew Cooper wrote: 
>>>>>>>> Following some debugging, I believe that the attached patch fixes the
>>>>>>>> problem.
>>>>>>>>
>>>>>>>> Simply returning EAGAIN is not sufficient, as the task does not get
>>>>>>>> requeued, and times out 13 seconds later (as per our mount options). 
>>>>>>>> Setting the SOCK_ASYNC_NOSPACE bit causes the requeue to happen.
>>>>>>>>
>>>>>>>> I realize that this is a gross hack and I should probably not be using
>>>>>>>> SOCK_ASYNC_NOSPACE in that way.  Is there a better way to achieve the
>>>>>>>> same solution?
>>>>>>>>
>>>>>>> What you are doing will cause the request to be put to sleep with no
>>>>>>> guarantee that it will ever be woken up. Why would we want to do that if
>>>>>>> there is no report of a tcp window/buffer space congestion?
>>>>>> But the reason we get to this code is because there was a report of
>>>>>> space collision.  What would you suggest instead?  Changing
>>>>>> xs_{tcp,udp}_send_request() to retry in this case would defeat the point
>>>>>> of having xs_nospace().
>>>>> I suggest doing absolutely nothing: do what you originally proposed,
>>>>> which is to report the EAGAIN so that the client state machine retries
>>>>> the socket write.
>>>>>
>>>>> My point is that this is a context which is _not_ atomic with the
>>>>> original report of tcp window/buffer space congestion. There are no
>>>>> locks or anything else that will guarantee that the congestion still
>>>>> exists, and the fact that the SOCK_ASYNC_NOSPACE flag is now clear
>>>>> indicates that this is the case.
>>>>> The whole purpose of xs_nospace() is to wait until a congestion
>>>>> condition clears. If the congestion clears before we get here, then we
>>>>> have no reason to do anything special other than retry.
>>>>>
>>>>> Trond
>>>> I am slightly confused as to what you mean now.
>>>>
>>>> When you take out the if(test_bit test and always set ret to EAGAIN and
>>>> requeue the request, the next time it wakes up is when it is killed due
>>>> to timeout.  This results in substantially worse effects for the
>>>> userspace, as the NFS session is killed.
>>> What is putting the request to sleep? It should be awake when it enters
>>> xs_nospace(), and nothing in or after that function should be putting it
>>> to sleep until we've retried with call_transmit().
>>>
>> I presume it is the call to xprt_wait_for_buffer_space() which calls
>> rpc_sleep_on().  There is xs_tcp_write_space which appears to wake it up
>> based on sk->sk_write_space which is triggered on the sock gaining more
>> space, which has already happened in this specific case.
> That call should only happen if we have to wait for congestion (i.e. if
> SOCK_ASYNC_NOSPACE is set). 
>
> Please can you test if the following patch fixes the problem.
>
> Cheers
>   Trond

Yup - this works and seems like a rather more elegant solution.

Thanks for all you help - I have been chasing this bug on and off since
the middle of May.

> 8<--------------------------------------------------------------------- 
> From 729b3861d9a2820735b648a044d558315bc9c9db Mon Sep 17 00:00:00 2001
> From: Trond Myklebust <Trond.Myklebust-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org>
> Date: Tue, 22 Nov 2011 14:44:28 +0200
> Subject: [PATCH] SUNRPC: Ensure we return EAGAIN in xs_nospace if congestion
>  is cleared
>
> By returning '0' instead of 'EAGAIN' when the tests in xs_nospace() fail
> to find evidence of socket congestion, we are making the RPC engine believe
> that the message was incompletely sent, and so it disconnects the socket
> instead of just retrying the send.
>
> The bug appears to have been introduced by commit
> 5e3771ce2d6a69e10fcc870cdf226d121d868491 (SUNRPC: Ensure that xs_nospace
> return values are propagated).
>
> Reported-by: Andrew Cooper <andrew.cooper3-Sxgqhf6Nn4DQT0dZR+AlfA@public.gmane.org>
> Signed-off-by: Trond Myklebust <Trond.Myklebust-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org>

Tested-by: Andrew Cooper <andrew.cooper3-Sxgqhf6Nn4DQT0dZR+AlfA@public.gmane.org>

> Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org [>= 2.6.30]
> ---
>  net/sunrpc/xprtsock.c |    3 +--
>  1 files changed, 1 insertions(+), 2 deletions(-)
>
> diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
> index 2d78d95..55472c4 100644
> --- a/net/sunrpc/xprtsock.c
> +++ b/net/sunrpc/xprtsock.c
> @@ -496,7 +496,7 @@ static int xs_nospace(struct rpc_task *task)
>  	struct rpc_rqst *req = task->tk_rqstp;
>  	struct rpc_xprt *xprt = req->rq_xprt;
>  	struct sock_xprt *transport = container_of(xprt, struct sock_xprt, xprt);
> -	int ret = 0;
> +	int ret = -EAGAIN;
>  
>  	dprintk("RPC: %5u xmit incomplete (%u left of %u)\n",
>  			task->tk_pid, req->rq_slen - req->rq_bytes_sent,
> @@ -508,7 +508,6 @@ static int xs_nospace(struct rpc_task *task)
>  	/* Don't race with disconnect */
>  	if (xprt_connected(xprt)) {
>  		if (test_bit(SOCK_ASYNC_NOSPACE, &transport->sock->flags)) {
> -			ret = -EAGAIN;
>  			/*
>  			 * Notify TCP that we're limited by the application
>  			 * window size

-- 
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: softirq oops from b44_poll
From: Xander Hover @ 2011-11-22 13:43 UTC (permalink / raw)
  To: Peter P Waskiewicz Jr; +Cc: linux-kernel@vger.kernel.org, netdev
In-Reply-To: <1321917453.10276.3.camel@pjaxe>

Tested on 3.1.1 and 3.2_rc2+ now.
Fix is working as expected.

Kind regards,

Xander Hover


On Tue, Nov 22, 2011 at 12:17 AM, Peter P Waskiewicz Jr
<peter.p.waskiewicz.jr@intel.com> wrote:
> On Mon, 2011-11-21 at 05:58 -0800, Xander Hover wrote:
>> Hi all,
>>
>> I noticed the small discussion about the b44_poll OOPS and
>> I also have a uni-processor PC with a broadcom network device (b44)
>> that causes similar kernel OOPSes.
>>
>> Here is a (reproducible) trace that still shows up in kernel 3.1.1:
>>
>>  ------------[ cut here ]------------
>>     WARNING: at kernel/softirq.c:159 local_bh_enable+0x32/0x79()
>>     Hardware name: Dimension 2400
>>     Modules linked in: snd_seq_midi snd_emu10k1_synth snd_emux_synth
>> snd_seq_virmidi snd_seq_midi_emul snd_seq_oss snd_seq_midi_event
>> snd_seq snd_pcm_oss snd_mixer_oss bnep rfcomm cryptd aes_i586
>> aes_generic ecb btusb bluetooth rfkill ppdev snd_emu10k1 snd_rawmidi
>> snd_ac97_codec ac97_bus snd_pcm snd_seq_device snd_timer
>> snd_page_alloc dcdbas snd_util_mem parport_pc snd_hwdep snd parport
>> emu10k1_gp rtc_cmos gameport i2c_i801
>>     Pid: 0, comm: swapper Not tainted 3.1.1-gentoo #1
>>     Call Trace:
>>      [<c1022970>] warn_slowpath_common+0x65/0x7a
>>      [<c102699e>] ? local_bh_enable+0x32/0x79
>>      [<c1022994>] warn_slowpath_null+0xf/0x13
>>      [<c102699e>] local_bh_enable+0x32/0x79
>>      [<c134bfd8>] destroy_conntrack+0x7c/0x9b
>>      [<c134890b>] nf_conntrack_destroy+0x1f/0x26
>>      [<c132e3a6>] skb_release_head_state+0x74/0x83
>>      [<c132e286>] __kfree_skb+0xb/0x6b
>>      [<c132e30a>] consume_skb+0x24/0x26
>>      [<c127c925>] b44_poll+0xaa/0x449
>>      [<c1333ca1>] net_rx_action+0x3f/0xea
>>      [<c1026a44>] __do_softirq+0x5f/0xd5
>>      [<c10269e5>] ? local_bh_enable+0x79/0x79
>>      <IRQ>  [<c1026c32>] ? irq_exit+0x34/0x8d
>>      [<c1003628>] ? do_IRQ+0x74/0x87
>>      [<c13f5329>] ? common_interrupt+0x29/0x30
>>      [<c1006e18>] ? default_idle+0x29/0x3e
>>      [<c10015a7>] ? cpu_idle+0x2f/0x5d
>>      [<c13e91c5>] ? rest_init+0x79/0x7b
>>      [<c15c66a9>] ? start_kernel+0x297/0x29c
>>      [<c15c60b0>] ? i386_start_kernel+0xb0/0xb7
>>     ---[ end trace 583f33bb1aa207a9 ]---
>>
>>
>> However if I apply the following patch this error does not show up anymore:
>>
>>
>> diff --git a/drivers/net/ethernet/broadcom/b44.c
>> b/drivers/net/ethernet/broadcom/b44.c
>> index 4cf835d..3fb66d0 100644
>> --- a/drivers/net/ethernet/broadcom/b44.c
>> +++ b/drivers/net/ethernet/broadcom/b44.c
>> @@ -608,7 +608,7 @@ static void b44_tx(struct b44 *bp)
>>                                  skb->len,
>>                                  DMA_TO_DEVICE);
>>                 rp->skb = NULL;
>> -               dev_kfree_skb(skb);
>> +               dev_kfree_skb_irq(skb);
>
> I suspect the "right" way to fix this is to call dev_kfree_skb_any(skb);
> instead, since that will handle the in-interrupt case if that's where
> we're stuck.
>
> Can you try this patch (compile-tested only) and see if fixes the issue
> you're seeing:
>
> commit e36ef2c1a2b6b517ed43254eb89768794a049b1c
> Author: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
> Date:   Mon Nov 21 15:14:18 2011 -0800
>
> b44: Use dev_kfree_skb_any() in b44_tx()
>
> Reported issues when using dev_kfree_skb() on UP systems and
> systems with low numbers of cores.  dev_kfree_skb_any() will
> properly save IRQ state before freeing the skb, depending on
> how b44_tx() is invoked.
>
> Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
> ---
>
>  drivers/net/ethernet/broadcom/b44.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
>
>
> diff --git a/drivers/net/ethernet/broadcom/b44.c
> b/drivers/net/ethernet/broadcom/b44.c
> index 4cf835d..6a7c39b 100644
> --- a/drivers/net/ethernet/broadcom/b44.c
> +++ b/drivers/net/ethernet/broadcom/b44.c
> @@ -608,7 +608,7 @@ static void b44_tx(struct b44 *bp)
>                                 skb->len,
>                                 DMA_TO_DEVICE);
>                rp->skb = NULL;
> -               dev_kfree_skb(skb);
> +               dev_kfree_skb_any(skb);
>        }
>
>        bp->tx_cons = cons;
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>

^ permalink raw reply

* Re: 3.2-rc2+: Reported regressions from 3.0 and 3.1
From: Konrad Rzeszutek Wilk @ 2011-11-22 13:54 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Linux SCSI List, Florian Mickler, Network Development,
	Linux Wireless List, Linux Kernel Mailing List, DRI, Linux ACPI,
	Andrew Morton, Kernel Testers List, Linus Torvalds, Linux PM List,
	Maciej Rutecki
In-Reply-To: <201111212249.31196.rjw@sisk.pl>

> Subject    : Regression in 3.1 causes Xen to use wrong idle routine
> Submitter  : Stefan Bader <stefan.bader@canonical.com>
> Date       : 2011-10-26 10:24
> Message-ID : 4EA7DFD1.9060608@canonical.com
> References : http://marc.info/?l=linux-acpi&m=131962467924564&w=2

The patch mentioned in http://mid.gmane.org/20111115144004.GE22675@phenom.dumpdata.com 
should do it. But the patch needs an Ack from ACPI/x86 folks.

^ permalink raw reply

* Re: MPLS for Linux kernel
From: Igor Maravić @ 2011-11-22 13:55 UTC (permalink / raw)
  To: jorge; +Cc: netdev, davem
In-Reply-To: <4ECB95D7.7030702@dti2.net>

2011/11/22 Jorge Boncompte [DTI2] <jorge@dti2.net>:
>        You keep insisting in that you fixed a lot of things, but you have provided a
> git tree with just one big commit and say that have taken some of my patches on
> it, could you please provide patches on TOP of the sourceforge code for the
> things that are not fixed there?

I insist on that because I did do a lot of things. When I did send you
patch, on TOP of your net-next code, about the most important bug
(stack overflow that was happening when mpls nhlfe entry was built)
that I fixed it was just ignored. Unfortunately I started  fixing MPLS
code without git, and I added your patches manually.

> It seems to me that you have not noticed that while
> fixing bugs I have reworked a lot of code to make it cleaner or simpler, simply
> deleted it and fixed the style.

Yes I saw that

>        What's needs to be done, and it's on my TODO list...
>
>        The kernel code that is not commented out on the mpls-linux code when you build
> the kernel it the shim layer and it's not done on purpose. This code was written
> by James to be a generic feature of the networking layer. Now I am not sure that
> it has any value keeping it and am for removing it.

I didn't understand what did You want to say here.

>        The other thing that probably I am going to remove is the labelspace support. I
> don't see a use for it, and even Cisco doesn't implement it either that I know.

That's 15 min of work, but I think that labelspaces should stay.

>        Then we must rework the netlink interface to make it cleaner and extensible.

What do you mean by extensible? With my netlink code you can add,
change and delete ilm, nhlfe and xc entries without any problem. What
other could one wish for? As I could see in your code you can't change
ilm, nhlfe and xc entries.

>        Check the dst's usages, there has been a lot of changes in the core kernel here
> lately and I am not sure if we are using it correctly.

As far I could see you did a great job of using nhlfe as dst entry.
Problem was when you delete nhlfe entry that is referenced with ilm
entry. It shouldn't be allowed to be deleted, or the ilm should change
last instructions from fwd to peek. I did the last thing.
Also there was the problem with neighbor hh_cache, because we are
using nhlfe as dst_entry but I fixed that too.  (hh_cache wouldn't
change when we are sending packets with diferent type (ETH_P_IP and
ETH_P_MPLS_UC))
Also ilm shouldn't be dst_entry. I'm going to change that.

>        Check the locking and RCU usage.

There where few problems about that, that I fixed.

BR
Igor

^ permalink raw reply

* Re: skb->timestamp == 0.000000 ?
From: Felipe Dias @ 2011-11-22 14:04 UTC (permalink / raw)
  To: Benjamin Poirier; +Cc: netdev
In-Reply-To: <20111121211140.GB24753@d2.suse.de>

On Mon, Nov 21, 2011 at 6:11 PM, Benjamin Poirier
<benjamin.poirier@gmail.com> wrote:
> On 11/11/19 18:13, Felipe Dias wrote:
>> And time_skb.tv_sec and time_skb.tv_usec are equal to 0;
>>
>> I'm doing some wrong ?
>
> Timestamps have to be enabled
> have a look at:
> net_enable_timestamp()
>

who is responsible for enabling? Would not it be the driver?

Some packets (icmp) appear timestamp other packages do not appear.

I'm confused, in my mind, after the registration of the "protocol
handler" (with dev_add_pack) any skb should have the timestamp.

Sorry, but I'm confused.

Best regards,
Felipe.

^ permalink raw reply

* [PATCH net-next V2] tg3: Fix advertisement handling
From: Hiroaki SHIMODA @ 2011-11-22 14:05 UTC (permalink / raw)
  To: davem, mcarlson, mchan; +Cc: eric.dumazet, netdev

Commit 28011cf19b (net: Add ethtool to mii advertisment conversion
helpers) added a helper function ethtool_adv_to_mii_100bt() and
tg3_copper_is_advertising_all(), tg3_phy_autoneg_cfg() were
modified to use this.
Before that commit, ethtool to mii advertisement conversion was
done wrt speed, but now pause operation is also taken account.
So, in tg3_copper_is_advertising_all(), below condition becomes
true and this makes link up fails.

	if ((adv_reg & ADVERTISE_ALL) != all_mask)
		return 0;

To fix this add ADVERTISE_ALL bit and operation to cap speed,
and change default advertisement not including ADVERTISED_Pause.

Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Hiroaki SHIMODA <shimoda.hiroaki@gmail.com>
---
V2: change default advertisement not including ADVERTISED_Pause

 drivers/net/ethernet/broadcom/tg3.c |    7 +++----
 1 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/tg3.c b/drivers/net/ethernet/broadcom/tg3.c
index d9e9c8c..ff3c48a 100644
--- a/drivers/net/ethernet/broadcom/tg3.c
+++ b/drivers/net/ethernet/broadcom/tg3.c
@@ -3594,7 +3594,7 @@ static int tg3_phy_autoneg_cfg(struct tg3 *tp, u32 advertise, u32 flowctrl)
 	u32 val, new_adv;
 
 	new_adv = ADVERTISE_CSMA;
-	new_adv |= ethtool_adv_to_mii_adv_t(advertise);
+	new_adv |= ethtool_adv_to_mii_adv_t(advertise) & ADVERTISE_ALL;
 	new_adv |= tg3_advert_flowctrl_1000T(flowctrl);
 
 	err = tg3_writephy(tp, MII_ADVERTISE, new_adv);
@@ -3778,7 +3778,7 @@ static int tg3_copper_is_advertising_all(struct tg3 *tp, u32 mask)
 {
 	u32 adv_reg, all_mask = 0;
 
-	all_mask = ethtool_adv_to_mii_adv_t(mask);
+	all_mask = ethtool_adv_to_mii_adv_t(mask) & ADVERTISE_ALL;
 
 	if (tg3_readphy(tp, MII_ADVERTISE, &adv_reg))
 		return 0;
@@ -13199,8 +13199,7 @@ static u32 __devinit tg3_read_otp_phycfg(struct tg3 *tp)
 
 static void __devinit tg3_phy_init_link_config(struct tg3 *tp)
 {
-	u32 adv = ADVERTISED_Autoneg |
-		  ADVERTISED_Pause;
+	u32 adv = ADVERTISED_Autoneg;
 
 	if (!(tp->phy_flags & TG3_PHYFLG_10_100_ONLY))
 		adv |= ADVERTISED_1000baseT_Half |

^ permalink raw reply related

* Re: skb->timestamp == 0.000000 ?
From: Eric Dumazet @ 2011-11-22 14:10 UTC (permalink / raw)
  To: Felipe Dias; +Cc: Benjamin Poirier, netdev
In-Reply-To: <CAJX4=r2XBGGzMFgeKC-TvZ-WQ9skwoBFA5na_XiLNqDn0U9V+Q@mail.gmail.com>

Le mardi 22 novembre 2011 à 11:04 -0300, Felipe Dias a écrit :

> who is responsible for enabling? Would not it be the driver?
> 
> Some packets (icmp) appear timestamp other packages do not appear.
> 
> I'm confused, in my mind, after the registration of the "protocol
> handler" (with dev_add_pack) any skb should have the timestamp.
> 

This is not the trigger for timestamps.

You need to call setsockopt(SO_TIMESTAMP or SO_TIMESTAMPNS) on at least
one socket.

^ permalink raw reply

* Re: MPLS for Linux kernel
From: Jorge Boncompte [DTI2] @ 2011-11-22 14:33 UTC (permalink / raw)
  To: igorm; +Cc: netdev
In-Reply-To: <CAFdo_mXD7by23i38GZHePiHhESpcX66vamBUqHU-fQQFU1RcaA@mail.gmail.com>

El 22/11/2011 14:55, Igor Maravić escribió:
> 2011/11/22 Jorge Boncompte [DTI2] <jorge@dti2.net>:
>>        You keep insisting in that you fixed a lot of things, but you have provided a
>> git tree with just one big commit and say that have taken some of my patches on
>> it, could you please provide patches on TOP of the sourceforge code for the
>> things that are not fixed there?
> 
> I insist on that because I did do a lot of things. When I did send you
> patch, on TOP of your net-next code, about the most important bug
> (stack overflow that was happening when mpls nhlfe entry was built)
> that I fixed it was just ignored. Unfortunately I started  fixing MPLS
> code without git, and I added your patches manually.

	(Dropped Dave from the cc: list)

	But you failed to provide patches, that fix ONE thing at a time, and not a
patch with all the work that you have done, that's what you sent me for the
stack overflow bug or have done now with the git tree. Providing clean,
separated patches it's YOUR work if you WANT to see them applied on the
mpls-linux tree.

>>        The kernel code that is not commented out on the mpls-linux code when you build
>> the kernel it the shim layer and it's not done on purpose. This code was written
>> by James to be a generic feature of the networking layer. Now I am not sure that
>> it has any value keeping it and am for removing it.
> 
> I didn't understand what did You want to say here.

	You have #ifdefed the MPLS code in the core networking code, that's wrong,
that's not the way to go if you want to see the code merged upstream. The shim
layer was thought as a core component of the kernel. If we rip it what we come
at with should be #ifdef-less.

>>        The other thing that probably I am going to remove is the labelspace support. I
>> don't see a use for it, and even Cisco doesn't implement it either that I know.
> 
> That's 15 min of work, but I think that labelspaces should stay.

	Yes, I dit it an hour ago on a private branch. :) Why should did it stay?

>>        Then we must rework the netlink interface to make it cleaner and extensible.
> 
> What do you mean by extensible? With my netlink code you can add,
> change and delete ilm, nhlfe and xc entries without any problem. What
> other could one wish for? As I could see in your code you can't change
> ilm, nhlfe and xc entries.

	Yes, I did not finish the change code because no ones uses it currently
(iproute, quagga). In my opinion the instructions should be nested attributes,
and we have to change how the MPLS_CHANGE_* flags get passed, currently it's a hack.

> 
>>        Check the dst's usages, there has been a lot of changes in the core kernel here
>> lately and I am not sure if we are using it correctly.
> 
> As far I could see you did a great job of using nhlfe as dst entry.
> Problem was when you delete nhlfe entry that is referenced with ilm
> entry. It shouldn't be allowed to be deleted, or the ilm should change
> last instructions from fwd to peek. I did the last thing.

	If I delete an nhlfe I for sure don't want the traffic for be routed instead of
dropped. Instructions are policy set by the user and you shouldn't change them
under the hood.

> Also there was the problem with neighbor hh_cache, because we are
> using nhlfe as dst_entry but I fixed that too.  (hh_cache wouldn't
> change when we are sending packets with diferent type (ETH_P_IP and
> ETH_P_MPLS_UC))

	patch.

> Also ilm shouldn't be dst_entry. I'm going to change that.

	I agree 100%. I did forgot it.

	Regards,
		Jorge

^ permalink raw reply

* Re: MPLS for Linux kernel
From: Jorge Boncompte [DTI2] @ 2011-11-22 14:35 UTC (permalink / raw)
  To: netdev
In-Reply-To: <4ECBB2C4.9000505@dti2.net>

El 22/11/2011 15:33, Jorge Boncompte [DTI2] escribió:
> 
> 	(Dropped Dave from the cc: list)

	That should have been David, sorry.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox