Netdev List


* Re: [Bugme-new] [Bug 38102] New: BUG kmalloc-2048: Poison overwritten
From: Eric Dumazet @ 2011-07-04 15:12 UTC (permalink / raw)
  To: Michael Büsch
  Cc: Alexey Zaytsev, Andrew Morton, netdev, Gary Zambrano,
	bugme-daemon, David S. Miller, Pekka Pietikainen,
	Florian Schirmer, Felix Fietkau, Michael Buesch
In-Reply-To: <20110704164351.338dc12e@maggie>

On Monday, 04 July 2011 at 16:43 +0200, Michael Büsch wrote:
> On Mon, 4 Jul 2011 16:27:26 +0200
> Michael Büsch <m@bues.ch> wrote:
> > We do this in b43, which has exactly the same DMA engine.
> 
> (Ok, it turns out we don't do this in b43 (We only do it on the TX side).
>  But that's a bug. We should do a wmb() on the RX side before advancing the
>  descriptor ring pointer.)

I am wondering what happens if the RX ring size is set to 64 and we receive
exactly 64 buffers in one round: B44_DMARX_PTR won't change at all?
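
To make the wrap-around concern concrete, here is a minimal sketch, assuming the consumer index advances modulo the ring size and the register holds the index times the descriptor size. `ring_advance`, `ring_ptr_value` and `DESC_SIZE` are illustrative names and values, not taken from b44.c:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical descriptor size in bytes; the driver uses
 * sizeof(struct dma_desc) here. */
#define DESC_SIZE 8u

/* Advance a ring index by n slots, wrapping at ring_size. */
static uint32_t ring_advance(uint32_t idx, uint32_t n, uint32_t ring_size)
{
	assert(ring_size > 0);
	return (idx + n) % ring_size;
}

/* Byte offset that would be written to the ring pointer register. */
static uint32_t ring_ptr_value(uint32_t idx)
{
	return idx * DESC_SIZE;
}
```

With a ring of 64 slots, consuming exactly 64 buffers brings the index back to where it started, so the value written to the register is identical to the previous one. Whether the hardware treats that as "ring full" or "nothing new" is the open question here.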

Alexey, could you try this patch, please?

diff --git a/drivers/net/b44.c b/drivers/net/b44.c
index a69331e..51072a3 100644
--- a/drivers/net/b44.c
+++ b/drivers/net/b44.c
@@ -829,6 +829,7 @@ static int b44_rx(struct b44 *bp, int budget)
 	}
 
 	bp->rx_cons = cons;
+	wmb();
 	bw32(bp, B44_DMARX_PTR, cons * sizeof(struct dma_desc));
 
 	return received;




* divide error: 0000, in bictcp_cong_avoid, kernel 2.6.39
From: TB @ 2011-07-04 14:23 UTC (permalink / raw)
  To: netdev

[1819042.176427] divide error: 0000 [#1] SMP

[1819042.176462] last sysfs file:
/sys/devices/pci0000:00/0000:00:1f.2/host6/scsi_host/host6/proc_name
[1819042.176511] CPU 0

[1819042.176518] Modules linked in:
 i2c_i801
 i2c_core
 evdev
 button
 [last unloaded: scsi_wait_scan]

[1819042.176600]
[1819042.176621] Pid: 14810, comm: nginx Not tainted 2.6.39 #1
 Supermicro X8DT3
/X8DT3

[1819042.176676] RIP: 0010:[<ffffffff81516499>]
 [<ffffffff81516499>] bictcp_cong_avoid+0x281/0x2bc
[1819042.176731] RSP: 0018:ffff88043fc03a50  EFLAGS: 00010246
[1819042.176758] RAX: 0000000000000000 RBX: ffff88001f84fa90 RCX:
0000000000000000
[1819042.176803] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
000000000000c170
[1819042.176847] RBP: 0000000000000025 R08: 00000000000000b5 R09:
00000000000068f2
[1819042.176892] R10: ffff88001e914c00 R11: 0000000000007f0a R12:
ffff88001f84f6c0
[1819042.176936] R13: 000000011b24cfc6 R14: 0000000000010015 R15:
000000000000003d
[1819042.176980] FS: 00007fa411dd4700(0000) GS:ffff88043fc00000(0000)
knlGS:0000000000000000
[1819042.177027] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[1819042.177055] CR2: 00000000058f8f76 CR3: 000000011c252000 CR4:
00000000000006f0
[1819042.177099] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[1819042.177143] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
0000000000000400
[1819042.177188] Process nginx (pid: 14810, threadinfo ffff8806ce9a0000,
task ffff88042f42f620)
[1819042.177234] Stack:
[1819042.177255]  ffff88001f84f9d0
 ffffffff81801620
 0000000000000282
 ffffffff81038dae

[1819042.177311]  0000000000000b0c
 ffffffff81054aa5
 ffff8800864cddc0
 ffff88001f84f6c0

[1819042.177368]  ffffffff81801620
 0000000000000406
 ffff88001f84f6c0
 ffffffff81801620

[1819042.177424] Call Trace:
[1819042.177446]  <IRQ>

[1819042.177474]  [<ffffffff81038dae>] ? ns_to_timeval+0x9/0x27
[1819042.177503]  [<ffffffff81054aa5>] ? getnstimeofday+0x55/0xaf
[1819042.177534]  [<ffffffff814ed04d>] ? tcp_ack+0x18b5/0x1a89
[1819042.177563]  [<ffffffff814ed84a>] ? tcp_rcv_established+0xd1/0xa13
[1819042.177593]  [<ffffffff814f5a5f>] ? tcp_v4_do_rcv+0x1b2/0x382
[1819042.177623]  [<ffffffff814d2774>] ? nf_iterate+0x40/0x78
[1819042.177651]  [<ffffffff814f60bc>] ? tcp_v4_rcv+0x48d/0x7a6
[1819042.177680]  [<ffffffff814da2d6>] ? ip_local_deliver_finish+0xae/0x13f
[1819042.177711]  [<ffffffff814b6179>] ? __netif_receive_skb+0x33a/0x369
[1819042.177741]  [<ffffffff814b7a3d>] ? netif_receive_skb+0x67/0x6d
[1819042.177771]  [<ffffffff814b7fb5>] ? napi_gro_receive+0x9d/0xab
[1819042.177800]  [<ffffffff814b7b12>] ? napi_skb_finish+0x1c/0x31
[1819042.177829]  [<ffffffff813edd3c>] ? igb_poll+0x7d5/0xb2e
[1819042.177859]  [<ffffffff81040c8c>] ? lock_timer_base+0x26/0x4c
[1819042.177888]  [<ffffffff814b80f2>] ? net_rx_action+0xa7/0x212
[1819042.177924]  [<ffffffff8103a43b>] ? __do_softirq+0xbe/0x184
[1819042.177954]  [<ffffffff8156d18c>] ? call_softirq+0x1c/0x30
[1819042.177983]  [<ffffffff81003ec9>] ? do_softirq+0x31/0x63
[1819042.178011]  [<ffffffff8103a1bc>] ? irq_exit+0x3f/0x9e
[1819042.178039]  [<ffffffff810037df>] ? do_IRQ+0x98/0xae
[1819042.178068]  [<ffffffff8156ba13>] ? common_interrupt+0x13/0x13
[1819042.178095]  <EOI>

[1819042.178122]  [<ffffffff812c07b2>] ? blk_finish_plug+0xb/0x2a
[1819042.178151]  [<ffffffff812d1e0d>] ? copy_user_generic_string+0x2d/0x40
[1819042.178184]  [<ffffffff8108785a>] ? file_read_actor+0xb9/0x136
[1819042.178213]  [<ffffffff81089243>] ? generic_file_aio_read+0x3a3/0x606
[1819042.178246]  [<ffffffff814a7f05>] ? sock_alloc_inode+0xaa/0xaa
[1819042.178278]  [<ffffffff81239eaa>] ? xfs_file_aio_read+0x219/0x26d
[1819042.178312]  [<ffffffff810c6bda>] ? do_sync_read+0xb0/0xf2
[1819042.178344]  [<ffffffff810c7072>] ? do_readv_writev+0x15f/0x174
[1819042.178377]  [<ffffffff810c7598>] ? vfs_read+0xaa/0x12e
[1819042.178405]  [<ffffffff810c7673>] ? sys_pread64+0x57/0x77
[1819042.178434]  [<ffffffff8156c03b>] ? system_call_fastpath+0x16/0x1b
[1819042.178463] Code:
29 e9 31 d2 89 e8 f7 f1 41 39 84 24 d0 03 00 00 76 08 41 89 84 24 d0 03
00 00
41 8b 84 24 d0 03 00 00 31 d2 c1 e0 04 0f b7 4b 2c f7>
f1 ba 01 00 00 00 85 c0 0f 45 d0 41 89 94 24 d0 03 00 00 41

[1819042.178736] RIP
 [<ffffffff81516499>] bictcp_cong_avoid+0x281/0x2bc
[1819042.178769]  RSP <ffff88043fc03a50>
[1819042.179048] ---[ end trace ebccdce72afe641d ]---
[1819042.179106] Kernel panic - not syncing: Fatal exception in interrupt
[1819042.179166] Pid: 14810, comm: nginx Tainted: G      D
2.6.39bakhoss #1
[1819042.179226] Call Trace:
[1819042.179279]  <IRQ>
 [<ffffffff81569594>] ? panic+0x9d/0x1a0
[1819042.179374]  [<ffffffff81004f80>] ? oops_end+0x61/0xac
[1819042.179432]  [<ffffffff8156ba13>] ? common_interrupt+0x13/0x13
[1819042.179497]  [<ffffffff810353da>] ? kmsg_dump+0x46/0xec
[1819042.179555]  [<ffffffff81004fbe>] ? oops_end+0x9f/0xac
[1819042.179613]  [<ffffffff8100313c>] ? do_divide_error+0x7f/0x89
[1819042.179672]  [<ffffffff81516499>] ? bictcp_cong_avoid+0x281/0x2bc
[1819042.179732]  [<ffffffff814b8a67>] ? dev_hard_start_xmit+0x3fc/0x581
[1819042.179793]  [<ffffffff814dc310>] ? ip_options_build+0x149/0x149
[1819042.179852]  [<ffffffff8156ceb5>] ? divide_error+0x15/0x20
[1819042.179911]  [<ffffffff81516499>] ? bictcp_cong_avoid+0x281/0x2bc
[1819042.179971]  [<ffffffff81038dae>] ? ns_to_timeval+0x9/0x27
[1819042.180030]  [<ffffffff81054aa5>] ? getnstimeofday+0x55/0xaf
[1819042.180089]  [<ffffffff814ed04d>] ? tcp_ack+0x18b5/0x1a89
[1819042.180150]  [<ffffffff814ed84a>] ? tcp_rcv_established+0xd1/0xa13
[1819042.180210]  [<ffffffff814f5a5f>] ? tcp_v4_do_rcv+0x1b2/0x382
[1819042.180270]  [<ffffffff814d2774>] ? nf_iterate+0x40/0x78
[1819042.180328]  [<ffffffff814f60bc>] ? tcp_v4_rcv+0x48d/0x7a6
[1819042.180394]  [<ffffffff814da2d6>] ? ip_local_deliver_finish+0xae/0x13f
[1819042.180457]  [<ffffffff814b6179>] ? __netif_receive_skb+0x33a/0x369
[1819042.180518]  [<ffffffff814b7a3d>] ? netif_receive_skb+0x67/0x6d
[1819042.180581]  [<ffffffff814b7fb5>] ? napi_gro_receive+0x9d/0xab
[1819042.180646]  [<ffffffff814b7b12>] ? napi_skb_finish+0x1c/0x31
[1819042.180706]  [<ffffffff813edd3c>] ? igb_poll+0x7d5/0xb2e
[1819042.180765]  [<ffffffff81040c8c>] ? lock_timer_base+0x26/0x4c
[1819042.180825]  [<ffffffff814b80f2>] ? net_rx_action+0xa7/0x212
[1819042.180884]  [<ffffffff8103a43b>] ? __do_softirq+0xbe/0x184
[1819042.180944]  [<ffffffff8156d18c>] ? call_softirq+0x1c/0x30
[1819042.181003]  [<ffffffff81003ec9>] ? do_softirq+0x31/0x63
[1819042.181061]  [<ffffffff8103a1bc>] ? irq_exit+0x3f/0x9e
[1819042.181118]  [<ffffffff810037df>] ? do_IRQ+0x98/0xae
[1819042.181176]  [<ffffffff8156ba13>] ? common_interrupt+0x13/0x13
[1819042.181235]  <EOI>
 [<ffffffff812c07b2>] ? blk_finish_plug+0xb/0x2a
[1819042.181330]  [<ffffffff812d1e0d>] ? copy_user_generic_string+0x2d/0x40
[1819042.181397]  [<ffffffff8108785a>] ? file_read_actor+0xb9/0x136
[1819042.181456]  [<ffffffff81089243>] ? generic_file_aio_read+0x3a3/0x606
[1819042.181517]  [<ffffffff814a7f05>] ? sock_alloc_inode+0xaa/0xaa
[1819042.181577]  [<ffffffff81239eaa>] ? xfs_file_aio_read+0x219/0x26d
[1819042.181637]  [<ffffffff810c6bda>] ? do_sync_read+0xb0/0xf2
[1819042.181696]  [<ffffffff810c7072>] ? do_readv_writev+0x15f/0x174
[1819042.181755]  [<ffffffff810c7598>] ? vfs_read+0xaa/0x12e
[1819042.181813]  [<ffffffff810c7673>] ? sys_pread64+0x57/0x77
[1819042.181873]  [<ffffffff8156c03b>] ? system_call_fastpath+0x16/0x1b


* Re: [PATCH 2/2] Update documented default values for various TCP/UDP tunables
From: Neil Horman @ 2011-07-04 14:58 UTC (permalink / raw)
  To: Max Matveev; +Cc: linux-sctp, netdev
In-Reply-To: <20110704083616.43F2C8156C57@regina.usersys.redhat.com>

On Wed, Jun 22, 2011 at 05:18:13PM +1000, Max Matveev wrote:
> tcp_rmem and tcp_wmem use 1 page as default value for the minimum
> amount of memory to be used, same as udp_wmem_min and udp_rmem_min.
> Pages are different size on different architectures - use the right
> units when describing the defaults.
> 
> Signed-off-by: Max Matveev <makc@redhat.com>
> ---
>  Documentation/networking/ip-sysctl.txt |    8 ++++----
>  1 files changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
> index f37d374..387ee53 100644
> --- a/Documentation/networking/ip-sysctl.txt
> +++ b/Documentation/networking/ip-sysctl.txt
> @@ -394,7 +394,7 @@ tcp_rmem - vector of 3 INTEGERs: min, default, max
>  	min: Minimal size of receive buffer used by TCP sockets.
>  	It is guaranteed to each TCP socket, even under moderate memory
>  	pressure.
> -	Default: 8K
> +	Default: 1 page
>  
>  	default: initial size of receive buffer used by TCP sockets.
>  	This value overrides net.core.rmem_default used by other protocols.
> @@ -483,7 +483,7 @@ tcp_window_scaling - BOOLEAN
>  tcp_wmem - vector of 3 INTEGERs: min, default, max
>  	min: Amount of memory reserved for send buffers for TCP sockets.
>  	Each TCP socket has rights to use it due to fact of its birth.
> -	Default: 4K
> +	Default: 1 page
>  
>  	default: initial size of send buffer used by TCP sockets.  This
>  	value overrides net.core.wmem_default used by other protocols.
> @@ -553,13 +553,13 @@ udp_rmem_min - INTEGER
>  	Minimal size of receive buffer used by UDP sockets in moderation.
>  	Each UDP socket is able to use the size for receiving data, even if
>  	total pages of UDP sockets exceed udp_mem pressure. The unit is byte.
> -	Default: 4096
> +	Default: 1 page
>  
>  udp_wmem_min - INTEGER
>  	Minimal size of send buffer used by UDP sockets in moderation.
>  	Each UDP socket is able to use the size for sending data, even if
>  	total pages of UDP sockets exceed udp_mem pressure. The unit is byte.
> -	Default: 4096
> +	Default: 1 page
>  
Acked-by: Neil Horman <nhorman@tuxdriver.com>
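
For reference, the page size that the new wording refers to can be queried from userspace. This is a minimal sketch, assuming a POSIX system; `page_size_bytes` and `min_buffer_bytes` are illustrative helpers, not kernel functions:

```c
#include <assert.h>
#include <unistd.h>

/* System page size in bytes, or -1 if it cannot be determined. */
static long page_size_bytes(void)
{
	return sysconf(_SC_PAGESIZE);
}

/* Size in bytes of a minimum expressed in pages ("1 page" in the docs). */
static long min_buffer_bytes(long pages)
{
	assert(pages >= 0);
	return pages * page_size_bytes();
}
```

On an architecture with 4K pages, `min_buffer_bytes(1)` gives 4096; on one with 64K pages it gives 65536, which is why a fixed "4K" or "8K" in the documentation cannot be right everywhere.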



* Re: [PATCH 1/2] Update description of net.sctp.sctp_rmem and net.sctp.sctp_wmem tunables
From: Neil Horman @ 2011-07-04 14:54 UTC (permalink / raw)
  To: Max Matveev; +Cc: linux-sctp, netdev
In-Reply-To: <20110704083605.AF9C28156C57@regina.usersys.redhat.com>

On Mon, Jun 20, 2011 at 06:08:10PM +1000, Max Matveev wrote:
> sctp does not use second and third ("default" and "max") values
> of sctp_(r|w)mem tunables. The format is the same and tcp_(r|w)mem
> but the meaning is different so make the documentation explicit to
> avoid confusion.
> 
> Signed-off-by: Max Matveev <makc@redhat.com>
> ---
>  Documentation/networking/ip-sysctl.txt |   11 +++++++++--
>  1 files changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
> index d3d653a..f37d374 100644
> --- a/Documentation/networking/ip-sysctl.txt
> +++ b/Documentation/networking/ip-sysctl.txt
> @@ -1465,10 +1465,17 @@ sctp_mem - vector of 3 INTEGERs: min, pressure, max
>  	Default is calculated at boot time from amount of available memory.
>  
>  sctp_rmem - vector of 3 INTEGERs: min, default, max
> -	See tcp_rmem for a description.
> +	Only the first value ("min") is used, "default" and "max" are
> +	ignored and may be removed in the future versions.
> +
It's accurate to say that only the first value is used currently, but because of
the way this sysctl is constructed (it's used by the sysctl_rmem pointer in the
sctp_prot struct, which expects an array of three integers in the common
__sk_mem_schedule function), we won't be removing the other two values.  Drop
that bit and it's an ack from me.
Neil
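
The layout constraint described above can be pictured with a small sketch. It assumes only what is stated here: a protocol descriptor carries a pointer to three integers and shared code reads the first one. `struct proto_sketch` and the helper names are hypothetical stand-ins, not the kernel's definitions:

```c
#include <assert.h>
#include <stddef.h>

/* Positions in the {min, default, max} vector. */
enum { RMEM_MIN, RMEM_DEFAULT, RMEM_MAX, RMEM_COUNT };

/* Hypothetical, simplified stand-in for a protocol descriptor. */
struct proto_sketch {
	const int *sysctl_rmem; /* must point at RMEM_COUNT ints */
};

/* The only element this sketch ever reads is the minimum. */
static int rmem_guaranteed(const struct proto_sketch *p)
{
	assert(p != NULL && p->sysctl_rmem != NULL);
	return p->sysctl_rmem[RMEM_MIN];
}

/* Nonzero if a socket is still below its guaranteed allocation. */
static int rmem_below_min(const struct proto_sketch *p, int allocated)
{
	return allocated < rmem_guaranteed(p);
}
```

Because the pointer type is shared with other protocols, the vector keeps all three slots even if a given protocol only consults the first.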

> +	min: Minimal size of receive buffer used by SCTP socket.
> +	It is guaranteed to each STCP socket (but not association) even 
> +	under moderate memory pressure.
> +
> +	Default: 1 page
>  
>  sctp_wmem  - vector of 3 INTEGERs: min, default, max
> -	See tcp_wmem for a description.
> +	Currently this tunable has no effect.
>  
>  addr_scope_policy - INTEGER
>  	Control IPv4 address scoping - draft-stewart-tsvwg-sctp-ipv4-00
> -- 
> 1.7.3.3
> 


* Re: [Bugme-new] [Bug 38102] New: BUG kmalloc-2048: Poison overwritten
From: Eric Dumazet @ 2011-07-04 14:53 UTC (permalink / raw)
  To: Michael Büsch
  Cc: Alexey Zaytsev, Andrew Morton, netdev, Gary Zambrano,
	bugme-daemon, David S. Miller, Pekka Pietikainen,
	Florian Schirmer, Felix Fietkau, Michael Buesch
In-Reply-To: <20110704164351.338dc12e@maggie>

On Monday, 04 July 2011 at 16:43 +0200, Michael Büsch wrote:
> On Mon, 4 Jul 2011 16:27:26 +0200
> Michael Büsch <m@bues.ch> wrote:
> > We do this in b43, which has exactly the same DMA engine.
> 
> (Ok, it turns out we don't do this in b43 (We only do it on the TX side).
>  But that's a bug. We should do a wmb() on the RX side before advancing the
>  descriptor ring pointer.)


Also, it appears rx_ring (or tx_ring) is not necessarily 4K aligned if the
kzalloc() path is taken.

(SL?B DEBUG -> kzalloc(4096) might not be 4K aligned)
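
A userspace sketch of the alignment point, assuming C11 `aligned_alloc()` as a stand-in for an allocator that guarantees alignment (a general-purpose allocator makes no such promise for a 4096-byte request). `is_aligned` and `alloc_ring_aligned` are illustrative names:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define RING_ALIGN 4096u

/* Nonzero if p is a multiple of align (align must be a power of two). */
static int is_aligned(const void *p, uintptr_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return ((uintptr_t)p & (align - 1)) == 0;
}

/* Allocate a zeroed, RING_ALIGN-aligned buffer; release it with free(). */
static void *alloc_ring_aligned(size_t bytes)
{
	/* aligned_alloc() wants a size that is a multiple of the alignment. */
	size_t rounded = (bytes + RING_ALIGN - 1) & ~(size_t)(RING_ALIGN - 1);
	void *p = aligned_alloc(RING_ALIGN, rounded);

	if (p != NULL)
		memset(p, 0, rounded);
	return p;
}
```

A check like `is_aligned(ring, 4096)` right after allocation would show whether the debug allocator is handing back a misaligned ring.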




* Re: [Bugme-new] [Bug 38102] New: BUG kmalloc-2048: Poison overwritten
From: Michael Büsch @ 2011-07-04 14:51 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexey Zaytsev, Andrew Morton, netdev, Gary Zambrano,
	bugme-daemon, David S. Miller, Pekka Pietikainen,
	Florian Schirmer, Felix Fietkau, Michael Buesch
In-Reply-To: <1309790705.2247.22.camel@edumazet-HP-Compaq-6005-Pro-SFF-PC>

On Mon, 04 Jul 2011 16:45:05 +0200
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On Monday, 04 July 2011 at 16:31 +0200, Michael Büsch wrote:
> > On Mon, 04 Jul 2011 16:00:49 +0200
> > Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > > > And an other question. Why so we have the logic to work-around the 1Gb
> > > > DMA limit instead of just setting the dma mask?
> > > 
> > > Your problem is in RX side : NIC actually writes to a buffer that is
> > > supposedly not its property.
> > 
> > The problem is on both sides, because some Linux architectures simply
> > do not support any DMA mask less than 32. This applied to i386 (IA32) last
> > time I looked.
> > The b44 DMA engine can only address 30-bits.
> 
> 
> Michael, traces provided by Alexey are in the RX path.
> 
> NIC does a DMA  (Receives an UDP frame) into a 2048 bytes buffers that
> was freed.

Yeah sure. That's obvious from the logs.
By "the problem" I meant "the 30bit limitation".


* Re: [Bugme-new] [Bug 38102] New: BUG kmalloc-2048: Poison overwritten
From: Eric Dumazet @ 2011-07-04 14:45 UTC (permalink / raw)
  To: Michael Büsch
  Cc: Alexey Zaytsev, Andrew Morton, netdev, Gary Zambrano,
	bugme-daemon, David S. Miller, Pekka Pietikainen,
	Florian Schirmer, Felix Fietkau, Michael Buesch
In-Reply-To: <20110704163134.7f2180b2@maggie>

On Monday, 04 July 2011 at 16:31 +0200, Michael Büsch wrote:
> On Mon, 04 Jul 2011 16:00:49 +0200
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > > And an other question. Why so we have the logic to work-around the 1Gb
> > > DMA limit instead of just setting the dma mask?
> > 
> > Your problem is in RX side : NIC actually writes to a buffer that is
> > supposedly not its property.
> 
> The problem is on both sides, because some Linux architectures simply
> do not support any DMA mask less than 32. This applied to i386 (IA32) last
> time I looked.
> The b44 DMA engine can only address 30-bits.


Michael, traces provided by Alexey are in the RX path.

The NIC does a DMA (it receives a UDP frame) into a 2048-byte buffer that
was already freed.





* Re: [Bugme-new] [Bug 38102] New: BUG kmalloc-2048: Poison overwritten
From: Michael Büsch @ 2011-07-04 14:43 UTC (permalink / raw)
  To: Michael Büsch
  Cc: Eric Dumazet, Alexey Zaytsev, Andrew Morton, netdev,
	Gary Zambrano, bugme-daemon, David S. Miller, Pekka Pietikainen,
	Florian Schirmer, Felix Fietkau, Michael Buesch
In-Reply-To: <20110704162726.4072e715@maggie>

On Mon, 4 Jul 2011 16:27:26 +0200
Michael Büsch <m@bues.ch> wrote:
> We do this in b43, which has exactly the same DMA engine.

(Ok, it turns out we don't do this in b43 (We only do it on the TX side).
 But that's a bug. We should do a wmb() on the RX side before advancing the
 descriptor ring pointer.)


* Re: [Bugme-new] [Bug 38102] New: BUG kmalloc-2048: Poison overwritten
From: Michael Büsch @ 2011-07-04 14:31 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexey Zaytsev, Andrew Morton, netdev, Gary Zambrano,
	bugme-daemon, David S. Miller, Pekka Pietikainen,
	Florian Schirmer, Felix Fietkau, Michael Buesch
In-Reply-To: <1309788049.2247.9.camel@edumazet-HP-Compaq-6005-Pro-SFF-PC>

On Mon, 04 Jul 2011 16:00:49 +0200
Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > And an other question. Why so we have the logic to work-around the 1Gb
> > DMA limit instead of just setting the dma mask?
> 
> Your problem is in RX side : NIC actually writes to a buffer that is
> supposedly not its property.

The problem is on both sides, because some Linux architectures simply
do not support any DMA mask of less than 32 bits. This applied to i386 (IA32)
the last time I looked.
The b44 DMA engine can only address 30 bits.
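
As an illustration of what a 30-bit limit means for a buffer, here is a sketch of a range check. It is not the driver's code; `DMA_30BIT_MASK` and `dma_range_ok` are names made up for this example:

```c
#include <assert.h>
#include <stdint.h>

/* Highest address reachable with 30 address bits (1 GiB - 1). */
#define DMA_30BIT_MASK ((UINT64_C(1) << 30) - 1)

/* Nonzero if every byte of [addr, addr + len) is reachable under mask. */
static int dma_range_ok(uint64_t addr, uint64_t len, uint64_t mask)
{
	assert(mask != 0);
	if (addr > mask)
		return 0;
	if (len == 0)
		return 1;
	/* Compare without computing addr + len, which could overflow. */
	return len - 1 <= mask - addr;
}
```

A 2048-byte buffer ending exactly at the 1 GiB boundary passes; one that starts 1024 bytes below the boundary does not, and would need to be bounced into low memory.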


* Re: [Bugme-new] [Bug 38102] New: BUG kmalloc-2048: Poison overwritten
From: Michael Büsch @ 2011-07-04 14:27 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexey Zaytsev, Andrew Morton, netdev, Gary Zambrano,
	bugme-daemon, David S. Miller, Pekka Pietikainen,
	Florian Schirmer, Felix Fietkau, Michael Buesch
In-Reply-To: <1309787822.2247.6.camel@edumazet-HP-Compaq-6005-Pro-SFF-PC>

On Mon, 04 Jul 2011 15:57:02 +0200
Eric Dumazet <eric.dumazet@gmail.com> wrote:
> I dont have the b44 specs, but :

It uses the 30-bit version of the Broadcom HND engine, for which complete
specs are available here:
http://bcm-v4.sipsolutions.net/802.11/DMA

> For sure, addr should be set before ctl, just in case ctl allows chip to
> start a dma transfert (to previous packet), because a OWN bit is unset
> for example...

Certainly not.
The device does not know about the buffer until this line:
 bw32(bp, B44_DMARX_PTR, cons * sizeof(struct dma_desc));
which advances the DMA descriptor pointer.
However, I do think we probably need a wmb() right before that bw32() line, to make sure
memory is committed before we tell the device it's OK to access it.
We do this in b43, which has exactly the same DMA engine.
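
The ordering described above can be sketched in userspace with C11 atomics, where a release fence plays the role that wmb() plays in the driver. This is an analogy under that assumption, not kernel code; `struct ring`, `ring_publish` and the `doorbell` field (standing in for the B44_DMARX_PTR register) are hypothetical:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define RING_SIZE 4u

struct desc {
	uint32_t ctrl;
	uint32_t addr;
};

struct ring {
	struct desc slots[RING_SIZE];
	_Atomic uint32_t doorbell; /* stands in for the ring pointer register */
};

/* Fill a descriptor, then tell the consumer about it. */
static void ring_publish(struct ring *r, uint32_t idx,
			 uint32_t addr, uint32_t ctrl)
{
	assert(idx < RING_SIZE);

	/* 1. Write the descriptor contents to memory. */
	r->slots[idx].addr = addr;
	r->slots[idx].ctrl = ctrl;

	/* 2. Order the descriptor stores before the doorbell store. */
	atomic_thread_fence(memory_order_release);

	/* 3. Only now advance the pointer the consumer looks at. */
	atomic_store_explicit(&r->doorbell,
			      idx * (uint32_t)sizeof(struct desc),
			      memory_order_relaxed);
}
```

Without step 2, the consumer could observe the new pointer value before the descriptor contents are visible, which is the window the proposed wmb() is meant to close.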


* [PATCH] fix WOL on 2nd port on i350
From: Martin Wilck @ 2011-07-04 14:17 UTC (permalink / raw)
  To: alexander.h.duyck, jeffrey.t.kirsher, netdev, e1000-devel; +Cc: Uhe, Rudolf

[-- Attachment #1: Type: text/plain, Size: 538 bytes --]

WOL fails on the second port of an i350 network adapter with the latest
upstream kernel driver. It works with the sf.net driver 3.0.19. The
following patch seems to be missing upstream.
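
The attached one-line change swaps an equality test for an ordered comparison. A sketch of why that matters, assuming the MAC type enumeration lists newer parts after older ones; the enumerator names below are placeholders, not the driver's:

```c
#include <assert.h>

/* Hypothetical MAC type list, ordered oldest to newest. */
enum mac_type_sketch { MAC_OLDER, MAC_82580_LIKE, MAC_I350_LIKE };

/* The ordered comparison only works if this holds. */
static_assert(MAC_I350_LIKE > MAC_82580_LIKE,
	      "newer MAC types must sort after older ones");

/* Equality test: matches exactly one MAC type. */
static int needs_port_offset_old(enum mac_type_sketch t)
{
	return t == MAC_82580_LIKE;
}

/* Ordered test: matches that MAC type and everything newer. */
static int needs_port_offset_new(enum mac_type_sketch t)
{
	return t >= MAC_82580_LIKE;
}
```

With the equality test, a newer part falls through to the default branch and its second port reads the wrong EEPROM word, which would be consistent with the WOL failure reported here.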

Regards,
Martin

-- 
Dr. Martin Wilck
PRIMERGY System Software Engineer
x86 Server Engineering

FUJITSU
Fujitsu Technology Solutions GmbH
Heinz-Nixdorf-Ring 1
33106 Paderborn, Germany
Phone:			++49 5251 525 2796
Fax:			++49 5251 525 2820
Email:			martin.wilck@ts.fujitsu.com
Internet:		http://ts.fujitsu.com
Company Details:	http://ts.fujitsu.com/imprint


[-- Attachment #2: igb_i350_wol.diff --]
[-- Type: text/plain, Size: 539 bytes --]

--- linus/drivers/net/igb/igb_main.c	2011-06-13 00:17:28.000000000 +0200
+++ linus/drivers/net/igb/igb_main.c.new	2011-07-04 14:24:00.000000000 +0200
@@ -1985,7 +1985,7 @@ static int __devinit igb_probe(struct pc
 
 	if (hw->bus.func == 0)
 		hw->nvm.ops.read(hw, NVM_INIT_CONTROL3_PORT_A, 1, &eeprom_data);
-	else if (hw->mac.type == e1000_82580)
+	else if (hw->mac.type >= e1000_82580)
 		hw->nvm.ops.read(hw, NVM_INIT_CONTROL3_PORT_A +
 		                 NVM_82580_LAN_FUNC_OFFSET(hw->bus.func), 1,
 		                 &eeprom_data);



* Re: libpcap and tc filters
From: Adam Katz @ 2011-07-04 14:16 UTC (permalink / raw)
  To: jhs; +Cc: netdev
In-Reply-To: <1309788416.26180.63.camel@mojatatu>

Thanks a lot.
I can tell you I'm not the first one to have this problem, but it
doesn't seem to be common... but that's probably because people
usually don't try to shape traffic sent using libpcap.

I found this post from 2003 on lartc with the exact same problem but
with no replies:
http://mailman.ds9a.nl/pipermail/lartc/2003q4/011004.html


On Mon, Jul 4, 2011 at 5:06 PM, jamal <hadi@cyberus.ca> wrote:
> On Mon, 2011-07-04 at 16:24 +0300, Adam Katz wrote:
>> ok, I checked now and the packets sent by tcpreplay are identical to
>> the ones captured originally by wireshark.
>
> Ok - thanks for removing that variable.
>
>> I'm using the stock ubuntu 10.04 kernel that wasn't compiled with
>> CONFIG_CLS_U32_PERF so sudo tc -s filter ls dev eth1 shows nothing
>> useful (and i'm not sure that recompiling the entire kernel is worth
>> it to tell me what I already know - that these packets missed the
>> filters... but i'm willing to do it if you think that'll help).
>
> Not necessary as long as you can tell where the packets end up.
>
>> Anyway, I suspect the problem to be something else - I suspect that
>> the packets sent using tcpreplay simply skip the filters in the kernel
>> and are being injected somewhere afterwards. But this theory is
>> problematic since I find it strange that the packets do end up in the
>> default queue after all - hence they ARE seen by tc and they don't
>> skip tc entirely.
>
> I am not sure off top of my head why that would happen. I will try later
> to install tcpreplay and reproduce your test.
>
> cheers,
> jamal
>
>


* Re: libpcap and tc filters
From: jamal @ 2011-07-04 14:06 UTC (permalink / raw)
  To: Adam Katz; +Cc: netdev
In-Reply-To: <CAA0qwj4=bOU3LDsRk591AtBWx2c_6uFwk0M4-AGrnxCPKjbbrw@mail.gmail.com>

On Mon, 2011-07-04 at 16:24 +0300, Adam Katz wrote:
> ok, I checked now and the packets sent by tcpreplay are identical to
> the ones captured originally by wireshark.

Ok - thanks for removing that variable.

> I'm using the stock ubuntu 10.04 kernel that wasn't compiled with
> CONFIG_CLS_U32_PERF so sudo tc -s filter ls dev eth1 shows nothing
> useful (and i'm not sure that recompiling the entire kernel is worth
> it to tell me what I already know - that these packets missed the
> filters... but i'm willing to do it if you think that'll help).

Not necessary as long as you can tell where the packets end up.

> Anyway, I suspect the problem to be something else - I suspect that
> the packets sent using tcpreplay simply skip the filters in the kernel
> and are being injected somewhere afterwards. But this theory is
> problematic since I find it strange that the packets do end up in the
> default queue after all - hence they ARE seen by tc and they don't
> skip tc entirely.

I am not sure off the top of my head why that would happen. I will try later
to install tcpreplay and reproduce your test.

cheers,
jamal



* Re: [Bugme-new] [Bug 38102] New: BUG kmalloc-2048: Poison overwritten
From: Eric Dumazet @ 2011-07-04 14:00 UTC (permalink / raw)
  To: Alexey Zaytsev
  Cc: Andrew Morton, netdev, Gary Zambrano, bugme-daemon,
	David S. Miller, Pekka Pietikainen, Florian Schirmer,
	Felix Fietkau, Michael Buesch
In-Reply-To: <CAB9v_DG7e8vDE+4PDwuOR2DYG1FEvUM1fxa+e4a=swwTYGJ9nQ@mail.gmail.com>

On Monday, 04 July 2011 at 15:48 +0400, Alexey Zaytsev wrote:

> 
> This might fix a potential problem, but unfortunately did not help here.
> 
> There is an other place that looks suspicious to me:
> 
>  812                         struct sk_buff *copy_skb;
>  813
>  814                         b44_recycle_rx(bp, cons, bp->rx_prod);
>  815                         copy_skb = netdev_alloc_skb(bp->dev, len + 2);
>  816                         if (copy_skb == NULL)
>  817                                 goto drop_it_no_recycle;
>  818
>  819                         skb_reserve(copy_skb, 2);
>  820                         skb_put(copy_skb, len);
>  821                         /* DMA sync done above, copy just the
> actual packet */
>  822                         skb_copy_from_linear_data_offset(skb,
> RX_PKT_OFFSET,
>  823
> copy_skb->data, len);
>  824                         skb = copy_skb;
> 
> 
> The skb is reinserted into the ring before its data is copied, it
> seems. But this can't be the cause of my problem, as it would lead to
> data corruption at most, not a write-after-free.
> 
> And an other question. Why so we have the logic to work-around the 1Gb
> DMA limit instead of just setting the dma mask?

Your problem is on the RX side: the NIC actually writes to a buffer that
it supposedly no longer owns.

If the DMA workaround is triggered, all frames are copied, so the bug has no
chance to trigger, because we feed a totally new frame to the upper stack
and keep reusing the 'DMA' frames for the device itself.






* Re: [Bugme-new] [Bug 38102] New: BUG kmalloc-2048: Poison overwritten
From: Eric Dumazet @ 2011-07-04 13:57 UTC (permalink / raw)
  To: Michael Büsch
  Cc: Alexey Zaytsev, Andrew Morton, netdev, Gary Zambrano,
	bugme-daemon, David S. Miller, Pekka Pietikainen,
	Florian Schirmer, Felix Fietkau, Michael Buesch
In-Reply-To: <20110704130531.37cf876e@Nokia-N900>

On Monday, 04 July 2011 at 13:05 +0000, Michael Büsch wrote:
> On Mon, 4 Jul 2011 15:48:31 +0400
> Alexey Zaytsev <alexey.zaytsev@gmail.com> wrote:
> > The skb is reinserted into the ring before its data is copied, it
> > seems. But this can't be the cause of my problem, as it would lead to
> > data corruption at most, not a write-after-free.
> 
> Recycling the skb does not imply that the device can reuse it immediately. The device is told at the very end of the RX function (after the loop) that it's now safe to put stuff into the recyceled/new buffers.
> 
> > And an other question. Why so we have the logic to work-around the 1Gb
> > DMA limit instead of just setting the dma mask?
> 
> Because the DMA mask does not work correctly on all arches for masks smaller than 4G.
> 
> And btw, I dont understand what that wmb() patch is supposed to fix. There may be a wmb() missing, but rather after the ctrl _and_ the address assignment to the descriptor.
> But I don't think this can cause this use-after-free anyway.
> 

I don't have the b44 specs, but:

For sure, addr should be set before ctl, just in case ctl allows the chip to
start a DMA transfer (to the previous packet), because an OWN bit is unset,
for example...

A second wmb() is not necessary.
It will be done eventually at the next packet (we have a ring of 200
packets).





* [PATCHv2] sctp: Enforce retransmission limit during shutdown
From: Thomas Graf @ 2011-07-04 13:50 UTC (permalink / raw)
  To: Vladislav Yasevich
  Cc: netdev, davem, Wei Yongjun, Sridhar Samudrala, linux-sctp
In-Reply-To: <4E0C8368.5090502@hp.com>

When initiating a graceful shutdown while having data chunks
on the retransmission queue with a peer which is in zero
window mode the shutdown is never completed because the
retransmission error count is reset periodically by the
following two rules:

 - Do not timeout association while doing zero window probe.
 - Reset overall error count when a heartbeat request has
   been acknowledged.

The graceful shutdown will wait for all outstanding TSN to
be acknowledged before sending the SHUTDOWN request. This
never happens due to the peer's zero window not acknowledging
the continuously retransmitted data chunks. Although the
error counter is incremented for each failed retransmission,
the receiving of the SACK announcing the zero window clears
the error count again immediately. Also heartbeat requests
continue to be sent periodically. The peer acknowledges these
requests causing the error counter to be reset as well.

This patch changes behaviour to only reset the overall error
counter for the above rules while not in shutdown. After
reaching the maximum number of retransmission attempts, the
T5 shutdown guard timer is scheduled to give the receiver
some additional time to recover. The timer is stopped as soon
as the receiver acknowledges any data.

The issue can be easily reproduced by establishing an SCTP
association over the loopback device, constantly queueing
data at the sender while not reading any at the receiver.
Wait for the window to reach zero, then initiate a shutdown
by killing both processes simultaneously. The association
will never be freed and the chunks on the retransmission
queue will be retransmitted indefinitely.

Signed-off-by: Thomas Graf <tgraf@infradead.org>
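
The counter behaviour described above can be modelled in a few lines. This is a simplified sketch of the rule only, assuming states are ordered so that a `>=` comparison selects the shutdown states; `struct assoc_sketch` and the function names are hypothetical and the T5 timer is reduced to a return value:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical state list, ordered so shutdown states compare higher. */
enum assoc_state { ST_ESTABLISHED, ST_SHUTDOWN_PENDING, ST_SHUTDOWN_SENT };

struct assoc_sketch {
	enum assoc_state state;
	int overall_error_count;
	int max_retrans;
};

/* Heartbeat ACK or zero-window SACK seen: the peer is alive. */
static void on_peer_alive(struct assoc_sketch *a)
{
	assert(a != NULL);
	/* Only forgive past errors while not shutting down. */
	if (!(a->state >= ST_SHUTDOWN_PENDING))
		a->overall_error_count = 0;
}

/* Retransmission timed out; returns 1 once the limit is exceeded,
 * which is where a guard timer would be armed. */
static int on_rtx_timeout(struct assoc_sketch *a)
{
	assert(a != NULL);
	a->overall_error_count++;
	return a->overall_error_count > a->max_retrans;
}
```

In the established state the count keeps being cleared, so the limit is never reached; once shutdown is pending, the count accumulates across heartbeat ACKs and the association can finally time out.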

diff --git a/net/sctp/outqueue.c b/net/sctp/outqueue.c
index 1c88c89..0ae911f 100644
--- a/net/sctp/outqueue.c
+++ b/net/sctp/outqueue.c
@@ -1582,6 +1582,9 @@ static void sctp_check_transmitted(struct sctp_outq *q,
 #endif /* SCTP_DEBUG */
 	if (transport) {
 		if (bytes_acked) {
+			struct sctp_association *asoc = transport->asoc;
+			struct timer_list *t;
+
 			/* We may have counted DATA that was migrated
 			 * to this transport due to DEL-IP operation.
 			 * Subtract those bytes, since the were never
@@ -1600,6 +1603,17 @@ static void sctp_check_transmitted(struct sctp_outq *q,
 			transport->error_count = 0;
 			transport->asoc->overall_error_count = 0;
 
+			/*
+			 * While in SHUTDOWN PENDING, we may have started
+			 * the T5 shutdown guard timer after reaching the
+			 * retransmission limit. Stop that timer as soon
+			 * as the receiver acknowledged any data.
+			 */
+			t = &asoc->timers[SCTP_EVENT_TIMEOUT_T5_SHUTDOWN_GUARD];
+			if (asoc->state == SCTP_STATE_SHUTDOWN_PENDING &&
+			    timer_pending(t) && del_timer(t))
+				sctp_association_put(asoc);
+
 			/* Mark the destination transport address as
 			 * active if it is not so marked.
 			 */
@@ -1629,10 +1643,15 @@ static void sctp_check_transmitted(struct sctp_outq *q,
 			 * A sender is doing zero window probing when the
 			 * receiver's advertised window is zero, and there is
 			 * only one data chunk in flight to the receiver.
+			 *
+			 * Allow the association to timeout if SHUTDOWN is
+			 * pending in case the receiver stays in zero window
+			 * mode forever.
 			 */
 			if (!q->asoc->peer.rwnd &&
 			    !list_empty(&tlist) &&
-			    (sack_ctsn+2 == q->asoc->next_tsn)) {
+			    (sack_ctsn+2 == q->asoc->next_tsn) &&
+			    !(q->asoc->state >= SCTP_STATE_SHUTDOWN_PENDING)) {
 				SCTP_DEBUG_PRINTK("%s: SACK received for zero "
 						  "window probe: %u\n",
 						  __func__, sack_ctsn);
diff --git a/net/sctp/sm_sideeffect.c b/net/sctp/sm_sideeffect.c
index 534c2e5..fa92f4d6 100644
--- a/net/sctp/sm_sideeffect.c
+++ b/net/sctp/sm_sideeffect.c
@@ -670,10 +670,21 @@ static void sctp_cmd_transport_on(sctp_cmd_seq_t *cmds,
 	/* 8.3 Upon the receipt of the HEARTBEAT ACK, the sender of the
 	 * HEARTBEAT should clear the error counter of the destination
 	 * transport address to which the HEARTBEAT was sent.
-	 * The association's overall error count is also cleared.
 	 */
 	t->error_count = 0;
-	t->asoc->overall_error_count = 0;
+
+	/*
+	 * Although RFC2960 and RFC4460 specify that the overall error
+	 * count must be cleared when a HEARTBEAT ACK is received this
+	 * behaviour may prevent the maximum retransmission count from
+	 * being reached while in SHUTDOWN. If the peer keeps its window
+	 * closed not acknowledging any outstanding TSN we may rely on
+	 * reaching the max_retrans limit via the T3-rtx timer to close
+	 * the association which will never happen if the error count is
+	 * reset every heartbeat interval.
+	 */
+	if (!(t->asoc->state >= SCTP_STATE_SHUTDOWN_PENDING))
+		t->asoc->overall_error_count = 0;
 
 	/* Clear the hb_sent flag to signal that we had a good
 	 * acknowledgement.
diff --git a/net/sctp/sm_statefuns.c b/net/sctp/sm_statefuns.c
index a297283..e6a0c35 100644
--- a/net/sctp/sm_statefuns.c
+++ b/net/sctp/sm_statefuns.c
@@ -5154,7 +5154,7 @@ sctp_disposition_t sctp_sf_do_9_2_start_shutdown(
 	 * The sender of the SHUTDOWN MAY also start an overall guard timer
 	 * 'T5-shutdown-guard' to bound the overall time for shutdown sequence.
 	 */
-	sctp_add_cmd_sf(commands, SCTP_CMD_TIMER_START,
+	sctp_add_cmd_sf(commands, SCTP_CMD_TIMER_RESTART,
 			SCTP_TO(SCTP_EVENT_TIMEOUT_T5_SHUTDOWN_GUARD));
 
 	if (asoc->autoclose)
@@ -5299,14 +5299,28 @@ sctp_disposition_t sctp_sf_do_6_3_3_rtx(const struct sctp_endpoint *ep,
 	SCTP_INC_STATS(SCTP_MIB_T3_RTX_EXPIREDS);
 
 	if (asoc->overall_error_count >= asoc->max_retrans) {
-		sctp_add_cmd_sf(commands, SCTP_CMD_SET_SK_ERR,
-				SCTP_ERROR(ETIMEDOUT));
-		/* CMD_ASSOC_FAILED calls CMD_DELETE_TCB. */
-		sctp_add_cmd_sf(commands, SCTP_CMD_ASSOC_FAILED,
-				SCTP_PERR(SCTP_ERROR_NO_ERROR));
-		SCTP_INC_STATS(SCTP_MIB_ABORTEDS);
-		SCTP_DEC_STATS(SCTP_MIB_CURRESTAB);
-		return SCTP_DISPOSITION_DELETE_TCB;
+		if (asoc->state == SCTP_STATE_SHUTDOWN_PENDING) {
+			/*
+			 * We are here likely because the receiver had its rwnd
+			 * closed for a while and we have not been able to
+			 * transmit the locally queued data within the maximum
+			 * retransmission attempts limit.  Start the T5
+			 * shutdown guard timer to give the receiver one last
+			 * chance and some additional time to recover before
+			 * aborting.
+			 */
+			sctp_add_cmd_sf(commands, SCTP_CMD_TIMER_RESTART,
+				SCTP_TO(SCTP_EVENT_TIMEOUT_T5_SHUTDOWN_GUARD));
+		} else {
+			sctp_add_cmd_sf(commands, SCTP_CMD_SET_SK_ERR,
+					SCTP_ERROR(ETIMEDOUT));
+			/* CMD_ASSOC_FAILED calls CMD_DELETE_TCB. */
+			sctp_add_cmd_sf(commands, SCTP_CMD_ASSOC_FAILED,
+					SCTP_PERR(SCTP_ERROR_NO_ERROR));
+			SCTP_INC_STATS(SCTP_MIB_ABORTEDS);
+			SCTP_DEC_STATS(SCTP_MIB_CURRESTAB);
+			return SCTP_DISPOSITION_DELETE_TCB;
+		}
 	}
 
 	/* E1) For the destination address for which the timer
diff --git a/net/sctp/sm_statetable.c b/net/sctp/sm_statetable.c
index 0338dc6..7c211a7 100644
--- a/net/sctp/sm_statetable.c
+++ b/net/sctp/sm_statetable.c
@@ -827,7 +827,7 @@ static const sctp_sm_table_entry_t other_event_table[SCTP_NUM_OTHER_TYPES][SCTP_
 	/* SCTP_STATE_ESTABLISHED */ \
 	TYPE_SCTP_FUNC(sctp_sf_timer_ignore), \
 	/* SCTP_STATE_SHUTDOWN_PENDING */ \
-	TYPE_SCTP_FUNC(sctp_sf_timer_ignore), \
+	TYPE_SCTP_FUNC(sctp_sf_t5_timer_expire), \
 	/* SCTP_STATE_SHUTDOWN_SENT */ \
 	TYPE_SCTP_FUNC(sctp_sf_t5_timer_expire), \
 	/* SCTP_STATE_SHUTDOWN_RECEIVED */ \

^ permalink raw reply related

* Re: libpcap and tc filters
From: Adam Katz @ 2011-07-04 13:24 UTC (permalink / raw)
  To: jhs; +Cc: netdev
In-Reply-To: <1309784740.26180.21.camel@mojatatu>

ok, I checked now and the packets sent by tcpreplay are identical to
the ones captured originally by wireshark.

I'm using the stock ubuntu 10.04 kernel that wasn't compiled with
CONFIG_CLS_U32_PERF so sudo tc -s filter ls dev eth1 shows nothing
useful (and i'm not sure that recompiling the entire kernel is worth
it to tell me what I already know - that these packets missed the
filters... but i'm willing to do it if you think that'll help).
Anyway, I suspect the problem to be something else - I suspect that
the packets sent using tcpreplay simply skip the filters in the kernel
and are being injected somewhere afterwards. But this theory is
problematic since I find it strange that the packets do end up in the
default queue after all - hence they ARE seen by tc and they don't
skip tc entirely.

On Mon, Jul 4, 2011 at 4:05 PM, jamal <hadi@cyberus.ca> wrote:
> On Mon, 2011-07-04 at 15:37 +0300, Adam Katz wrote:
>> here's a more concrete example:
>>
>> An example configuration:
>>
>>       sudo tc qdisc add dev eth1 root handle 1: prio priomap 2 2 2 2 2 2 2
>> 2 2 2 2 2 2 2 2 2
>>       sudo tc qdisc add dev eth1 parent 1:1 handle 10: pfifo
>>       sudo tc qdisc add dev eth1 parent 1:2 handle 20: pfifo
>>       sudo tc qdisc add dev eth1 parent 1:3 handle 30: pfifo
>>       sudo tc filter add dev eth1 protocol ip parent 1: prio 1 u32 match ip
>> dport 22 0xffff flowid 1:1
>>       sudo tc filter add dev eth1 protocol ip parent 1: prio 1 u32 match ip
>> sport 22 0xffff flowid 1:1
>>       sudo tc filter add dev eth1 protocol ip parent 1: prio 2 u32 match ip
>> dport 80 0xffff flowid 1:2
>>       sudo tc filter add dev eth1 protocol ip parent 1: prio 2 u32 match ip
>> sport 80 0xffff flowid 1:2
>
>
> looks fine.
>
>> I then used scp to copy a small file between computers while capturing
>> with wireshark:
>>
>> http://dl.dropbox.com/u/3237005/port22example.pcap
>>
>> and later I replayed the same capture using tcpreplay.
>> When using scp, the packets once again ended up where they should be
>> (1:1 in this configuration). With tcpreplay they ended up in the
>> default 1:3
>
> Where is the capture from tcpreplay? What I was asking is that you validate
> that the capture before and what is sent out by tcpreplay look the same.
> Can you please do that?
> It is possible that, because your filters are not matched, the packets end
> up on your default queue based on the TOS value.
>
> If you have your kernel compiled with CONFIG_CLS_U32_PERF you should
> see when the filters get hit as well
> (do something like sudo tc -s filter ls dev eth1)
>
>
> cheers,
> jamal
>
>
>
>

^ permalink raw reply

* Re: libpcap and tc filters
From: jamal @ 2011-07-04 13:05 UTC (permalink / raw)
  To: Adam Katz; +Cc: netdev
In-Reply-To: <CAA0qwj7cH8Ah69fBMkXpGkwG77TH_ZqMhKwj-Cc8Vc=F5c9SSw@mail.gmail.com>

On Mon, 2011-07-04 at 15:37 +0300, Adam Katz wrote:
> here's a more concrete example:
> 
> An example configuration:
> 
> 	sudo tc qdisc add dev eth1 root handle 1: prio priomap 2 2 2 2 2 2 2
> 2 2 2 2 2 2 2 2 2
> 	sudo tc qdisc add dev eth1 parent 1:1 handle 10: pfifo 	
> 	sudo tc qdisc add dev eth1 parent 1:2 handle 20: pfifo
> 	sudo tc qdisc add dev eth1 parent 1:3 handle 30: pfifo
> 	sudo tc filter add dev eth1 protocol ip parent 1: prio 1 u32 match ip
> dport 22 0xffff flowid 1:1
> 	sudo tc filter add dev eth1 protocol ip parent 1: prio 1 u32 match ip
> sport 22 0xffff flowid 1:1
> 	sudo tc filter add dev eth1 protocol ip parent 1: prio 2 u32 match ip
> dport 80 0xffff flowid 1:2
> 	sudo tc filter add dev eth1 protocol ip parent 1: prio 2 u32 match ip
> sport 80 0xffff flowid 1:2


looks fine.

> I then used scp to copy a small file between computers while capturing
> with wireshark:
> 
> http://dl.dropbox.com/u/3237005/port22example.pcap
> 
> and later I replayed the same capture using tcpreplay.
> When using scp, the packets once again ended up where they should be
> (1:1 in this configuration). With tcpreplay they ended up in the
> default 1:3

Where is the capture from tcpreplay? What I was asking is that you validate
that the capture before and what is sent out by tcpreplay look the same.
Can you please do that?
It is possible that, because your filters are not matched, the packets end
up on your default queue based on the TOS value.

If you have your kernel compiled with CONFIG_CLS_U32_PERF you should
see when the filters get hit as well
(do something like sudo tc -s filter ls dev eth1)
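
As a rough illustration of the fallback described above, here is a minimal
C model of the classification order for the posted configuration. It is
only a sketch: demo_pkt and demo_classify are hypothetical names, ports are
plain host-order integers, and it does not reproduce the u32 classifier or
explain why the replayed packets miss the filters. It only shows that a
packet matching no filter is placed by the priomap, which in this setup
sends every priority to band 2 (class 1:3).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified view of a packet: only what the model needs. */
struct demo_pkt {
	uint16_t sport;	/* source port, host byte order */
	uint16_t dport;	/* destination port, host byte order */
	uint8_t prio;	/* priority index used for the priomap lookup (0..15) */
};

/*
 * Return the minor number of the class (1, 2 or 3 for 1:1, 1:2, 1:3)
 * that the posted rules would select in this simplified model.
 */
static int demo_classify(const struct demo_pkt *p, const uint8_t priomap[16])
{
	/* Filters with prio 1: port 22 in either direction goes to 1:1. */
	if (p->dport == 22 || p->sport == 22)
		return 1;

	/* Filters with prio 2: port 80 in either direction goes to 1:2. */
	if (p->dport == 80 || p->sport == 80)
		return 2;

	/* No filter matched: fall back to the priomap; band n is class 1:(n+1). */
	return priomap[p->prio & 15] + 1;
}
```

With a priomap of sixteen 2s, as in the configuration above, any packet that
no filter matches lands in 1:3 regardless of its priority.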


cheers,
jamal




^ permalink raw reply

* Re: [Bugme-new] [Bug 38102] New: BUG kmalloc-2048: Poison overwritten
From: Michael Büsch @ 2011-07-04 13:05 UTC (permalink / raw)
  To: Alexey Zaytsev
  Cc: Eric Dumazet, Andrew Morton, netdev, Gary Zambrano, bugme-daemon,
	David S. Miller, Pekka Pietikainen, Florian Schirmer,
	Felix Fietkau, Michael Buesch
In-Reply-To: <CAB9v_DG7e8vDE+4PDwuOR2DYG1FEvUM1fxa+e4a=swwTYGJ9nQ@mail.gmail.com>

On Mon, 4 Jul 2011 15:48:31 +0400
Alexey Zaytsev <alexey.zaytsev@gmail.com> wrote:
> The skb is reinserted into the ring before its data is copied, it
> seems. But this can't be the cause of my problem, as it would lead to
> data corruption at most, not a write-after-free.

Recycling the skb does not imply that the device can reuse it immediately. The device is told at the very end of the RX function (after the loop) that it's now safe to put stuff into the recycled/new buffers.

> And another question: why do we have the logic to work around the 1GB
> DMA limit instead of just setting the dma mask?

Because the DMA mask does not work correctly on all arches for masks smaller than 4G.

And btw, I don't understand what that wmb() patch is supposed to fix. There may be a wmb() missing, but rather after the ctrl _and_ the address assignment to the descriptor.
But I don't think this can cause this use-after-free anyway.
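
To make the ordering concrete, here is a userspace sketch of "fill the
descriptor, then barrier, then advance the ring pointer". It is an analogue
under stated assumptions, not b44 code: demo_desc, demo_ring and
demo_publish are hypothetical names, and a C11 release fence stands in for
the kernel's wmb(), which has different semantics for device-visible memory.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define DEMO_RING_SIZE 64u

/* Hypothetical descriptor: a control word and a buffer address. */
struct demo_desc {
	uint32_t ctrl;
	uint32_t addr;
};

struct demo_ring {
	struct demo_desc desc[DEMO_RING_SIZE];
	_Atomic uint32_t hw_ptr;	/* stands in for the device's ring pointer */
};

/* Fill descriptor idx, then make it visible by advancing the ring pointer. */
static void demo_publish(struct demo_ring *r, uint32_t idx,
			 uint32_t ctrl, uint32_t addr)
{
	struct demo_desc *d = &r->desc[idx % DEMO_RING_SIZE];

	d->ctrl = ctrl;
	d->addr = addr;

	/*
	 * Analogue of wmb(): both descriptor stores above must be ordered
	 * before the pointer store below.
	 */
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&r->hw_ptr, (idx + 1) % DEMO_RING_SIZE,
			      memory_order_relaxed);
}
```

The barrier sits after both the ctrl and the address stores, so a consumer
that observes the new pointer value cannot see a half-written descriptor.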

^ permalink raw reply

* [PATCH] net/core: Make urgent data inline by default
From: Esa-Pekka Pyokkimies @ 2011-07-04 12:51 UTC (permalink / raw)
  To: netdev

Make urgent data inline by default. As explained in RFC 6093, urgent
data should never be handled out-of-band.

"The TCP urgent mechanism is NOT a mechanism for sending "out-of-band"
  data: the so-called "urgent data" should be delivered "in-line" to
  the TCP user."

Signed-off-by: Esa-Pekka Pyokkimies <esa-pekka.pyokkimies@stonesoft.com>

---
  net/core/sock.c |    1 +
  1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/net/core/sock.c b/net/core/sock.c
index 6e81978..83234bd 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1985,6 +1985,7 @@ void sock_init_data(struct socket *sock, struct sock  
*sk)
         sk_set_socket(sk, sock);

         sock_set_flag(sk, SOCK_ZAPPED);
+	sock_set_flag(sk, SOCK_URGINLINE);

         if (sock) {
                 sk->sk_type     =       sock->type;
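
For comparison, an application can already request in-line delivery per
socket with the standard SO_OOBINLINE option, which on Linux is backed by
the same SOCK_URGINLINE flag. A minimal sketch follows; the helper names
demo_set_oobinline and demo_get_oobinline are hypothetical and not part of
the patch.

```c
#include <assert.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/* Ask for urgent data to be delivered in-line; 0 on success, -1 on error. */
static int demo_set_oobinline(int fd)
{
	int one = 1;

	return setsockopt(fd, SOL_SOCKET, SO_OOBINLINE, &one, sizeof(one));
}

/* Read the option back; 1 if enabled, 0 if disabled, -1 on error. */
static int demo_get_oobinline(int fd)
{
	int val = 0;
	socklen_t len = sizeof(val);

	if (getsockopt(fd, SOL_SOCKET, SO_OOBINLINE, &val, &len) < 0)
		return -1;
	return val != 0;
}
```

Calling demo_set_oobinline() on a freshly created TCP socket gives that one
socket the behaviour the patch would make the default for every socket.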

^ permalink raw reply related

* Re: libpcap and tc filters
From: Adam Katz @ 2011-07-04 12:37 UTC (permalink / raw)
  To: jhs; +Cc: netdev
In-Reply-To: <CAA0qwj70-qUQ+6NL=2LP05TyMN99MdL1BF8tko+v=DhdyjZjJQ@mail.gmail.com>

here's a more concrete example:

An example configuration:

	sudo tc qdisc add dev eth1 root handle 1: prio priomap 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2
	sudo tc qdisc add dev eth1 parent 1:1 handle 10: pfifo 	
	sudo tc qdisc add dev eth1 parent 1:2 handle 20: pfifo
	sudo tc qdisc add dev eth1 parent 1:3 handle 30: pfifo

	sudo tc filter add dev eth1 protocol ip parent 1: prio 1 u32 match ip
dport 22 0xffff flowid 1:1
	sudo tc filter add dev eth1 protocol ip parent 1: prio 1 u32 match ip
sport 22 0xffff flowid 1:1
	sudo tc filter add dev eth1 protocol ip parent 1: prio 2 u32 match ip
dport 80 0xffff flowid 1:2
	sudo tc filter add dev eth1 protocol ip parent 1: prio 2 u32 match ip
sport 80 0xffff flowid 1:2

I then used scp to copy a small file between computers while capturing
with wireshark:

http://dl.dropbox.com/u/3237005/port22example.pcap

and later I replayed the same capture using tcpreplay.
When using scp, the packets once again ended up where they should be
(1:1 in this configuration). With tcpreplay they ended up in the
default 1:3

On Mon, Jul 4, 2011 at 3:06 PM, jamal <hadi@cyberus.ca> wrote:
> On Mon, 2011-07-04 at 15:00 +0300, Adam Katz wrote:
>> Ok, I just tried this:
>>
>> I've opened www.example.com using a browser while capturing with
>> wireshark. TC placed all port 80 packets into the 1:1 as required. I
>> then physically plugged the nic to a different socket, one that isn't
>> connected to the internet (i did this so I wont get any server
>> responses to the packets i'm sending) and then I replayed the capture
>> of me opening www.example.com
>>
>> The second time, none of the packets ended up in 1:1 - they all went
>> to the default class despite being the EXACT same traffic.
>>
>
> Please post a small sample of the tcpdumps and the tc rules you used.
>
> cheers,
> jamal
>
>



On Mon, Jul 4, 2011 at 3:01 PM, Adam Katz <adamkatz0@gmail.com> wrote:
> Ok, I just tried this:
>
> I've opened www.example.com using a browser while capturing with
> wireshark. TC placed all port 80 packets into the 1:1 as required. I
> then physically plugged the nic to a different socket, one that isn't
> connected to the internet (i did this so I wont get any server
> responses to the packets i'm sending) and then I replayed the capture
> of me opening www.example.com
>
> The second time, none of the packets ended up in 1:1 - they all went
> to the default class despite being the EXACT same traffic.
>
> On Mon, Jul 4, 2011 at 2:11 PM, jamal <hadi@cyberus.ca> wrote:
>>
>> Capture tcpdump for both scenario that works and one
>> that doesnt.
>> Make sure the filters match your failing scenario.
>>
>> cheers,
>> jamal
>>
>> On Mon, 2011-07-04 at 10:38 +0300, Adam Katz wrote:
>>> Hi Everyone
>>>
>>> I'm sorry for littering the mailing list with this question, but no
>>> other place could help me..
>>>
>>> I'm attempting to use tc to shape traffic sent using libpcap, I'm
>>> doing this for a research project. I have a classful scheduler with
>>> several classes, to this scheduler I attach a few filters based on
>>> destination tcp ports.
>>>
>>> My problem is this: When sending packets using a normal userland
>>> socket, the filters work and I see the appropriate traffic entering
>>> the right class. BUT when sending packets with libpcap, all packets
>>> end up in the scheduler's default band as if the filters simply refuse
>>> to work.
>>>
>>> Can anyone suggest a solution?
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe netdev" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>
>>
>

^ permalink raw reply

* RE: bnx2: FTQ dump on heavy workload(bnx2-2.0.23b + kernel 2.6.32.36)
From: MaoXiaoyun @ 2011-07-04 12:20 UTC (permalink / raw)
  To: netdev; +Cc: mchan
In-Reply-To: <BLU157-w1411DC0D9D19FE3A6935F5DA5C0@phx.gbl>


Could it be caused by a timeout similar to the one in http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=c441b8d2cb2194b05550a558d6d95d8944e56a84.

Maybe the timeout still happens in my test scenario.

From the patch, BNX2_MISC_ECO_HW_CTL is defined as 0x000008cc, but I cannot find
that definition in the Programmer's Reference Guide (NetXtremeII-PG203-R.pdf). Could
someone point it out for me, or is the doc out of date?

Also, is there a way to confirm whether the timeout really happens (which register
should I read)? Or is there a bigger timeout I can set?
 
thanks.
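
For reading the ring state in the quoted log below: it reports tx_prod
32641, tx_cons 32474 and bnx2_tx_avail 88, and the hardware consumer index
equals the software one (32474), so 167 descriptors are queued and not yet
completed. The arithmetic can be sketched as follows. This is a simplified
model, not the driver's function: demo_tx_avail is a hypothetical name, and
the ring size of 255 is an assumption inferred from the logged numbers.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Free TX descriptors for 16-bit producer/consumer indices that wrap.
 * ring_size is a parameter of this model (255 is assumed for the log).
 */
static uint32_t demo_tx_avail(uint16_t prod, uint16_t cons, uint32_t ring_size)
{
	/* Unsigned 16-bit subtraction handles index wrap-around. */
	uint16_t in_flight = (uint16_t)(prod - cons);

	assert(in_flight <= ring_size);
	return ring_size - in_flight;
}
```

With the logged values, demo_tx_avail(32641, 32474, 255) gives 255 - 167 = 88,
which agrees with the "bnx2_tx_avail 88" line.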

----------------------------------------
> From: tinnycloud@hotmail.com
> To: netdev@vger.kernel.org
> Subject: bnx2: FTQ dump on heavy workload(bnx2-2.0.23b + kernel 2.6.32.36)
> Date: Mon, 4 Jul 2011 15:40:01 +0800
>
>
> Hi:
>
> I hit the bnx2 FTQ dump over and over again while testing Xen live migration, which generates
> heavy network workload.
>
> I have two physical machines, both with Xen 4.0.1, kernel 2.6.32.36 and bnx2 2.0.23b installed.
> I start 15 virtual machines in total and migrate them between the hosts over and over again.
> After about 16 hours the network stops working; sometimes it resets successfully, sometimes it
> causes a kernel crash.
>
> I've tried some debugging by adding code to the driver. Below is the code that runs when the FTQ dump happens.
> It looks like the NIC stops transmitting packets, which causes the timeout.
>
> BTW, cpu max_cstate=1 in my grub config.
>
> Thanks.
>
> --------------
> static void
> bnx2_tx_timeout(struct net_device *dev)
> {
> struct bnx2 *bp = netdev_priv(dev);
> struct bnx2_napi *bnapi = &bp->bnx2_napi[0];
> struct bnx2_tx_ring_info *txr = &bnapi->tx_ring;
> struct bnx2_rx_ring_info *rxr = &bnapi->rx_ring;
> int i ;
> bnx2_dump_ftq(bp);
> bnx2_dump_state(bp);
> if (stop_on_tx_timeout) {
> printk(KERN_WARNING PFX
> "%s: prevent chip reset during tx timeout\n",
> bp->dev->name);
> smp_rmb();
> printk("last status idx %d \n", bnapi->last_status_idx);
> printk("hw_tx_cons %d, txr->hw_tx_conds %d txr->tx_prod %d txr->tx_cons %d\n",
> bnx2_get_hw_tx_cons(bnapi), txr->hw_tx_cons, txr->tx_prod, txr->tx_cons);
> printk("hw_rx_cons %d, txr->hw_rx_conds %d\n", bnx2_get_hw_rx_cons(bnapi), rxr->rx_cons);
> printk("sblk->status_attn_bits %d\n",bnapi->status_blk.msi->status_attn_bits);
> printk("sblk->status_attn_bits_ack %d\n",bnapi->status_blk.msi->status_attn_bits_ack);
> printk("bnx2_tx_avail %d \n",(bnx2_tx_avail(bp, txr)));
> printk("sblk->status_tx_quick_consumer_index0 %d\n",bnapi->status_blk.msi->status_tx_quick_consumer_index0);
> printk("sblk->status_tx_quick_consumer_index1 %d\n",bnapi->status_blk.msi->status_tx_quick_consumer_index1);
> printk("sblk->status_tx_quick_consumer_index2 %d\n",bnapi->status_blk.msi->status_tx_quick_consumer_index2);
> printk("sblk->status_tx_quick_consumer_index3 %d\n",bnapi->status_blk.msi->status_tx_quick_consumer_index3);
> printk("sblk->status_rx_quick_consumer_index0 %d\n",bnapi->status_blk.msi->status_rx_quick_consumer_index0);
> printk("sblk->status_rx_quick_consumer_index1 %d\n",bnapi->status_blk.msi->status_rx_quick_consumer_index1);
> printk("sblk->status_rx_quick_consumer_index2 %d\n",bnapi->status_blk.msi->status_rx_quick_consumer_index2);
> printk("sblk->status_rx_quick_consumer_index3 %d\n",bnapi->status_blk.msi->status_rx_quick_consumer_index3);
> printk("sblk->status_rx_quick_consumer_index4 %d\n",bnapi->status_blk.msi->status_rx_quick_consumer_index4);
> printk("sblk->status_rx_quick_consumer_index5 %d\n",bnapi->status_blk.msi->status_rx_quick_consumer_index5);
> printk("sblk->status_rx_quick_consumer_index6 %d\n",bnapi->status_blk.msi->status_rx_quick_consumer_index6);
> printk("sblk->status_rx_quick_consumer_index7 %d\n",bnapi->status_blk.msi->status_rx_quick_consumer_index7);
> printk("sblk->status_rx_quick_consumer_index8 %d\n",bnapi->status_blk.msi->status_rx_quick_consumer_index8);
> printk("sblk->status_rx_quick_consumer_index9 %d\n",bnapi->status_blk.msi->status_rx_quick_consumer_index9);
> printk("sblk->status_rx_quick_consumer_index10 %d\n",bnapi->status_blk.msi->status_rx_quick_consumer_index10);
> printk("sblk->status_rx_quick_consumer_index11 %d\n",bnapi->status_blk.msi->status_rx_quick_consumer_index11);
> printk("sblk->status_rx_quick_consumer_index12 %d\n",bnapi->status_blk.msi->status_rx_quick_consumer_index12);
> printk("sblk->status_rx_quick_consumer_index13 %d\n",bnapi->status_blk.msi->status_rx_quick_consumer_index13);
> printk("sblk->status_rx_quick_consumer_index14 %d\n",bnapi->status_blk.msi->status_rx_quick_consumer_index14);
> printk("sblk->status_rx_quick_consumer_index15 %d\n",bnapi->status_blk.msi->status_rx_quick_consumer_index15);
> printk("sblk->status_completion_producer_index %d\n",bnapi->status_blk.msi->status_completion_producer_index);
> printk("sblk->status_cmd_consumer_index %d\n",bnapi->status_blk.msi->status_cmd_consumer_index);
> printk("sblk->status_idx %d\n",bnapi->status_blk.msi->status_idx);
> printk("sblk->status_unused %d\n",bnapi->status_blk.msi->status_unused);
> printk("sblk->status_blk_num %d\n",bnapi->status_blk.msi->status_blk_num);
> is_timedout = 1;
> for (i = 0; i < bp->irq_nvecs; i++) {
> bnapi = &bp->bnx2_napi[i];
> bnx2_tx_int(bp, bnapi, 0);
> }
> return;
> }
> -----------------
>
> -------------FTQ log in /var/log/message
> ------------[ cut here ]------------
> WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0x105/0x16a()
> Hardware name: Tecal RH2285
> Modules linked in: iptable_filter ip_tables nfs fscache nfs_acl auth_rpcgss bridge stp llc autofs4 ipmi_devintf ipmi_si ipmi_msghandler lockd sunrpc ipv6 xenfs dm_multipath fuse xen_netback xen_blkback blktap blkback_pagemap loop nbd video output sbs sbshc parport_pc lp parport snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss bnx2 serio_raw snd_pcm snd_timer snd soundcore snd_page_alloc i2c_i801 iTCO_wdt iTCO_vendor_support i2c_core pata_acpi ata_generic pcspkr ata_piix shpchp mptsas mptscsih mptbase [last unloaded: freq_table]
> Pid: 0, comm: swapper Not tainted 2.6.32.36xen #1
> Call Trace:
> <IRQ> [<ffffffff813ba154>] ? dev_watchdog+0x105/0x16a
> [<ffffffff81056666>] warn_slowpath_common+0x7c/0x94
> [<ffffffff81056738>] warn_slowpath_fmt+0xa4/0xa6
> [<ffffffff81080bfa>] ? clockevents_program_event+0x78/0x81
> [<ffffffff81081fce>] ? tick_program_event+0x2a/0x2c
> [<ffffffff813b951d>] ? __netif_tx_lock+0x1b/0x24
> [<ffffffff813b95a8>] ? netif_tx_lock+0x46/0x6e
> [<ffffffff813a3ed1>] ? netdev_drivername+0x48/0x4f
> [<ffffffff813ba154>] dev_watchdog+0x105/0x16a
> [<ffffffff81063d98>] run_timer_softirq+0x156/0x1f8
> [<ffffffff813ba04f>] ? dev_watchdog+0x0/0x16a
> [<ffffffff8105d6f0>] __do_softirq+0xd7/0x19e
> [<ffffffff81013eac>] call_softirq+0x1c/0x30
> [<ffffffff8101564b>] do_softirq+0x46/0x87
> [<ffffffff8105d575>] irq_exit+0x3b/0x7a
> [<ffffffff8128dcfe>] xen_evtchn_do_upcall+0x38/0x46
> [<ffffffff81013efe>] xen_do_hypervisor_callback+0x1e/0x30
> <EOI> [<ffffffff8103f642>] ? pick_next_task_idle+0x18/0x22
> [<ffffffff810093aa>] ? hypercall_page+0x3aa/0x1000
> [<ffffffff810093aa>] ? hypercall_page+0x3aa/0x1000
> [<ffffffff8100f1bb>] ? xen_safe_halt+0x10/0x1a
> [<ffffffff81019e14>] ? default_idle+0x39/0x56
> [<ffffffff81011cd0>] ? cpu_idle+0x5d/0x8c
> [<ffffffff8143375d>] ? cpu_bringup_and_idle+0x13/0x15
> ---[ end trace 867bb8f6cd959b03 ]---
> bnx2: <--- start FTQ dump on peth0 --->
> bnx2: peth0: BNX2_RV2P_PFTQ_CTL 10000
> bnx2: peth0: BNX2_RV2P_TFTQ_CTL 20000
> bnx2: peth0: BNX2_RV2P_MFTQ_CTL 4000
> bnx2: peth0: BNX2_TBDR_FTQ_CTL 1004002
> bnx2: peth0: BNX2_TDMA_FTQ_CTL 4010002
> bnx2: peth0: BNX2_TXP_FTQ_CTL 2410002
> bnx2: peth0: BNX2_TPAT_FTQ_CTL 10002
> bnx2: peth0: BNX2_RXP_CFTQ_CTL 8000
> bnx2: peth0: BNX2_RXP_FTQ_CTL 100000
> bnx2: peth0: BNX2_COM_COMXQ_FTQ_CTL 10000
> bnx2: peth0: BNX2_COM_COMTQ_FTQ_CTL 20000
> bnx2: peth0: BNX2_COM_COMQ_FTQ_CTL 10000
> bnx2: peth0: BNX2_CP_CPQ_FTQ_CTL 4000
> bnx2: peth0: TXP mode b84c state 80005000 evt_mask 500 pc 8000d60 pc 8000d60 instr 8f860000
> bnx2: peth0: TPAT mode b84c state 80009000 evt_mask 500 pc 8000a5c pc 8000a5c instr 10400016
> bnx2: peth0: RXP mode b84c state 80001000 evt_mask 500 pc 8004c14 pc 8004c14 instr 10e00088
> bnx2: peth0: COM mode b8cc state 80000000 evt_mask 500 pc 8000b28 pc 8000a9c instr 8c530000
> bnx2: peth0: CP mode b8cc state 80000000 evt_mask 500 pc 8000c50 pc 8000c58 instr 8ca50020
> bnx2: <--- end FTQ dump on peth0 --->
> bnx2: peth0 DEBUG: intr_sem[0]
> bnx2: peth0 DEBUG: intr_sem[0] PCI_CMD[20100406]
> bnx2: peth0 DEBUG: PCI_PM[19002008] PCI_MISC_CFG[92000088]
> bnx2: peth0 DEBUG: EMAC_TX_STATUS[00000008] EMAC_RX_STATUS[00000000]
> bnx2: peth0 RPM_MGMT_PKT_CTRL[40000088]
> bnx2: peth0 DEBUG: MCP_STATE_P0[0007e10e] MCP_STATE_P1[0003e00e]
> bnx2: peth0 DEBUG: HC_STATS_INTERRUPT_STATUS[01ff0000]
> bnx2: peth0 DEBUG: PBA[00000000]
> BNX2_PCICFG_INT_ACK_CMD[00013ce1]
> bnx2: peth0: prevent chip reset during tx timeout
> last status idx 2426
> hw_tx_cons 32474, txr->hw_tx_conds 32474 txr->tx_prod 32641 txr->tx_cons 32474
> hw_rx_cons 19665, txr->hw_rx_conds 19665
> sblk->status_attn_bits 1
> sblk->status_attn_bits_ack 1
> bnx2_tx_avail 88
> sblk->status_tx_quick_consumer_index0 32474
> sblk->status_tx_quick_consumer_index1 0
> sblk->status_tx_quick_consumer_index2 0
> sblk->status_tx_quick_consumer_index3 0
> sblk->status_rx_quick_consumer_index0 19665
> sblk->status_rx_quick_consumer_index1 0
> sblk->status_rx_quick_consumer_index2 0
> sblk->status_rx_quick_consumer_index3 0
> sblk->status_rx_quick_consumer_index4 0
> sblk->status_rx_quick_consumer_index5 0
> sblk->status_rx_quick_consumer_index6 0
> sblk->status_rx_quick_consumer_index7 0
> sblk->status_rx_quick_consumer_index8 0
> sblk->status_rx_quick_consumer_index9 0
> sblk->status_rx_quick_consumer_index10 0
> sblk->status_rx_quick_consumer_index11 0
> sblk->status_rx_quick_consumer_index12 0
> sblk->status_rx_quick_consumer_index13 0
> sblk->status_rx_quick_consumer_index14 0
> sblk->status_rx_quick_consumer_index15 0
> sblk->status_completion_producer_index 0
> sblk->status_cmd_consumer_index 0
> sblk->status_idx 2426
> sblk->status_unused 0
> sblk->status_blk_num 0
> hw_cons 32474 sw_cons 32474 ffff8801d27f85c0 bnapi
> return hw_cons 32474 sw_cons 32474 ffff8801d27f85c0 bnapi
> hw_cons 3628 sw_cons 3625 ffff8801d27f8bc0 bnapi
> return hw_cons 3628 sw_cons 3625 ffff8801d27f8bc0 bnapi
> hw_cons 62094 sw_cons 62090 ffff8801d27f91c0 bnapi
> return hw_cons 62094 sw_cons 62090 ffff8801d27f91c0 bnapi
> hw_cons 3184 sw_cons 3173 ffff8801d27f97c0 bnapi
> return hw_cons 3184 sw_cons 3173 ffff8801d27f97c0 bnapi
> hw_cons 0 sw_cons 0 ffff8801d27f9dc0 bnapi
> return hw_cons 0 sw_cons 0 ffff8801d27f9dc0 bnapi 		 	   		  

^ permalink raw reply

* Re: libpcap and tc filters
From: Adam Katz @ 2011-07-04 12:01 UTC (permalink / raw)
  To: jhs; +Cc: netdev
In-Reply-To: <1309777908.26180.1.camel@mojatatu>

Ok, I just tried this:

I've opened www.example.com using a browser while capturing with
wireshark. TC placed all port 80 packets into the 1:1 as required. I
then physically plugged the nic to a different socket, one that isn't
connected to the internet (i did this so I wont get any server
responses to the packets i'm sending) and then I replayed the capture
of me opening www.example.com

The second time, none of the packets ended up in 1:1 - they all went
to the default class despite being the EXACT same traffic.

On Mon, Jul 4, 2011 at 2:11 PM, jamal <hadi@cyberus.ca> wrote:
>
> Capture tcpdump for both scenario that works and one
> that doesnt.
> Make sure the filters match your failing scenario.
>
> cheers,
> jamal
>
> On Mon, 2011-07-04 at 10:38 +0300, Adam Katz wrote:
>> Hi Everyone
>>
>> I'm sorry for littering the mailing list with this question, but no
>> other place could help me..
>>
>> I'm attempting to use tc to shape traffic sent using libpcap, I'm
>> doing this for a research project. I have a classful scheduler with
>> several classes, to this scheduler I attach a few filters based on
>> destination tcp ports.
>>
>> My problem is this: When sending packets using a normal userland
>> socket, the filters work and I see the appropriate traffic entering
>> the right class. BUT when sending packets with libpcap, all packets
>> end up in the scheduler's default band as if the filters simply refuse
>> to work.
>>
>> Can anyone suggest a solution?
>> --
>> To unsubscribe from this list: send the line "unsubscribe netdev" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>
>

^ permalink raw reply

* Re: [Bugme-new] [Bug 38102] New: BUG kmalloc-2048: Poison overwritten
From: Alexey Zaytsev @ 2011-07-04 11:48 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Andrew Morton, netdev, Gary Zambrano, bugme-daemon,
	David S. Miller, Pekka Pietikainen, Florian Schirmer,
	Felix Fietkau, Michael Buesch
In-Reply-To: <1309707971.2523.20.camel@edumazet-laptop>

On Sun, Jul 3, 2011 at 19:46, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Le dimanche 03 juillet 2011 à 01:25 +0400, Alexey Zaytsev a écrit :
>> On Fri, Jul 1, 2011 at 10:01, Alexey Zaytsev <alexey.zaytsev@gmail.com> wrote:
>> > On Thu, Jun 30, 2011 at 01:51, Andrew Morton <akpm@linux-foundation.org> wrote:
>> >>
>> >> (switched to email.  Please respond via emailed reply-to-all, not via the
>> >> bugzilla web interface).
>> >>
>> >> On Thu, 23 Jun 2011 17:33:54 GMT
>> >> bugzilla-daemon@bugzilla.kernel.org wrote:
>> >>
>> >>> https://bugzilla.kernel.org/show_bug.cgi?id=38102
>> >>>
>> >>>            Summary: BUG kmalloc-2048: Poison overwritten
>> >>>            Product: Drivers
>> >>>            Version: 2.5
>> >>>     Kernel Version: 3.0.0-rc4
>> >>
>> >> Looks like a 2.6.38->2.6.39 regression, perhaps a memory scribble in b44.
>> >
>> > Actually, not sure about the version. 39 was the first one I've been
>> > using in the scenario. Checking older versions now.
>> > And git-log does not show a lot of changes to the b44 driver, so it
>> > might be something unrelated.
>> >
>>
>> I've checked back as far as 2.6.27, and the problem is still there.
>> I've also looked through the allocation-related code, and it seemed
>> sane. I'm not sure I understand the 1GB dma workaround, but this path
>> is never hit in my case. So adding the driver authors to CC. This
>> could be something different, but I've been unable to reproduce using
>> an other machine with an rtl8139 nic.
>
> Hmm, looking at b44 code, I believe there is a race there.
>
> Could you try following patch ?
>

This might fix a potential problem, but unfortunately did not help here.

There is another place that looks suspicious to me:

			struct sk_buff *copy_skb;

			b44_recycle_rx(bp, cons, bp->rx_prod);
			copy_skb = netdev_alloc_skb(bp->dev, len + 2);
			if (copy_skb == NULL)
				goto drop_it_no_recycle;

			skb_reserve(copy_skb, 2);
			skb_put(copy_skb, len);
			/* DMA sync done above, copy just the actual packet */
			skb_copy_from_linear_data_offset(skb, RX_PKT_OFFSET,
							 copy_skb->data, len);
			skb = copy_skb;


The skb is reinserted into the ring before its data is copied, it
seems. But this can't be the cause of my problem, as it would lead to
data corruption at most, not a write-after-free.

And another question: why do we have the logic to work around the 1GB
DMA limit instead of just setting the dma mask?

^ permalink raw reply

* [PATCH] net: bind() fix error return on wrong address family
From: Marcus Meissner @ 2011-07-04 11:30 UTC (permalink / raw)
  To: davem, kuznet, pekkas, jmorris, yoshfuji, kaber, netdev,
	linux-kernel
  Cc: Marcus Meissner, Marcus Meissner, Reinhard Max

Hi,

Reinhard Max also pointed out that the error should be EAFNOSUPPORT according
to POSIX.

The Linux manpages have it as EINVAL, some other OSes (Minix, HPUX, perhaps BSD) use
EAFNOSUPPORT. Windows uses WSAEFAULT according to MSDN.

A brief look at the error values returned by other protocols' af bind() methods in
current mainline git shows:
	EAFNOSUPPORT: atm, appletalk, l2tp, llc, phonet, rxrpc
	EINVAL: ax25, bluetooth, decnet, econet, ieee802154, iucv, netlink, netrom, packet, rds, rose, unix, x25
	No check?: can/raw, ipv6/raw, irda, l2tp/l2tp_ip

Ciao, Marcus

Signed-off-by: Marcus Meissner <meissner@suse.de>
Cc: Reinhard Max <max@suse.de>
---
 net/ipv4/af_inet.c  |    4 +++-
 net/ipv6/af_inet6.c |    2 +-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index eae1f67..ef1528a 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -465,8 +465,10 @@ int inet_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
 	if (addr_len < sizeof(struct sockaddr_in))
 		goto out;
 
-	if (addr->sin_family != AF_INET)
+	if (addr->sin_family != AF_INET) {
+		err = -EAFNOSUPPORT;
 		goto out;
+	}
 
 	chk_addr_ret = inet_addr_type(sock_net(sk), addr->sin_addr.s_addr);
 
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index d450a2f..3b5669a 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -274,7 +274,7 @@ int inet6_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
 		return -EINVAL;
 
 	if (addr->sin6_family != AF_INET6)
-		return -EINVAL;
+		return -EAFNOSUPPORT;
 
 	addr_type = ipv6_addr_type(&addr->sin6_addr);
 	if ((addr_type & IPV6_ADDR_MULTICAST) && sock->type == SOCK_STREAM)
-- 
1.7.5.4
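
The user-visible effect can be checked from userspace with a small probe
like the one below. It is a sketch with a hypothetical helper name
(demo_bind_wrong_family): it binds an AF_INET socket using an address whose
family field is deliberately set to AF_INET6 and returns the resulting
errno. A kernel without this patch is expected to report EINVAL, one with it
EAFNOSUPPORT.

```c
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Returns the errno produced by bind(), 0 if bind() unexpectedly
 * succeeded, or -1 if the socket could not be created.
 */
static int demo_bind_wrong_family(void)
{
	struct sockaddr_in sa;
	int fd, err = 0;

	fd = socket(AF_INET, SOCK_STREAM, 0);
	if (fd < 0)
		return -1;

	memset(&sa, 0, sizeof(sa));
	sa.sin_family = AF_INET6;	/* deliberate mismatch with the socket */
	sa.sin_port = 0;		/* any port */
	sa.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

	if (bind(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0)
		err = errno;

	close(fd);
	return err;
}
```

Portable applications that want to handle this case should accept either
error code, since the value depends on the kernel in use.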

^ permalink raw reply related
