* [PATCH net-next] net: rename sk_clone to sk_clone_lock
From: Eric Dumazet @ 2011-11-08 22:00 UTC (permalink / raw)
To: David Miller; +Cc: netdev
Make clear that sk_clone() and inet_csk_clone() return a locked socket.
Add a _lock suffix and kerneldoc.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
include/net/inet_connection_sock.h | 6 +++---
include/net/sock.h | 4 ++--
net/core/sock.c | 11 +++++++++--
net/dccp/minisocks.c | 2 +-
net/ipv4/inet_connection_sock.c | 17 +++++++++++++----
net/ipv4/tcp_minisocks.c | 2 +-
6 files changed, 29 insertions(+), 13 deletions(-)
diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
index e6db62e..dbf9aab 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -143,9 +143,9 @@ static inline void *inet_csk_ca(const struct sock *sk)
return (void *)inet_csk(sk)->icsk_ca_priv;
}
-extern struct sock *inet_csk_clone(struct sock *sk,
- const struct request_sock *req,
- const gfp_t priority);
+extern struct sock *inet_csk_clone_lock(const struct sock *sk,
+ const struct request_sock *req,
+ const gfp_t priority);
enum inet_csk_ack_state_t {
ICSK_ACK_SCHED = 1,
diff --git a/include/net/sock.h b/include/net/sock.h
index abb6e0f..67cd458 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1089,8 +1089,8 @@ extern struct sock *sk_alloc(struct net *net, int family,
struct proto *prot);
extern void sk_free(struct sock *sk);
extern void sk_release_kernel(struct sock *sk);
-extern struct sock *sk_clone(const struct sock *sk,
- const gfp_t priority);
+extern struct sock *sk_clone_lock(const struct sock *sk,
+ const gfp_t priority);
extern struct sk_buff *sock_wmalloc(struct sock *sk,
unsigned long size, int force,
diff --git a/net/core/sock.c b/net/core/sock.c
index 4ed7b1d..e419061 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1204,7 +1204,14 @@ void sk_release_kernel(struct sock *sk)
}
EXPORT_SYMBOL(sk_release_kernel);
-struct sock *sk_clone(const struct sock *sk, const gfp_t priority)
+/**
+ * sk_clone_lock - clone a socket, and lock its clone
+ * @sk: the socket to clone
+ * @priority: for allocation (%GFP_KERNEL, %GFP_ATOMIC, etc)
+ *
+ * Caller must unlock socket even in error path (bh_unlock_sock(newsk))
+ */
+struct sock *sk_clone_lock(const struct sock *sk, const gfp_t priority)
{
struct sock *newsk;
@@ -1297,7 +1304,7 @@ struct sock *sk_clone(const struct sock *sk, const gfp_t priority)
out:
return newsk;
}
-EXPORT_SYMBOL_GPL(sk_clone);
+EXPORT_SYMBOL_GPL(sk_clone_lock);
void sk_setup_caps(struct sock *sk, struct dst_entry *dst)
{
diff --git a/net/dccp/minisocks.c b/net/dccp/minisocks.c
index d7041a0..563b7c7 100644
--- a/net/dccp/minisocks.c
+++ b/net/dccp/minisocks.c
@@ -100,7 +100,7 @@ struct sock *dccp_create_openreq_child(struct sock *sk,
* (* Generate a new socket and switch to that socket *)
* Set S := new socket for this port pair
*/
- struct sock *newsk = inet_csk_clone(sk, req, GFP_ATOMIC);
+ struct sock *newsk = inet_csk_clone_lock(sk, req, GFP_ATOMIC);
if (newsk != NULL) {
struct dccp_request_sock *dreq = dccp_rsk(req);
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index c14d88a..a598768 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -588,10 +588,19 @@ void inet_csk_reqsk_queue_prune(struct sock *parent,
}
EXPORT_SYMBOL_GPL(inet_csk_reqsk_queue_prune);
-struct sock *inet_csk_clone(struct sock *sk, const struct request_sock *req,
- const gfp_t priority)
+/**
+ * inet_csk_clone_lock - clone an inet socket, and lock its clone
+ * @sk: the socket to clone
+ * @req: request_sock
+ * @priority: for allocation (%GFP_KERNEL, %GFP_ATOMIC, etc)
+ *
+ * Caller must unlock socket even in error path (bh_unlock_sock(newsk))
+ */
+struct sock *inet_csk_clone_lock(const struct sock *sk,
+ const struct request_sock *req,
+ const gfp_t priority)
{
- struct sock *newsk = sk_clone(sk, priority);
+ struct sock *newsk = sk_clone_lock(sk, priority);
if (newsk != NULL) {
struct inet_connection_sock *newicsk = inet_csk(newsk);
@@ -615,7 +624,7 @@ struct sock *inet_csk_clone(struct sock *sk, const struct request_sock *req,
}
return newsk;
}
-EXPORT_SYMBOL_GPL(inet_csk_clone);
+EXPORT_SYMBOL_GPL(inet_csk_clone_lock);
/*
* At this point, there should be no process reference to this
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 66363b6..0a7e339 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -425,7 +425,7 @@ static inline void TCP_ECN_openreq_child(struct tcp_sock *tp,
*/
struct sock *tcp_create_openreq_child(struct sock *sk, struct request_sock *req, struct sk_buff *skb)
{
- struct sock *newsk = inet_csk_clone(sk, req, GFP_ATOMIC);
+ struct sock *newsk = inet_csk_clone_lock(sk, req, GFP_ATOMIC);
if (newsk != NULL) {
const struct inet_request_sock *ireq = inet_rsk(req);
^ permalink raw reply related
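The contract documented by the new kerneldoc (the clone comes back locked, and the caller must unlock it even on the error path) can be sketched in user space. This is only an illustrative model: struct toy_sock, toy_clone_lock(), toy_unlock(), toy_create_child() and the toy_unlock_calls counter are hypothetical names, and a plain bool stands in for the real socket lock.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Toy stand-in for struct sock; none of these names exist in the kernel. */
struct toy_sock {
	bool locked;
	int state;
};

static int toy_unlock_calls; /* counts unlocks so the contract can be checked */

/* Model of the sk_clone_lock() contract: the clone is returned locked. */
static struct toy_sock *toy_clone_lock(const struct toy_sock *sk)
{
	struct toy_sock *newsk = malloc(sizeof(*newsk));

	if (newsk == NULL)
		return NULL;
	*newsk = *sk;
	newsk->locked = true;
	return newsk;
}

static void toy_unlock(struct toy_sock *sk)
{
	sk->locked = false;
	toy_unlock_calls++;
}

/* Caller side: unlock on the error path too, before dropping the clone. */
static struct toy_sock *toy_create_child(const struct toy_sock *parent,
					 bool setup_fails)
{
	struct toy_sock *newsk = toy_clone_lock(parent);

	if (newsk == NULL)
		return NULL;
	if (setup_fails) {
		toy_unlock(newsk);
		free(newsk);
		return NULL;
	}
	return newsk; /* still locked; the caller unlocks when done */
}
```

With a _lock suffix in the name, a reader of toy_create_child() is reminded that both exits have to account for the lock, which is the point of the rename.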
* Re: [3.1] Divide by zero in __tcp_select_window()
From: Eric Dumazet @ 2011-11-08 21:23 UTC (permalink / raw)
To: Simon Kirby
Cc: Thomas Gleixner, Network Development, David Miller,
Peter Zijlstra, Linux Kernel Mailing List, Dave Jones,
Martin Schwidefsky, Ingo Molnar
In-Reply-To: <20111108205411.GA23642@hostway.ca>
On Tuesday, 08 November 2011 at 12:54 -0800, Simon Kirby wrote:
> Hello!
>
> We've seen a few more hang traces with tcp_keepalive_timer() since we
> fixed the socket unlocking, so I was thinking that I must have borked
> the patching, but we finally caught it on a box with a serial console,
> and it turned out to be a different problem (pasted below).
>
> This is on the same boxes as before (random web loads), running
> 3.1+Thomas and Eric's socket locking fixes, still with some anti-abuse
> scripts that use blackhole routes.
>
> By the way, Greg seems to have only pulled Thomas' patch into queue-3.1,
> not Eric's, so I think that may have been missed.
>
> The only divide in __tcp_select_window is the one that lines up the
> window to a multiple of the mss:
>
> /* Get the largest window that is a nice multiple of mss.
> * Window clamp already applied above.
> * If our current window offering is within 1 mss of the
> * free space we just keep it. This prevents the divide
> * and multiply from happening most of the time.
> * We also don't do any window rounding when the free space
> * is too small.
> */
> if (window <= free_space - mss || window > free_space)
> window = (free_space / mss) * mss;
> else if (mss == full_space &&
> free_space > window + (full_space >> 1))
> window = free_space;
>
> None of that code has changed for years, so I suspect something else has
> changed to cause mss to be zero by the time it got here.
>
> Simon-
>
> divide error: 0000 [#1] SMP
> CPU 2
> Modules linked in: ipmi_devintf ipmi_si ipmi_msghandler xt_recent nf_conntrack_ftp xt_state xt_owner nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 bnx2
>
> Pid: 25125, comm: php4 Not tainted 3.1.0-hw-lockdep+ #57 Dell Inc. PowerEdge 1950/0UR033
> RIP: 0010:[<ffffffff81660199>] [<ffffffff81660199>] __tcp_select_window+0xe9/0x130
> RSP: 0000:ffff88022fc83cd0 EFLAGS: 00010246
> RAX: 0000000000003908 RBX: ffff88005032eccc RCX: 000000000000ffff
> RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000003908
> RBP: ffff88022fc83cd0 R08: 0000000000003908 R09: 0000000000007fff
> R10: 0000000000000014 R11: 0000000000000000 R12: ffff88006a1f9200
> R13: ffff880101757f00 R14: 0000000000003908 R15: 0000000000000014
> FS: 00007f7634433720(0000) GS:ffff88022fc80000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 000000000478c0a8 CR3: 000000013cbd0000 CR4: 00000000000006e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process php4 (pid: 25125, threadinfo ffff88016249a000, task ffff88003b243e40)
> Stack:
> ffff88022fc83d50 ffffffff816606e9 ffffffff81660de2 ffff880101757f28
> 0000000000000100 0000000000000100 0000000000000000 0000000000000000
> 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> Call Trace:
> <IRQ>
> [<ffffffff816606e9>] tcp_transmit_skb+0x209/0x8e0
> [<ffffffff81660de2>] ? tcp_xmit_probe_skb+0x22/0xe0
> [<ffffffff81660e8e>] tcp_xmit_probe_skb+0xce/0xe0
> [<ffffffff8166252e>] tcp_write_wakeup+0x6e/0x180
> [<ffffffff81664897>] tcp_keepalive_timer+0x247/0x270
> [<ffffffff8106cd8d>] run_timer_softirq+0x26d/0x410
> [<ffffffff8106ccb8>] ? run_timer_softirq+0x198/0x410
> [<ffffffff81664650>] ? tcp_init_xmit_timers+0x20/0x20
> [<ffffffff810640e8>] __do_softirq+0x138/0x250
> [<ffffffff817030fc>] call_softirq+0x1c/0x30
> [<ffffffff810153c5>] do_softirq+0x95/0xd0
> [<ffffffff81063cbd>] irq_exit+0xdd/0x110
> [<ffffffff810310c9>] smp_apic_timer_interrupt+0x69/0xa0
> [<ffffffff81701973>] apic_timer_interrupt+0x73/0x80
> <EOI>
> [<ffffffff81700eca>] ? sysret_check+0x2e/0x69
> Code: 89 d0 d3 e0 41 39 c0 74 56 8d 7a 01 d3 e7 89 f8 c9 c3 89 c7 44 89 c0 29 f0 39 f8 7d 05 44 39 c7 7e 30 44 89 c2 44 89 c0 c1 fa 1f <f7> fe 89 c7 0f af fe 89 f8 eb da 0f 1f 40 00 f7 d9 41 89 d0 89
> RIP [<ffffffff81660199>] __tcp_select_window+0xe9/0x130
> RSP <ffff88022fc83cd0>
> divide error: 0000 [#2]
> ---[ end trace b3f46c09a69a2efe ]---
> Kernel panic - not syncing: Fatal exception in interrupt
> Pid: 25125, comm: php4 Tainted: G D 3.1.0-hw-lockdep+ #57
> Call Trace:
> <IRQ> [<ffffffff816f4364>] panic+0xca/0x210
> [<ffffffff8105cd24>] ? kmsg_dump+0x104/0x160
> [<ffffffff8105cd3c>] ? kmsg_dump+0x11c/0x160
> [<ffffffff8105cc9c>] ? kmsg_dump+0x7c/0x160
> [<ffffffff816fa25c>] oops_end+0xdc/0xf0
> [<ffffffff810166d6>] die+0x56/0x90
> [<ffffffff816f9b8e>] do_trap+0x14e/0x170
> [<ffffffff8101474a>] do_divide_error+0x8a/0xb0
> [<ffffffff81660199>] ? __tcp_select_window+0xe9/0x130
> [<ffffffff81613a53>] ? __netif_receive_skb+0x383/0x560
> [<ffffffff81613835>] ? __netif_receive_skb+0x165/0x560
> [<ffffffff813a614d>] ? trace_hardirqs_off_thunk+0x3a/0x3c
> [<ffffffff816f91a4>] ? restore_args+0x30/0x30
> [<ffffffff81702e1b>] divide_error+0x1b/0x20
> [<ffffffff81660199>] ? __tcp_select_window+0xe9/0x130
> [<ffffffff816606e9>] tcp_transmit_skb+0x209/0x8e0
> [<ffffffff81660de2>] ? tcp_xmit_probe_skb+0x22/0xe0
> [<ffffffff81660e8e>] tcp_xmit_probe_skb+0xce/0xe0
> [<ffffffff8166252e>] tcp_write_wakeup+0x6e/0x180
> [<ffffffff81664897>] tcp_keepalive_timer+0x247/0x270
> [<ffffffff8106cd8d>] run_timer_softirq+0x26d/0x410
> [<ffffffff8106ccb8>] ? run_timer_softirq+0x198/0x410
> [<ffffffff81664650>] ? tcp_init_xmit_timers+0x20/0x20
> [<ffffffff810640e8>] __do_softirq+0x138/0x250
> [<ffffffff817030fc>] call_softirq+0x1c/0x30
> [<ffffffff810153c5>] do_softirq+0x95/0xd0
> [<ffffffff81063cbd>] irq_exit+0xdd/0x110
> [<ffffffff810310c9>] smp_apic_timer_interrupt+0x69/0xa0
> [<ffffffff81701973>] apic_timer_interrupt+0x73/0x80
> <EOI> [<ffffffff81700eca>] ? sysret_check+0x2e/0x69
> SMP
> CPU 3
> Modules linked in: ipmi_devintf ipmi_si ipmi_msghandler xt_recent nf_conntrack_ftp xt_state xt_owner nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 bnx2
>
> Pid: 0, comm: kworker/0:1 Tainted: G D 3.1.0-hw-lockdep+ #57 Dell Inc. PowerEdge 1950/0UR033
> RIP: 0010:[<ffffffff81660199>] [<ffffffff81660199>] __tcp_select_window+0xe9/0x130
> RSP: 0018:ffff88022fcc3cd0 EFLAGS: 00010246
> RAX: 0000000000003908 RBX: ffff880010ce80cc RCX: 000000000000ffff
> RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000003908
> RBP: ffff88022fcc3cd0 R08: 0000000000003908 R09: 0000000000007fff
> R10: 0000000000000014 R11: 0000000000000000 R12: ffff88006636a400
> R13: ffff88000afb0300 R14: 0000000000003908 R15: 0000000000000014
> FS: 0000000000000000(0000) GS:ffff88022fcc0000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: 00007fc220c14000 CR3: 000000013cbd0000 CR4: 00000000000006e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process kworker/0:1 (pid: 0, threadinfo ffff880226980000, task ffff880226959f20)
> Stack:
> ffff88022fcc3d50 ffffffff816606e9 ffffffff81660de2 ffff88000afb0328
> 0000000000000100 0000000000000100 0000000000000000 0000000000000000
> 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> Call Trace:
> <IRQ>
> [<ffffffff816606e9>] tcp_transmit_skb+0x209/0x8e0
> [<ffffffff81660de2>] ? tcp_xmit_probe_skb+0x22/0xe0
> [<ffffffff81660e8e>] tcp_xmit_probe_skb+0xce/0xe0
> [<ffffffff8166252e>] tcp_write_wakeup+0x6e/0x180
> [<ffffffff81664897>] tcp_keepalive_timer+0x247/0x270
> [<ffffffff8106cd8d>] run_timer_softirq+0x26d/0x410
> [<ffffffff8106ccb8>] ? run_timer_softirq+0x198/0x410
> [<ffffffff81664650>] ? tcp_init_xmit_timers+0x20/0x20
> [<ffffffff810640e8>] __do_softirq+0x138/0x250
> [<ffffffff817030fc>] call_softirq+0x1c/0x30
> [<ffffffff810153c5>] do_softirq+0x95/0xd0
> [<ffffffff81063cbd>] irq_exit+0xdd/0x110
> [<ffffffff810310c9>] smp_apic_timer_interrupt+0x69/0xa0
> [<ffffffff81701973>] apic_timer_interrupt+0x73/0x80
> <EOI>
> [<ffffffff8101b80e>] ? mwait_idle+0x14e/0x170
> [<ffffffff8101b805>] ? mwait_idle+0x145/0x170
> [<ffffffff81013156>] cpu_idle+0x96/0xf0
> [<ffffffff816ef30b>] start_secondary+0x1ca/0x1ff
> Code: 89 d0 d3 e0 41 39 c0 74 56 8d 7a 01 d3 e7 89 f8 c9 c3 89 c7 44 89 c0 29 f0 39 f8 7d 05 44 39 c7 7e 30 44 89 c2 44 89 c0 c1 fa 1f <f7> fe 89 c7 0f af fe 89 f8 eb da 0f 1f 40 00 f7 d9 41 89 d0 89
> RIP [<ffffffff81660199>] __tcp_select_window+0xe9/0x130
> RSP <ffff88022fcc3cd0>
> divide error: 0000 [#3] SMP
> CPU 1
> Modules linked in: ipmi_devintf ipmi_si ipmi_msghandler xt_recent nf_conntrack_ftp xt_state xt_owner nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 bnx2
>
> Pid: 25118, comm: search.pl Tainted: G D 3.1.0-hw-lockdep+ #57 Dell Inc. PowerEdge 1950/0UR033
> RIP: 0010:[<ffffffff81660199>] [<ffffffff81660199>] __tcp_select_window+0xe9/0x130
> RSP: 0018:ffff88022fc43cd0 EFLAGS: 00010246
> RAX: 0000000000003908 RBX: ffff88003d808ccc RCX: 000000000000ffff
> RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000003908
> RBP: ffff88022fc43cd0 R08: 0000000000003908 R09: 0000000000007fff
> R10: 0000000000000014 R11: 0000000000000000 R12: ffff88001966da00
> R13: ffff8800a4665500 R14: 0000000000003908 R15: 0000000000000014
> FS: 00007f0872f48700(0000) GS:ffff88022fc40000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000003f0f470 CR3: 000000015e1ad000 CR4: 00000000000006e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process search.pl (pid: 25118, threadinfo ffff880109992000, task ffff880214f40000)
> Stack:
> ffff88022fc43d50 ffffffff816606e9 ffffffff81660de2 ffff8800a4665528
> 0000000000000100 0000000000000100 0000000000000000 0000000000000000
> 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> Call Trace:
> <IRQ>
> [<ffffffff816606e9>] tcp_transmit_skb+0x209/0x8e0
> [<ffffffff81660de2>] ? tcp_xmit_probe_skb+0x22/0xe0
> [<ffffffff81660e8e>] tcp_xmit_probe_skb+0xce/0xe0
> [<ffffffff8166252e>] tcp_write_wakeup+0x6e/0x180
> [<ffffffff81664897>] tcp_keepalive_timer+0x247/0x270
> [<ffffffff8106cd8d>] run_timer_softirq+0x26d/0x410
> [<ffffffff8106ccb8>] ? run_timer_softirq+0x198/0x410
> [<ffffffff81664650>] ? tcp_init_xmit_timers+0x20/0x20
> [<ffffffff810640e8>] __do_softirq+0x138/0x250
> [<ffffffff817030fc>] call_softirq+0x1c/0x30
> [<ffffffff810153c5>] do_softirq+0x95/0xd0
> [<ffffffff81063cbd>] irq_exit+0xdd/0x110
> [<ffffffff810310c9>] smp_apic_timer_interrupt+0x69/0xa0
> [<ffffffff81701973>] apic_timer_interrupt+0x73/0x80
> <EOI>
> [<ffffffff8109a802>] ? lock_acquire+0x122/0x140
> [<ffffffff8122bc80>] ? nfs_handle_cb_pathdown+0x20/0x20
> [<ffffffff8122bcc7>] nfs_have_delegation+0x47/0xb0
> [<ffffffff8122bc80>] ? nfs_handle_cb_pathdown+0x20/0x20
> [<ffffffff81205806>] nfs_attribute_cache_expired+0x16/0x70
> [<ffffffff81096c3d>] ? trace_hardirqs_on_caller+0x13d/0x1c0
> [<ffffffff81206e7f>] nfs_revalidate_mapping+0x2f/0x130
> [<ffffffff81204d1f>] nfs_file_read+0x7f/0x120
> [<ffffffff810529dc>] ? finish_task_switch+0x8c/0x100
> [<ffffffff8112aa91>] do_sync_read+0xd1/0x120
> [<ffffffff816f50f2>] ? __schedule+0x5c2/0xa20
> [<ffffffff8112b7b8>] vfs_read+0xc8/0x180
> [<ffffffff8112b960>] sys_read+0x50/0x90
> [<ffffffff81700e92>] system_call_fastpath+0x16/0x1b
> Code: 89 d0 d3 e0 41 39 c0 74 56 8d 7a 01 d3 e7 89 f8 c9 c3 89 c7 44 89 c0 29 f0 39 f8 7d 05 44 39 c7 7e 30 44 89 c2 44 89 c0 c1 fa 1f <f7> fe 89 c7 0f af fe 89 f8 eb da 0f 1f 40 00 f7 d9 41 89 d0 89
> RIP [<ffffffff81660199>] __tcp_select_window+0xe9/0x130
> RSP <ffff88022fc43cd0>
> divide error: 0000 [#4] SMP
> CPU 0
> Modules linked in: ipmi_devintf ipmi_si ipmi_msghandler xt_recent nf_conntrack_ftp xt_state xt_owner nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 bnx2
>
> Pid: 25177, comm: php Tainted: G D 3.1.0-hw-lockdep+ #57 Dell Inc. PowerEdge 1950/0UR033
> RIP: 0010:[<ffffffff81660199>] [<ffffffff81660199>] __tcp_select_window+0xe9/0x130
> RSP: 0000:ffff88022fc03cd0 EFLAGS: 00010246
> RAX: 0000000000003908 RBX: ffff88001519b0cc RCX: 000000000000ffff
> RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000003908
> RBP: ffff88022fc03cd0 R08: 0000000000003908 R09: 0000000000007fff
> R10: 0000000000000014 R11: 0000000000000000 R12: ffff88000263a400
> R13: ffff88016545f300 R14: 0000000000003908 R15: 0000000000000014
> FS: 00007ffbdac9b720(0000) GS:ffff88022fc00000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000003f97000 CR3: 0000000104c4c000 CR4: 00000000000006f0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process php (pid: 25177, threadinfo ffff880214e7c000, task ffff880116940000)
> Stack:
> ffff88022fc03d50 ffffffff816606e9 ffffffff81660de2 ffff88016545f328
> 0000000000000100 0000000000000100 0000000000000000 0000000000000000
> 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> Call Trace:
> <IRQ>
> [<ffffffff816606e9>] tcp_transmit_skb+0x209/0x8e0
> [<ffffffff81660de2>] ? tcp_xmit_probe_skb+0x22/0xe0
> [<ffffffff81660e8e>] tcp_xmit_probe_skb+0xce/0xe0
> [<ffffffff8166252e>] tcp_write_wakeup+0x6e/0x180
> [<ffffffff81664897>] tcp_keepalive_timer+0x247/0x270
> [<ffffffff8106cd8d>] run_timer_softirq+0x26d/0x410
> [<ffffffff8106ccb8>] ? run_timer_softirq+0x198/0x410
> [<ffffffff81664650>] ? tcp_init_xmit_timers+0x20/0x20
> [<ffffffff810640e8>] __do_softirq+0x138/0x250
> [<ffffffff817030fc>] call_softirq+0x1c/0x30
> [<ffffffff810153c5>] do_softirq+0x95/0xd0
> [<ffffffff81063cbd>] irq_exit+0xdd/0x110
> [<ffffffff810310c9>] smp_apic_timer_interrupt+0x69/0xa0
> [<ffffffff81701973>] apic_timer_interrupt+0x73/0x80
> <EOI>
> [<ffffffff81700eca>] ? sysret_check+0x2e/0x69
> Code: 89 d0 d3 e0 41 39 c0 74 56 8d 7a 01 d3 e7 89 f8 c9 c3 89 c7 44 89 c0 29 f0 39 f8 7d 05 44 39 c7 7e 30 44 89 c2 44 89 c0 c1 fa 1f <f7> fe 89 c7 0f af fe 89 f8 eb da 0f 1f 40 00 f7 d9 41 89 d0 89
> RIP [<ffffffff81660199>] __tcp_select_window+0xe9/0x130
> RSP <ffff88022fc03cd0>
>
OK, it seems we leave a timer running while we free the socket (the same
error path as in your previous bug report, because of the NULL route).
We arm this keepalive timer in tcp_create_openreq_child()
net/ipv4/tcp_minisocks.c:513
if (sock_flag(newsk, SOCK_KEEPOPEN))
inet_csk_reset_keepalive_timer(newsk,
keepalive_time_when(newtp));
I would try adding a call to tcp_clear_xmit_timers() as well.
Please try the following patch:
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index a744315..a9db4b1 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1510,6 +1510,7 @@ exit:
NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENDROPS);
return NULL;
put_and_exit:
+ tcp_clear_xmit_timers(newsk);
bh_unlock_sock(newsk);
sock_put(newsk);
goto exit;
^ permalink raw reply related
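The ordering proposed in the patch above (clear the timers, then unlock and drop the socket) can be illustrated with a small user-space model. This is a sketch under stated assumptions, not kernel code: struct toy_sock and all toy_* helpers are hypothetical, and a "released" flag stands in for the socket actually being freed so that the bad case can be observed safely.

```c
#include <stdbool.h>

/* Toy stand-in for a child socket; all names here are hypothetical. */
struct toy_sock {
	bool keepalive_armed; /* set when the keepalive timer is armed */
	bool released;        /* models the socket having been freed */
};

static void toy_arm_keepalive(struct toy_sock *sk)
{
	sk->keepalive_armed = true;
}

static void toy_clear_timers(struct toy_sock *sk)
{
	sk->keepalive_armed = false;
}

/* Error path: optionally clear timers before dropping the socket. */
static void toy_put_and_exit(struct toy_sock *sk, bool clear_first)
{
	if (clear_first)
		toy_clear_timers(sk);
	sk->released = true;
}

/*
 * Timer tick. Returns 0 if no timer is pending, 1 if the timer ran on a
 * live socket, and -1 if it ran on a socket that was already released.
 */
static int toy_run_timers(const struct toy_sock *sk)
{
	if (!sk->keepalive_armed)
		return 0;
	return sk->released ? -1 : 1;
}
```

In this model, calling toy_put_and_exit() without clearing first leaves toy_run_timers() returning -1, the analogue of the keepalive timer firing on a socket that is already gone.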
* [3.1] Divide by zero in __tcp_select_window()
From: Simon Kirby @ 2011-11-08 20:54 UTC (permalink / raw)
To: Eric Dumazet, Thomas Gleixner, Network Development
Cc: David Miller, Peter Zijlstra, Linux Kernel Mailing List,
Dave Jones, Martin Schwidefsky, Ingo Molnar
Hello!
We've seen a few more hang traces with tcp_keepalive_timer() since we
fixed the socket unlocking, so I was thinking that I must have borked
the patching, but we finally caught it on a box with a serial console,
and it turned out to be a different problem (pasted below).
This is on the same boxes as before (random web loads), running
3.1+Thomas and Eric's socket locking fixes, still with some anti-abuse
scripts that use blackhole routes.
By the way, Greg seems to have only pulled Thomas' patch into queue-3.1,
not Eric's, so I think that may have been missed.
The only divide in __tcp_select_window is the one that lines up the
window to a multiple of the mss:
/* Get the largest window that is a nice multiple of mss.
* Window clamp already applied above.
* If our current window offering is within 1 mss of the
* free space we just keep it. This prevents the divide
* and multiply from happening most of the time.
* We also don't do any window rounding when the free space
* is too small.
*/
if (window <= free_space - mss || window > free_space)
window = (free_space / mss) * mss;
else if (mss == full_space &&
free_space > window + (full_space >> 1))
window = free_space;
None of that code has changed for years, so I suspect something else has
changed to cause mss to be zero by the time it got here.
Simon-
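For reference, the rounding step quoted above can be reproduced in user space. The sketch below copies the two branches from the quoted snippet into a hypothetical helper, round_window_to_mss(), and adds a guard that returns -1 when mss is not positive. That guard is an assumption made for illustration only; the quoted kernel code has no such check at this point. In the register dump that follows, the faulting bytes `<f7> fe` appear to decode to `idiv %esi` with RSI = 0 and RAX = 0x3908 (14600), which would be consistent with free_space / mss and mss == 0.

```c
/*
 * Hypothetical user-space copy of the window rounding quoted above.
 * Returns the rounded window, or -1 if mss is not positive (the
 * division would otherwise trap, as in the oops).
 */
static int round_window_to_mss(int window, int free_space,
			       int full_space, int mss)
{
	if (mss <= 0)
		return -1; /* illustrative guard, not present in the kernel snippet */

	if (window <= free_space - mss || window > free_space)
		window = (free_space / mss) * mss;
	else if (mss == full_space &&
		 free_space > window + (full_space >> 1))
		window = free_space;

	return window;
}
```

For example, round_window_to_mss(0, 14600, 65535, 1460) takes the divide branch and yields 14600, while the same call with mss set to 0 returns -1 instead of faulting.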
divide error: 0000 [#1] SMP
CPU 2
Modules linked in: ipmi_devintf ipmi_si ipmi_msghandler xt_recent nf_conntrack_ftp xt_state xt_owner nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 bnx2
Pid: 25125, comm: php4 Not tainted 3.1.0-hw-lockdep+ #57 Dell Inc. PowerEdge 1950/0UR033
RIP: 0010:[<ffffffff81660199>] [<ffffffff81660199>] __tcp_select_window+0xe9/0x130
RSP: 0000:ffff88022fc83cd0 EFLAGS: 00010246
RAX: 0000000000003908 RBX: ffff88005032eccc RCX: 000000000000ffff
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000003908
RBP: ffff88022fc83cd0 R08: 0000000000003908 R09: 0000000000007fff
R10: 0000000000000014 R11: 0000000000000000 R12: ffff88006a1f9200
R13: ffff880101757f00 R14: 0000000000003908 R15: 0000000000000014
FS: 00007f7634433720(0000) GS:ffff88022fc80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000000478c0a8 CR3: 000000013cbd0000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process php4 (pid: 25125, threadinfo ffff88016249a000, task ffff88003b243e40)
Stack:
ffff88022fc83d50 ffffffff816606e9 ffffffff81660de2 ffff880101757f28
0000000000000100 0000000000000100 0000000000000000 0000000000000000
0000000000000000 0000000000000000 0000000000000000 0000000000000000
Call Trace:
<IRQ>
[<ffffffff816606e9>] tcp_transmit_skb+0x209/0x8e0
[<ffffffff81660de2>] ? tcp_xmit_probe_skb+0x22/0xe0
[<ffffffff81660e8e>] tcp_xmit_probe_skb+0xce/0xe0
[<ffffffff8166252e>] tcp_write_wakeup+0x6e/0x180
[<ffffffff81664897>] tcp_keepalive_timer+0x247/0x270
[<ffffffff8106cd8d>] run_timer_softirq+0x26d/0x410
[<ffffffff8106ccb8>] ? run_timer_softirq+0x198/0x410
[<ffffffff81664650>] ? tcp_init_xmit_timers+0x20/0x20
[<ffffffff810640e8>] __do_softirq+0x138/0x250
[<ffffffff817030fc>] call_softirq+0x1c/0x30
[<ffffffff810153c5>] do_softirq+0x95/0xd0
[<ffffffff81063cbd>] irq_exit+0xdd/0x110
[<ffffffff810310c9>] smp_apic_timer_interrupt+0x69/0xa0
[<ffffffff81701973>] apic_timer_interrupt+0x73/0x80
<EOI>
[<ffffffff81700eca>] ? sysret_check+0x2e/0x69
Code: 89 d0 d3 e0 41 39 c0 74 56 8d 7a 01 d3 e7 89 f8 c9 c3 89 c7 44 89 c0 29 f0 39 f8 7d 05 44 39 c7 7e 30 44 89 c2 44 89 c0 c1 fa 1f <f7> fe 89 c7 0f af fe 89 f8 eb da 0f 1f 40 00 f7 d9 41 89 d0 89
RIP [<ffffffff81660199>] __tcp_select_window+0xe9/0x130
RSP <ffff88022fc83cd0>
divide error: 0000 [#2]
---[ end trace b3f46c09a69a2efe ]---
Kernel panic - not syncing: Fatal exception in interrupt
Pid: 25125, comm: php4 Tainted: G D 3.1.0-hw-lockdep+ #57
Call Trace:
<IRQ> [<ffffffff816f4364>] panic+0xca/0x210
[<ffffffff8105cd24>] ? kmsg_dump+0x104/0x160
[<ffffffff8105cd3c>] ? kmsg_dump+0x11c/0x160
[<ffffffff8105cc9c>] ? kmsg_dump+0x7c/0x160
[<ffffffff816fa25c>] oops_end+0xdc/0xf0
[<ffffffff810166d6>] die+0x56/0x90
[<ffffffff816f9b8e>] do_trap+0x14e/0x170
[<ffffffff8101474a>] do_divide_error+0x8a/0xb0
[<ffffffff81660199>] ? __tcp_select_window+0xe9/0x130
[<ffffffff81613a53>] ? __netif_receive_skb+0x383/0x560
[<ffffffff81613835>] ? __netif_receive_skb+0x165/0x560
[<ffffffff813a614d>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[<ffffffff816f91a4>] ? restore_args+0x30/0x30
[<ffffffff81702e1b>] divide_error+0x1b/0x20
[<ffffffff81660199>] ? __tcp_select_window+0xe9/0x130
[<ffffffff816606e9>] tcp_transmit_skb+0x209/0x8e0
[<ffffffff81660de2>] ? tcp_xmit_probe_skb+0x22/0xe0
[<ffffffff81660e8e>] tcp_xmit_probe_skb+0xce/0xe0
[<ffffffff8166252e>] tcp_write_wakeup+0x6e/0x180
[<ffffffff81664897>] tcp_keepalive_timer+0x247/0x270
[<ffffffff8106cd8d>] run_timer_softirq+0x26d/0x410
[<ffffffff8106ccb8>] ? run_timer_softirq+0x198/0x410
[<ffffffff81664650>] ? tcp_init_xmit_timers+0x20/0x20
[<ffffffff810640e8>] __do_softirq+0x138/0x250
[<ffffffff817030fc>] call_softirq+0x1c/0x30
[<ffffffff810153c5>] do_softirq+0x95/0xd0
[<ffffffff81063cbd>] irq_exit+0xdd/0x110
[<ffffffff810310c9>] smp_apic_timer_interrupt+0x69/0xa0
[<ffffffff81701973>] apic_timer_interrupt+0x73/0x80
<EOI> [<ffffffff81700eca>] ? sysret_check+0x2e/0x69
SMP
CPU 3
Modules linked in: ipmi_devintf ipmi_si ipmi_msghandler xt_recent nf_conntrack_ftp xt_state xt_owner nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 bnx2
Pid: 0, comm: kworker/0:1 Tainted: G D 3.1.0-hw-lockdep+ #57 Dell Inc. PowerEdge 1950/0UR033
RIP: 0010:[<ffffffff81660199>] [<ffffffff81660199>] __tcp_select_window+0xe9/0x130
RSP: 0018:ffff88022fcc3cd0 EFLAGS: 00010246
RAX: 0000000000003908 RBX: ffff880010ce80cc RCX: 000000000000ffff
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000003908
RBP: ffff88022fcc3cd0 R08: 0000000000003908 R09: 0000000000007fff
R10: 0000000000000014 R11: 0000000000000000 R12: ffff88006636a400
R13: ffff88000afb0300 R14: 0000000000003908 R15: 0000000000000014
FS: 0000000000000000(0000) GS:ffff88022fcc0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007fc220c14000 CR3: 000000013cbd0000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process kworker/0:1 (pid: 0, threadinfo ffff880226980000, task ffff880226959f20)
Stack:
ffff88022fcc3d50 ffffffff816606e9 ffffffff81660de2 ffff88000afb0328
0000000000000100 0000000000000100 0000000000000000 0000000000000000
0000000000000000 0000000000000000 0000000000000000 0000000000000000
Call Trace:
<IRQ>
[<ffffffff816606e9>] tcp_transmit_skb+0x209/0x8e0
[<ffffffff81660de2>] ? tcp_xmit_probe_skb+0x22/0xe0
[<ffffffff81660e8e>] tcp_xmit_probe_skb+0xce/0xe0
[<ffffffff8166252e>] tcp_write_wakeup+0x6e/0x180
[<ffffffff81664897>] tcp_keepalive_timer+0x247/0x270
[<ffffffff8106cd8d>] run_timer_softirq+0x26d/0x410
[<ffffffff8106ccb8>] ? run_timer_softirq+0x198/0x410
[<ffffffff81664650>] ? tcp_init_xmit_timers+0x20/0x20
[<ffffffff810640e8>] __do_softirq+0x138/0x250
[<ffffffff817030fc>] call_softirq+0x1c/0x30
[<ffffffff810153c5>] do_softirq+0x95/0xd0
[<ffffffff81063cbd>] irq_exit+0xdd/0x110
[<ffffffff810310c9>] smp_apic_timer_interrupt+0x69/0xa0
[<ffffffff81701973>] apic_timer_interrupt+0x73/0x80
<EOI>
[<ffffffff8101b80e>] ? mwait_idle+0x14e/0x170
[<ffffffff8101b805>] ? mwait_idle+0x145/0x170
[<ffffffff81013156>] cpu_idle+0x96/0xf0
[<ffffffff816ef30b>] start_secondary+0x1ca/0x1ff
Code: 89 d0 d3 e0 41 39 c0 74 56 8d 7a 01 d3 e7 89 f8 c9 c3 89 c7 44 89 c0 29 f0 39 f8 7d 05 44 39 c7 7e 30 44 89 c2 44 89 c0 c1 fa 1f <f7> fe 89 c7 0f af fe 89 f8 eb da 0f 1f 40 00 f7 d9 41 89 d0 89
RIP [<ffffffff81660199>] __tcp_select_window+0xe9/0x130
RSP <ffff88022fcc3cd0>
divide error: 0000 [#3] SMP
CPU 1
Modules linked in: ipmi_devintf ipmi_si ipmi_msghandler xt_recent nf_conntrack_ftp xt_state xt_owner nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 bnx2
Pid: 25118, comm: search.pl Tainted: G D 3.1.0-hw-lockdep+ #57 Dell Inc. PowerEdge 1950/0UR033
RIP: 0010:[<ffffffff81660199>] [<ffffffff81660199>] __tcp_select_window+0xe9/0x130
RSP: 0018:ffff88022fc43cd0 EFLAGS: 00010246
RAX: 0000000000003908 RBX: ffff88003d808ccc RCX: 000000000000ffff
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000003908
RBP: ffff88022fc43cd0 R08: 0000000000003908 R09: 0000000000007fff
R10: 0000000000000014 R11: 0000000000000000 R12: ffff88001966da00
R13: ffff8800a4665500 R14: 0000000000003908 R15: 0000000000000014
FS: 00007f0872f48700(0000) GS:ffff88022fc40000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000003f0f470 CR3: 000000015e1ad000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process search.pl (pid: 25118, threadinfo ffff880109992000, task ffff880214f40000)
Stack:
ffff88022fc43d50 ffffffff816606e9 ffffffff81660de2 ffff8800a4665528
0000000000000100 0000000000000100 0000000000000000 0000000000000000
0000000000000000 0000000000000000 0000000000000000 0000000000000000
Call Trace:
<IRQ>
[<ffffffff816606e9>] tcp_transmit_skb+0x209/0x8e0
[<ffffffff81660de2>] ? tcp_xmit_probe_skb+0x22/0xe0
[<ffffffff81660e8e>] tcp_xmit_probe_skb+0xce/0xe0
[<ffffffff8166252e>] tcp_write_wakeup+0x6e/0x180
[<ffffffff81664897>] tcp_keepalive_timer+0x247/0x270
[<ffffffff8106cd8d>] run_timer_softirq+0x26d/0x410
[<ffffffff8106ccb8>] ? run_timer_softirq+0x198/0x410
[<ffffffff81664650>] ? tcp_init_xmit_timers+0x20/0x20
[<ffffffff810640e8>] __do_softirq+0x138/0x250
[<ffffffff817030fc>] call_softirq+0x1c/0x30
[<ffffffff810153c5>] do_softirq+0x95/0xd0
[<ffffffff81063cbd>] irq_exit+0xdd/0x110
[<ffffffff810310c9>] smp_apic_timer_interrupt+0x69/0xa0
[<ffffffff81701973>] apic_timer_interrupt+0x73/0x80
<EOI>
[<ffffffff8109a802>] ? lock_acquire+0x122/0x140
[<ffffffff8122bc80>] ? nfs_handle_cb_pathdown+0x20/0x20
[<ffffffff8122bcc7>] nfs_have_delegation+0x47/0xb0
[<ffffffff8122bc80>] ? nfs_handle_cb_pathdown+0x20/0x20
[<ffffffff81205806>] nfs_attribute_cache_expired+0x16/0x70
[<ffffffff81096c3d>] ? trace_hardirqs_on_caller+0x13d/0x1c0
[<ffffffff81206e7f>] nfs_revalidate_mapping+0x2f/0x130
[<ffffffff81204d1f>] nfs_file_read+0x7f/0x120
[<ffffffff810529dc>] ? finish_task_switch+0x8c/0x100
[<ffffffff8112aa91>] do_sync_read+0xd1/0x120
[<ffffffff816f50f2>] ? __schedule+0x5c2/0xa20
[<ffffffff8112b7b8>] vfs_read+0xc8/0x180
[<ffffffff8112b960>] sys_read+0x50/0x90
[<ffffffff81700e92>] system_call_fastpath+0x16/0x1b
Code: 89 d0 d3 e0 41 39 c0 74 56 8d 7a 01 d3 e7 89 f8 c9 c3 89 c7 44 89 c0 29 f0 39 f8 7d 05 44 39 c7 7e 30 44 89 c2 44 89 c0 c1 fa 1f <f7> fe 89 c7 0f af fe 89 f8 eb da 0f 1f 40 00 f7 d9 41 89 d0 89
RIP [<ffffffff81660199>] __tcp_select_window+0xe9/0x130
RSP <ffff88022fc43cd0>
divide error: 0000 [#4] SMP
CPU 0
Modules linked in: ipmi_devintf ipmi_si ipmi_msghandler xt_recent nf_conntrack_ftp xt_state xt_owner nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 bnx2
Pid: 25177, comm: php Tainted: G D 3.1.0-hw-lockdep+ #57 Dell Inc. PowerEdge 1950/0UR033
RIP: 0010:[<ffffffff81660199>] [<ffffffff81660199>] __tcp_select_window+0xe9/0x130
RSP: 0000:ffff88022fc03cd0 EFLAGS: 00010246
RAX: 0000000000003908 RBX: ffff88001519b0cc RCX: 000000000000ffff
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000003908
RBP: ffff88022fc03cd0 R08: 0000000000003908 R09: 0000000000007fff
R10: 0000000000000014 R11: 0000000000000000 R12: ffff88000263a400
R13: ffff88016545f300 R14: 0000000000003908 R15: 0000000000000014
FS: 00007ffbdac9b720(0000) GS:ffff88022fc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000003f97000 CR3: 0000000104c4c000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process php (pid: 25177, threadinfo ffff880214e7c000, task ffff880116940000)
Stack:
ffff88022fc03d50 ffffffff816606e9 ffffffff81660de2 ffff88016545f328
0000000000000100 0000000000000100 0000000000000000 0000000000000000
0000000000000000 0000000000000000 0000000000000000 0000000000000000
Call Trace:
<IRQ>
[<ffffffff816606e9>] tcp_transmit_skb+0x209/0x8e0
[<ffffffff81660de2>] ? tcp_xmit_probe_skb+0x22/0xe0
[<ffffffff81660e8e>] tcp_xmit_probe_skb+0xce/0xe0
[<ffffffff8166252e>] tcp_write_wakeup+0x6e/0x180
[<ffffffff81664897>] tcp_keepalive_timer+0x247/0x270
[<ffffffff8106cd8d>] run_timer_softirq+0x26d/0x410
[<ffffffff8106ccb8>] ? run_timer_softirq+0x198/0x410
[<ffffffff81664650>] ? tcp_init_xmit_timers+0x20/0x20
[<ffffffff810640e8>] __do_softirq+0x138/0x250
[<ffffffff817030fc>] call_softirq+0x1c/0x30
[<ffffffff810153c5>] do_softirq+0x95/0xd0
[<ffffffff81063cbd>] irq_exit+0xdd/0x110
[<ffffffff810310c9>] smp_apic_timer_interrupt+0x69/0xa0
[<ffffffff81701973>] apic_timer_interrupt+0x73/0x80
<EOI>
[<ffffffff81700eca>] ? sysret_check+0x2e/0x69
Code: 89 d0 d3 e0 41 39 c0 74 56 8d 7a 01 d3 e7 89 f8 c9 c3 89 c7 44 89 c0 29 f0 39 f8 7d 05 44 39 c7 7e 30 44 89 c2 44 89 c0 c1 fa 1f <f7> fe 89 c7 0f af fe 89 f8 eb da 0f 1f 40 00 f7 d9 41 89 d0 89
RIP [<ffffffff81660199>] __tcp_select_window+0xe9/0x130
RSP <ffff88022fc03cd0>
(gdb) disassemble /m __tcp_select_window
Dump of assembler code for function __tcp_select_window:
1895 {
0xffffffff81628ec0 <+0>: push %rbp
0xffffffff81628ed3 <+19>: mov %rsp,%rbp
1896 struct inet_connection_sock *icsk = inet_csk(sk);
1897 struct tcp_sock *tp = tcp_sk(sk);
1898 /* MSS for the peer's data. Previous versions used mss_clamp
1899 * here. I don't know if the value based on our guesses
1900 * of peer's MSS is better for the performance. It's more correct
1901 * but may be worse for the performance because of rcv_mss
1902 * fluctuations. --SAW 1998/11/1
1903 */
1904 int mss = icsk->icsk_ack.rcv_mss;
0xffffffff81628ed6 <+22>: movzwl 0x3c6(%rdi),%r9d
0xffffffff81628f06 <+70>: movzwl %r9w,%eax
0xffffffff81628f0d <+77>: mov %eax,%esi
0xffffffff81628f0f <+79>: cmp %eax,%ecx
0xffffffff81628f14 <+84>: cmovle %ecx,%esi
1905 int free_space = tcp_space(sk);
1906 int full_space = min_t(int, tp->window_clamp, tcp_full_space(sk));
0xffffffff81628eef <+47>: mov 0x4a8(%rdi),%edx
0xffffffff81628f02 <+66>: cmp %eax,%edx
0xffffffff81628f04 <+68>: mov %eax,%ecx
0xffffffff81628f0a <+74>: cmovle %edx,%ecx
0xffffffff81628fc2 <+258>: mov 0x4a8(%rdi),%edx
1907 int window;
1908
1909 if (mss > full_space)
1910 mss = full_space;
1911
1912 if (free_space < (full_space >> 1)) {
0xffffffff81628f11 <+81>: mov %ecx,%r9d
0xffffffff81628f17 <+87>: sar %r9d
0xffffffff81628f1a <+90>: cmp %r8d,%r9d
0xffffffff81628f1d <+93>: jle 0xffffffff81628f54 <__tcp_select_window+148>
1913 icsk->icsk_ack.quick = 0;
0xffffffff81628f1f <+95>: movb $0x0,0x3b1(%rdi)
1914
1915 if (tcp_memory_pressure)
0xffffffff81628f26 <+102>: mov 0x412633(%rip),%r10d # 0xffffffff81a3b560
0xffffffff81628f2d <+109>: test %r10d,%r10d
0xffffffff81628f30 <+112>: je 0xffffffff81628f4d <__tcp_select_window+141>
1916 tp->rcv_ssthresh = min(tp->rcv_ssthresh,
0xffffffff81628f32 <+114>: movzwl 0x4b4(%rdi),%edx
0xffffffff81628f39 <+121>: mov 0x4ac(%rdi),%eax
0xffffffff81628f3f <+127>: shl $0x2,%edx
0xffffffff81628f42 <+130>: cmp %eax,%edx
0xffffffff81628f44 <+132>: cmovbe %edx,%eax
0xffffffff81628f47 <+135>: mov %eax,0x4ac(%rdi)
1917 4U * tp->advmss);
1918
1919 if (free_space < mss)
0xffffffff81628f4d <+141>: xor %eax,%eax
0xffffffff81628f4f <+143>: cmp %esi,%r8d
0xffffffff81628f52 <+146>: jl 0xffffffff81628f8e <__tcp_select_window+206>
1920 return 0;
1921 }
1922
1923 if (free_space > tp->rcv_ssthresh)
0xffffffff81628f54 <+148>: mov 0x4ac(%rdi),%eax
1924 free_space = tp->rcv_ssthresh;
0xffffffff81628f61 <+161>: cmp %eax,%r8d
0xffffffff81628f64 <+164>: cmova %eax,%r8d
1925
1926 /* Don't do rounding if we are using window scaling, since the
1927 * scaled window will not line up with the MSS boundary anyway.
1928 */
1929 window = tp->rcv_wnd;
0xffffffff81628f6b <+171>: mov 0x518(%rdi),%eax
0xffffffff81628f90 <+208>: mov %eax,%edi
1930 if (tp->rx_opt.rcv_wscale) {
0xffffffff81628f5a <+154>: movzbl 0x4f5(%rdi),%edx
0xffffffff81628f68 <+168>: test $0xf0,%dl
0xffffffff81628f71 <+177>: je 0xffffffff81628f90 <__tcp_select_window+208>
1931 window = free_space;
1932
1933 /* Advertise enough space so that it won't get scaled away.
1934 * Import case: prevent zero window announcement if
1935 * 1<<rcv_wscale > mss.
1936 */
1937 if (((window >> tp->rx_opt.rcv_wscale) << tp->rx_opt.rcv_wscale) != window)
0xffffffff81628f73 <+179>: shr $0x4,%dl
0xffffffff81628f76 <+182>: movzbl %dl,%ecx
0xffffffff81628f79 <+185>: mov %r8d,%edx
0xffffffff81628f7c <+188>: sar %cl,%edx
0xffffffff81628f7e <+190>: mov %edx,%eax
0xffffffff81628f80 <+192>: shl %cl,%eax
0xffffffff81628f82 <+194>: cmp %eax,%r8d
0xffffffff81628f85 <+197>: je 0xffffffff81628fdd <__tcp_select_window+285>
1938 window = (((window >> tp->rx_opt.rcv_wscale) + 1)
0xffffffff81628f87 <+199>: lea 0x1(%rdx),%edi
0xffffffff81628f8a <+202>: shl %cl,%edi
1939 << tp->rx_opt.rcv_wscale);
1940 } else {
1941 /* Get the largest window that is a nice multiple of mss.
1942 * Window clamp already applied above.
1943 * If our current window offering is within 1 mss of the
1944 * free space we just keep it. This prevents the divide
1945 * and multiply from happening most of the time.
1946 * We also don't do any window rounding when the free space
1947 * is too small.
1948 */
1949 if (window <= free_space - mss || window > free_space)
0xffffffff81628f92 <+210>: mov %r8d,%eax
0xffffffff81628f95 <+213>: sub %esi,%eax
0xffffffff81628f97 <+215>: cmp %edi,%eax
0xffffffff81628f99 <+217>: jge 0xffffffff81628fa0 <__tcp_select_window+224>
0xffffffff81628f9b <+219>: cmp %r8d,%edi
0xffffffff81628f9e <+222>: jle 0xffffffff81628fd0 <__tcp_select_window+272>
1950 window = (free_space / mss) * mss;
0xffffffff81628fa0 <+224>: mov %r8d,%edx
0xffffffff81628fa3 <+227>: mov %r8d,%eax
0xffffffff81628fa6 <+230>: sar $0x1f,%edx
0xffffffff81628fa9 <+233>: idiv %esi
0xffffffff81628fab <+235>: mov %eax,%edi
0xffffffff81628fad <+237>: imul %esi,%edi
1951 else if (mss == full_space &&
0xffffffff81628fd0 <+272>: cmp %esi,%ecx
0xffffffff81628fd2 <+274>: jne 0xffffffff81628f8c <__tcp_select_window+204>
0xffffffff81628fd4 <+276>: lea (%r9,%rdi,1),%eax
0xffffffff81628fd8 <+280>: cmp %eax,%r8d
0xffffffff81628fdb <+283>: jle 0xffffffff81628f8c <__tcp_select_window+204>
0xffffffff81628fdd <+285>: mov %r8d,%edi
1952 free_space > window + (full_space >> 1))
1953 window = free_space;
1954 }
1955
1956 return window;
0xffffffff81628f8c <+204>: mov %edi,%eax
0xffffffff81628fb0 <+240>: mov %edi,%eax
0xffffffff81628fb2 <+242>: jmp 0xffffffff81628f8e <__tcp_select_window+206>
0xffffffff81628fb4 <+244>: nopl 0x0(%rax)
0xffffffff81628fe0 <+288>: mov %edi,%eax
0xffffffff81628fe2 <+290>: jmp 0xffffffff81628f8e <__tcp_select_window+206>
0xffffffff81628fe4: data32 data32 nopw %cs:0x0(%rax,%rax,1)
1957 }
0xffffffff81628f8e <+206>: leaveq
0xffffffff81628f8f <+207>: retq
End of assembler dump.
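
Reading the second oops against this listing: the faulting bytes "<f7> fe" decode to an idiv by %esi, which lines up with the division on source line 1950, and the register dump shows RSI = 0 with R8 = 0x3908. That suggests mss was 0 while free_space was 14600 when the window was being rounded. The userspace sketch below only reproduces that rounding arithmetic; round_window_to_mss() and its "mss <= 0" guard are illustrative assumptions, not a proposed kernel fix.

```c
/*
 * Userspace sketch of the rounding done on source line 1950:
 *     window = (free_space / mss) * mss;
 * The mss <= 0 guard is an assumption added for illustration; without
 * it, mss == 0 raises the divide error seen in the oops.
 */
static int round_window_to_mss(int free_space, int mss)
{
	if (mss <= 0)
		return 0;	/* no meaningful MSS, advertise nothing */

	/* Largest multiple of mss that still fits in free_space. */
	return (free_space / mss) * mss;
}
```

With free_space = 14600 (the R8 value above) and an mss of 1460 the result is 14600; with mss = 0 the sketch returns 0 instead of trapping.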
^ permalink raw reply
* Re: [PATCH 2/5] net/sunrpc: use kstrtoul, etc
From: Alexey Dobriyan @ 2011-11-08 20:45 UTC (permalink / raw)
To: Julia Lawall
Cc: Julia Lawall, J. Bruce Fields,
kernel-janitors-u79uwXL29TY76Z2rM5mHXA, Neil Brown,
Trond Myklebust, David S. Miller,
linux-nfs-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <alpine.DEB.2.02.1111082113380.1880@hadrien>
On Tue, Nov 08, 2011 at 09:19:30PM +0100, Julia Lawall wrote:
>
>
> On Tue, 8 Nov 2011, Alexey Dobriyan wrote:
>
> > On Sun, Nov 06, 2011 at 02:26:47PM +0100, Julia Lawall wrote:
> >> @@
> >> expression a,b;
> >> {int,long} *c;
> >> @@
> >>
> >> -strict_strtoul
> >> +kstrtoul
> >
> > No, no, no!
>
> Sorry, this was not the real rule I used for the strtoul case. Instead I
> used the following:
>
> @@
> typedef ulong;
> expression a,b;
> {ulong,unsigned long,unsigned int,size_t} *c;
> @@
>
> -strict_strtoul
> +kstrtoul
> (a,b,c)
>
> But now I have seen that there is a separate function for integers, so I
> have made a rule to use that function when the type is unsigned int.
>
> > In every case, look at the type of the real data and use the appropriate function.
> > kstrtou8() for ports.
>
> The type of the destination variable in all of these cases is unsigned
> long. But maybe that is not enough information to make the
> transformation in the right way.
That's because previous functions following libc didn't accept anything
less than unsigned long.
For these conversions, one should literally look at every use case,
see what type the data really has (not unsigned long), make the
conversion, and remove explicit EINVAL checks if necessary.
In sunrpc case: switch to kstrtou8 + remove "> 255" check.
This program doesn't and won't do that.
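
A rough userspace illustration of the point about choosing the parser by the real data type: when the destination is an 8-bit quantity, the range check belongs inside the conversion, so the caller's separate "> 255" test disappears. parse_u8_strict() below is a hypothetical stand-in built on strtoul(); it is not the kernel's kstrtou8() and does not claim to match its exact parsing rules.

```c
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse a decimal string that must fit in 8 bits.
 * Returns the value (0..255) on success, -EINVAL for malformed input
 * and -ERANGE for values that do not fit.
 */
static int parse_u8_strict(const char *s)
{
	char *end;
	unsigned long v;

	/* strtoul() would accept leading spaces and signs; reject them. */
	if (!isdigit((unsigned char)s[0]))
		return -EINVAL;

	errno = 0;
	v = strtoul(s, &end, 10);
	if (*end != '\0')
		return -EINVAL;		/* trailing garbage such as "12x" */
	if (errno == ERANGE || v > 255)
		return -ERANGE;		/* the range check lives here */

	return (int)v;
}
```

A caller parsing one port octet would then only test for a negative return value, with no further bound check of its own.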
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* [PATCH net-next] sch_choke: use skb_header_pointer()
From: Eric Dumazet @ 2011-11-08 20:45 UTC (permalink / raw)
To: David Miller; +Cc: netdev
Remove the assumption that skb_get_rxhash() makes the IP header and ports
linear, and use skb_header_pointer() in choke_match_flow() instead.
This permits __skb_get_rxhash() to use skb_header_pointer() eventually.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
net/sched/sch_choke.c | 27 +++++++++++++++++----------
1 file changed, 17 insertions(+), 10 deletions(-)
diff --git a/net/sched/sch_choke.c b/net/sched/sch_choke.c
index 3422b25..061bcb7 100644
--- a/net/sched/sch_choke.c
+++ b/net/sched/sch_choke.c
@@ -152,15 +152,14 @@ static bool choke_match_flow(struct sk_buff *skb1,
{
int off1, off2, poff;
const u32 *ports1, *ports2;
+ u32 _ports1, _ports2;
u8 ip_proto;
__u32 hash1;
if (skb1->protocol != skb2->protocol)
return false;
- /* Use hash value as quick check
- * Assumes that __skb_get_rxhash makes IP header and ports linear
- */
+ /* Use rxhash value as quick check */
hash1 = skb_get_rxhash(skb1);
if (!hash1 || hash1 != skb_get_rxhash(skb2))
return false;
@@ -172,10 +171,12 @@ static bool choke_match_flow(struct sk_buff *skb1,
switch (skb1->protocol) {
case __constant_htons(ETH_P_IP): {
const struct iphdr *ip1, *ip2;
+ struct iphdr _ip1, _ip2;
- ip1 = (const struct iphdr *) (skb1->data + off1);
- ip2 = (const struct iphdr *) (skb2->data + off2);
-
+ ip1 = skb_header_pointer(skb1, off1, sizeof(_ip1), &_ip1);
+ ip2 = skb_header_pointer(skb2, off2, sizeof(_ip2), &_ip2);
+ if (!ip1 || !ip2)
+ return false;
ip_proto = ip1->protocol;
if (ip_proto != ip2->protocol ||
ip1->saddr != ip2->saddr || ip1->daddr != ip2->daddr)
@@ -190,9 +191,12 @@ static bool choke_match_flow(struct sk_buff *skb1,
case __constant_htons(ETH_P_IPV6): {
const struct ipv6hdr *ip1, *ip2;
+ struct ipv6hdr _ip1, _ip2;
- ip1 = (const struct ipv6hdr *) (skb1->data + off1);
- ip2 = (const struct ipv6hdr *) (skb2->data + off2);
+ ip1 = skb_header_pointer(skb1, off1, sizeof(_ip1), &_ip1);
+ ip2 = skb_header_pointer(skb2, off2, sizeof(_ip2), &_ip2);
+ if (!ip1 || !ip2)
+ return false;
ip_proto = ip1->nexthdr;
if (ip_proto != ip2->nexthdr ||
@@ -214,8 +218,11 @@ static bool choke_match_flow(struct sk_buff *skb1,
off1 += poff;
off2 += poff;
- ports1 = (__force u32 *)(skb1->data + off1);
- ports2 = (__force u32 *)(skb2->data + off2);
+ ports1 = skb_header_pointer(skb1, off1, sizeof(_ports1), &_ports1);
+ ports2 = skb_header_pointer(skb2, off2, sizeof(_ports2), &_ports2);
+ if (!ports1 || !ports2)
+ return false;
+
return *ports1 == *ports2;
}
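
For readers unfamiliar with the pattern used in this patch, here is a small userspace sketch of the idea: hand back a direct pointer when the requested bytes sit in the linear area, copy them into a caller-supplied buffer when they do not, and return NULL when the packet is too short. struct pkt and pkt_header_pointer() are hypothetical names for illustration only; this is not the kernel implementation of skb_header_pointer().

```c
#include <stddef.h>

/* Hypothetical two-part packet: a linear head plus one extra fragment. */
struct pkt {
	const unsigned char *head;	/* linear area */
	size_t head_len;
	const unsigned char *frag;	/* non-linear remainder */
	size_t frag_len;
};

/*
 * Return a pointer to len bytes at offset off. If the range lies in
 * the linear area, point straight into it; otherwise gather the bytes
 * into the caller's buffer. NULL means the packet is too short.
 */
static const void *pkt_header_pointer(const struct pkt *p, size_t off,
				      size_t len, void *buf)
{
	size_t total = p->head_len + p->frag_len;
	unsigned char *dst = buf;
	size_t i;

	if (off > total || len > total - off)
		return NULL;
	if (off + len <= p->head_len)
		return p->head + off;

	/* Range crosses into the fragment: copy byte by byte. */
	for (i = 0; i < len; i++) {
		size_t pos = off + i;

		dst[i] = pos < p->head_len ? p->head[pos]
					   : p->frag[pos - p->head_len];
	}
	return buf;
}
```

The caller pattern matches the hunks above: declare a local of the header type, pass its address as the scratch buffer, and bail out when the result is NULL.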
^ permalink raw reply related
* Re: [PATCH] net: drivers/net/hippi/Kconfig should be sourced
From: Jeff Kirsher @ 2011-11-08 20:40 UTC (permalink / raw)
To: Paul Bolle
Cc: David S. Miller, netdev@vger.kernel.org,
linux-kernel@vger.kernel.org
In-Reply-To: <1320784270.14409.404.camel@x61.thuisdomein>
On Tue, 2011-11-08 at 12:31 -0800, Paul Bolle wrote:
> Commit ff5a3b509e ("hippi: Move the HIPPI driver") moved the HIPPI
> driver into drivers/net/hippi. It didn't source
> drivers/net/hippi/Kconfig though, so it didn't make all necessary
> Kconfig changes. So let drivers/net/Kconfig source HIPPI's Kconfig file.
>
> Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
> ---
> git grep tested only. Perhaps the exact spot where
> drivers/net/hippi/Kconfig gets sourced is relevant, so this needs
> (build) testing by people actually familiar with the HIPPI driver.
>
> drivers/net/Kconfig | 2 ++
> 1 files changed, 2 insertions(+), 0 deletions(-)
>
> diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
> index 583f66c..654a5e9 100644
> --- a/drivers/net/Kconfig
> +++ b/drivers/net/Kconfig
> @@ -245,6 +245,8 @@ source "drivers/net/ethernet/Kconfig"
>
> source "drivers/net/fddi/Kconfig"
>
> +source "drivers/net/hippi/Kconfig"
> +
> config NET_SB1000
> tristate "General Instruments Surfboard 1000"
> depends on PNP
Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
^ permalink raw reply
* net-next open for business
From: David Miller @ 2011-11-08 20:36 UTC (permalink / raw)
To: netdev; +Cc: linux-wireless, netfilter-devel
As promised, now that the merge window is closed, the net-next is once
again open for business.
Damn the torpedoes, full speed ahead!
^ permalink raw reply
* [PATCH] net: drivers/net/hippi/Kconfig should be sourced
From: Paul Bolle @ 2011-11-08 20:31 UTC (permalink / raw)
To: David S. Miller; +Cc: netdev, linux-kernel, Jeff Kirsher
Commit ff5a3b509e ("hippi: Move the HIPPI driver") moved the HIPPI
driver into drivers/net/hippi. It didn't source
drivers/net/hippi/Kconfig though, so it didn't make all necessary
Kconfig changes. So let drivers/net/Kconfig source HIPPI's Kconfig file.
Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
---
git grep tested only. Perhaps the exact spot where
drivers/net/hippi/Kconfig gets sourced is relevant, so this needs
(build) testing by people actually familiar with the HIPPI driver.
drivers/net/Kconfig | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)
diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index 583f66c..654a5e9 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -245,6 +245,8 @@ source "drivers/net/ethernet/Kconfig"
source "drivers/net/fddi/Kconfig"
+source "drivers/net/hippi/Kconfig"
+
config NET_SB1000
tristate "General Instruments Surfboard 1000"
depends on PNP
--
1.7.4.4
^ permalink raw reply related
* Re: [PATCH v2 net-next] sweep the floors and convert some .get_drvinfo routines to strlcpy
From: David Miller @ 2011-11-08 20:17 UTC (permalink / raw)
To: raj
Cc: netdev, dave, klassert, ionut, jcliburn, chris.snook, mchan,
eilong, mcarlson, rmody, grundler, ajit.khaparde, buytenh,
shemminger, thockin, jdmason, sony.chacko, anirban.chakraborty,
romieu, steve.glendinning
In-Reply-To: <20111107232927.ADF412900440@tardy>
From: raj@tardy.cup.hp.com (Rick Jones)
Date: Mon, 7 Nov 2011 15:29:27 -0800 (PST)
> From: Rick Jones <rick.jones2@hp.com>
>
> Per the mention made by Ben Hutchings that strlcpy is now the preferred
> string copy routine for a .get_drvinfo routine, do a bit of floor
> sweeping and convert some of the as-yet unconverted ethernet drivers to
> it.
>
> Signed-off-by: Rick Jones <rick.jones2@hp.com>
Applied.
^ permalink raw reply
* Re: [PATCH 2/5] net/sunrpc: use kstrtoul, etc
From: Julia Lawall @ 2011-11-08 20:19 UTC (permalink / raw)
To: Alexey Dobriyan
Cc: Julia Lawall, J. Bruce Fields, kernel-janitors, Neil Brown,
Trond Myklebust, David S. Miller, linux-nfs, netdev, linux-kernel
In-Reply-To: <20111108193817.GA3453@p183.telecom.by>
On Tue, 8 Nov 2011, Alexey Dobriyan wrote:
> On Sun, Nov 06, 2011 at 02:26:47PM +0100, Julia Lawall wrote:
>> @@
>> expression a,b;
>> {int,long} *c;
>> @@
>>
>> -strict_strtoul
>> +kstrtoul
>
> No, no, no!
Sorry, this was not the real rule I used for the strtoul case. Instead I
used the following:
@@
typedef ulong;
expression a,b;
{ulong,unsigned long,unsigned int,size_t} *c;
@@
-strict_strtoul
+kstrtoul
(a,b,c)
But now I have seen that there is a separate function for integers, so I
have made a rule to use that function when the type is unsigned int.
> In every case, look at the type of the real data and use the appropriate function.
> kstrtou8() for ports.
The type of the destination variable in all of these cases is unsigned
long. But maybe that is not enough information to make the
transformation in the right way.
julia
> This program creates lots of bogus patches in this case.
>
>> --- a/net/sunrpc/addr.c
>> +++ b/net/sunrpc/addr.c
>> @@ -322,7 +322,7 @@ size_t rpc_uaddr2sockaddr(const char *ua
>> c = strrchr(buf, '.');
>> if (unlikely(c == NULL))
>> return 0;
>> - if (unlikely(strict_strtoul(c + 1, 10, &portlo) != 0))
>> + if (unlikely(kstrtoul(c + 1, 10, &portlo) != 0))
>> return 0;
>> if (unlikely(portlo > 255))
>> return 0;
>> @@ -331,7 +331,7 @@ size_t rpc_uaddr2sockaddr(const char *ua
>> c = strrchr(buf, '.');
>> if (unlikely(c == NULL))
>> return 0;
>> - if (unlikely(strict_strtoul(c + 1, 10, &porthi) != 0))
>> + if (unlikely(kstrtoul(c + 1, 10, &porthi) != 0))
>> return 0;
>> if (unlikely(porthi > 255))
>> return 0;
^ permalink raw reply
* Re: [PATCH] net: better pcpu data alignment
From: David Miller @ 2011-11-08 20:17 UTC (permalink / raw)
To: eric.dumazet; +Cc: netdev
In-Reply-To: <1320484768.16609.22.camel@edumazet-laptop>
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Sat, 05 Nov 2011 10:19:28 +0100
> Tunnels can force an alignment of their percpu data to reduce number of
> cache lines used in fast path, or read in .ndo_get_stats()
>
> percpu_alloc() is a very fine grained allocator, so any small hole will
> be used anyway.
>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Applied.
^ permalink raw reply
* Re: [PATCH 2/2] net: make ipv6 PKTINFO honour freebind
From: David Miller @ 2011-11-08 20:17 UTC (permalink / raw)
To: zenczykowski; +Cc: maze, netdev
In-Reply-To: <1320713842-21152-2-git-send-email-zenczykowski@gmail.com>
From: Maciej Żenczykowski <zenczykowski@gmail.com>
Date: Mon, 7 Nov 2011 16:57:22 -0800
> From: Maciej Żenczykowski <maze@google.com>
>
> This just makes it possible to spoof source IPv6 address on a socket
> without having to create and bind a new socket for every source IP
> we wish to spoof.
>
> Signed-off-by: Maciej Żenczykowski <maze@google.com>
Applied.
^ permalink raw reply
* Re: [PATCH 1/2] net: make ipv6 bind honour freebind
From: David Miller @ 2011-11-08 20:17 UTC (permalink / raw)
To: zenczykowski; +Cc: maze, netdev
In-Reply-To: <1320713842-21152-1-git-send-email-zenczykowski@gmail.com>
From: Maciej Żenczykowski <zenczykowski@gmail.com>
Date: Mon, 7 Nov 2011 16:57:21 -0800
> From: Maciej Żenczykowski <maze@google.com>
>
> This makes native ipv6 bind follow the precedent set by:
> - native ipv4 bind behaviour
> - dual stack ipv4-mapped ipv6 bind behaviour.
>
> This does allow an unprivileged process to spoof its source IPv6
> address, just like it currently can spoof its source IPv4 address
> (for example when using UDP).
>
> Signed-off-by: Maciej Żenczykowski <maze@google.com>
Applied.
^ permalink raw reply
* Re: [RESEND] ll_temac: Add support for phy_mii_ioctl
From: David Miller @ 2011-11-08 20:16 UTC (permalink / raw)
To: ricardo.ribalda
Cc: ian.campbell, eric.dumazet, jeffrey.t.kirsher, jpirko, netdev,
linux-kernel
In-Reply-To: <1320745665-11688-1-git-send-email-ricardo.ribalda@gmail.com>
From: Ricardo Ribalda Delgado <ricardo.ribalda@gmail.com>
Date: Tue, 8 Nov 2011 10:47:45 +0100
> This patch enables the ioctl support for the driver. So userspace
> programs like mii-tool can work.
>
> Resend in merge window
>
> Signed-off-by: Ricardo Ribalda Delgado <ricardo.ribalda@gmail.com>
Applied.
^ permalink raw reply
* Re: [PATCH 0/2] sctp: Patches for fasthandoff
From: David Miller @ 2011-11-08 20:16 UTC (permalink / raw)
To: micchie; +Cc: weiyj.lk, netdev
In-Reply-To: <27205644-2484-4072-8689-1504958C9B73@sfc.wide.ad.jp>
From: Michio Honda <micchie@sfc.wide.ad.jp>
Date: Wed, 9 Nov 2011 01:23:01 +0900
> Series of 2 patches for retransmission immediately after the address reconfiguration including disconnected period.
Both applied, thanks.
^ permalink raw reply
* [PATCH] neigh: increase unres_qlen by one magnitude
From: Eric Dumazet @ 2011-11-08 20:11 UTC (permalink / raw)
To: David Miller; +Cc: netdev
unres_qlen is the number of frames we are able to queue per unresolved
neighbour. Its default value (3) was never changed and is responsible
for strange drops, especially if IP fragments are used, or multiple
sessions start in parallel. TCP initial congestion window is now bigger
than 3.
$ arp -d 192.168.20.108 ; ping -c 2 -s 8000 192.168.20.108
PING 192.168.20.108 (192.168.20.108) 8000(8028) bytes of data.
8008 bytes from 192.168.20.108: icmp_seq=2 ttl=64 time=0.322 ms
--- 192.168.20.108 ping statistics ---
2 packets transmitted, 1 received, 50% packet loss, time 1001ms
rtt min/avg/max/mdev = 0.322/0.322/0.322/0.000 ms
Since available memory per machine has increased quite a lot since 1999, I
believe it's safe to expand unres_qlen to a more reasonable value.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
Documentation/networking/ip-sysctl.txt | 4 ++++
net/atm/clip.c | 2 +-
net/decnet/dn_neigh.c | 2 +-
net/ipv4/arp.c | 2 +-
net/ipv6/ndisc.c | 2 +-
5 files changed, 8 insertions(+), 4 deletions(-)
diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index cb7f314..ff44664 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -31,6 +31,10 @@ neigh/default/gc_thresh3 - INTEGER
when using large numbers of interfaces and when communicating
with large numbers of directly-connected peers.
+neigh/default/unres_qlen - INTEGER
+ The maximum number of packets which may be queued for each
+ unresolved address by other network layers.
+
mtu_expires - INTEGER
Time, in seconds, that cached PMTU information is kept.
diff --git a/net/atm/clip.c b/net/atm/clip.c
index 8523940..50ebab6 100644
--- a/net/atm/clip.c
+++ b/net/atm/clip.c
@@ -329,7 +329,7 @@ static struct neigh_table clip_tbl = {
.gc_staletime = 60 * HZ,
.reachable_time = 30 * HZ,
.delay_probe_time = 5 * HZ,
- .queue_len = 3,
+ .queue_len = 32,
.ucast_probes = 3,
.mcast_probes = 3,
.anycast_delay = 1 * HZ,
diff --git a/net/decnet/dn_neigh.c b/net/decnet/dn_neigh.c
index 7f0eb08..c26abe2 100644
--- a/net/decnet/dn_neigh.c
+++ b/net/decnet/dn_neigh.c
@@ -107,7 +107,7 @@ struct neigh_table dn_neigh_table = {
.gc_staletime = 60 * HZ,
.reachable_time = 30 * HZ,
.delay_probe_time = 5 * HZ,
- .queue_len = 3,
+ .queue_len = 32,
.ucast_probes = 0,
.app_probes = 0,
.mcast_probes = 0,
diff --git a/net/ipv4/arp.c b/net/ipv4/arp.c
index 96a164a..66e7eb0 100644
--- a/net/ipv4/arp.c
+++ b/net/ipv4/arp.c
@@ -177,7 +177,7 @@ struct neigh_table arp_tbl = {
.gc_staletime = 60 * HZ,
.reachable_time = 30 * HZ,
.delay_probe_time = 5 * HZ,
- .queue_len = 3,
+ .queue_len = 32,
.ucast_probes = 3,
.mcast_probes = 3,
.anycast_delay = 1 * HZ,
diff --git a/net/ipv6/ndisc.c b/net/ipv6/ndisc.c
index 44e5b7f..0250a88 100644
--- a/net/ipv6/ndisc.c
+++ b/net/ipv6/ndisc.c
@@ -141,7 +141,7 @@ struct neigh_table nd_tbl = {
.gc_staletime = 60 * HZ,
.reachable_time = ND_REACHABLE_TIME,
.delay_probe_time = 5 * HZ,
- .queue_len = 3,
+ .queue_len = 32,
.ucast_probes = 3,
.mcast_probes = 3,
.anycast_delay = 1 * HZ,
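
To make the ping example in the changelog concrete, the sketch below counts how many IPv4 fragments a datagram needs. It assumes a 1500-byte MTU (not stated in the changelog) and a 20-byte IPv4 header without options; ipv4_fragments() is a hypothetical helper, not kernel code. The 8000-byte ping carries 8008 bytes after the IP header (8-byte ICMP header plus data), which under these assumptions is 6 fragments, more than the old queue length of 3, and consistent with the first echo request being lost while the neighbour is unresolved.

```c
/*
 * Number of IPv4 fragments needed to carry l4_len bytes (everything
 * after the IP header) over a link with the given MTU. Assumes a
 * 20-byte IPv4 header without options. Every fragment except the last
 * must carry a multiple of 8 bytes.
 */
static unsigned int ipv4_fragments(unsigned int l4_len, unsigned int mtu)
{
	unsigned int max_payload, per_frag;

	if (mtu < 28)		/* cannot carry even 8 bytes per fragment */
		return 0;

	max_payload = mtu - 20;		/* what one packet can hold */
	per_frag = max_payload & ~7u;	/* non-final fragments: 8-byte units */

	if (l4_len <= max_payload)
		return 1;		/* fits unfragmented */

	/* The last fragment may hold max_payload; the rest hold per_frag. */
	return 1 + (l4_len - max_payload + per_frag - 1) / per_frag;
}
```

For example, ipv4_fragments(8008, 1500) evaluates to 6, so a queue of 32 holds the whole datagram with room to spare.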
^ permalink raw reply related
* Re: [PATCH net-next] W5300: Add WIZnet W5300 Ethernet driver
From: Ben Hutchings @ 2011-11-08 20:06 UTC (permalink / raw)
To: Taehun Kim; +Cc: David S. Miller, netdev, linux-kernel, suhwan, bongbong
In-Reply-To: <1320779368.2799.55.camel@bwh-desktop>
On Tue, 2011-11-08 at 19:09 +0000, Ben Hutchings wrote:
> On Mon, 2011-11-07 at 23:37 +0900, Taehun Kim wrote:
[...]
> > +/* Default MAC address. */
> > +static __initdata u8 w5300_defmac[6] = {0x00, 0x08, 0xDC, 0xA0, 0x00, 0x01};
>
> This is not suitable as a default MAC address.
Really you mustn't use a fixed default at all.
[...]
> > +/* Interrupt Handler(ISR) */
> > +static irqreturn_t wiz_interrupt(int irq, void *dev_instance)
> > +{
> > + struct net_device *dev = dev_instance;
> > + struct wiz_private *wp = netdev_priv(dev);
> > + int timeout = 100;
> > + u16 isr, ssr;
> > + int s;
> > +
> > + isr = w5300_read(wp, IR);
> > +
> > + /* Completing all interrupts at a time. */
> > + while (isr && timeout--) {
>
> Why would you need to repeat this? You disable the interrupt
[...]
I'm not sure what I was starting to say there.
But I really don't see any justification for this loop. Perhaps it's
left over from a non-NAPI implementation? Just acknowledge the
interrupt, schedule NAPI as appropriate, and let the kernel call the
interrupt handler again if another interrupt is raised.
Ben.
--
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.
^ permalink raw reply
* [stable v2] net: Handle different key sizes between address families in flow cache
From: Kim Phillips @ 2011-11-08 19:44 UTC (permalink / raw)
To: Greg KH; +Cc: David Miller, stable, eric.dumazet, zheng.z.yan, netdev,
David Ward
In-Reply-To: <20111108165317.GB25206@kroah.com>
commit aa1c366e4febc7f5c2b84958a2dd7cd70e28f9d0 upstream.
With the conversion of struct flowi to a union of AF-specific structs, some
operations on the flow cache need to account for the exact size of the key.
Signed-off-by: David Ward <david.ward@ll.mit.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
Cc: <stable@vger.kernel.org> # v3.0.x: 728871b: net: Align AF-specific flowi structs to long
Cc: <stable@vger.kernel.org> # v3.0.x
---
v2: resend, with 64-bit build fix dependency 728871b included in sign-off area
include/net/flow.h | 19 +++++++++++++++++++
net/core/flow.c | 31 +++++++++++++++++--------------
2 files changed, 36 insertions(+), 14 deletions(-)
diff --git a/include/net/flow.h b/include/net/flow.h
index c6d5fe5..93a8785 100644
--- a/include/net/flow.h
+++ b/include/net/flow.h
@@ -7,6 +7,7 @@
#ifndef _NET_FLOW_H
#define _NET_FLOW_H
+#include <linux/socket.h>
#include <linux/in6.h>
#include <asm/atomic.h>
@@ -161,6 +162,24 @@ static inline struct flowi *flowidn_to_flowi(struct flowidn *fldn)
return container_of(fldn, struct flowi, u.dn);
}
+typedef unsigned long flow_compare_t;
+
+static inline size_t flow_key_size(u16 family)
+{
+ switch (family) {
+ case AF_INET:
+ BUILD_BUG_ON(sizeof(struct flowi4) % sizeof(flow_compare_t));
+ return sizeof(struct flowi4) / sizeof(flow_compare_t);
+ case AF_INET6:
+ BUILD_BUG_ON(sizeof(struct flowi6) % sizeof(flow_compare_t));
+ return sizeof(struct flowi6) / sizeof(flow_compare_t);
+ case AF_DECnet:
+ BUILD_BUG_ON(sizeof(struct flowidn) % sizeof(flow_compare_t));
+ return sizeof(struct flowidn) / sizeof(flow_compare_t);
+ }
+ return 0;
+}
+
#define FLOW_DIR_IN 0
#define FLOW_DIR_OUT 1
#define FLOW_DIR_FWD 2
diff --git a/net/core/flow.c b/net/core/flow.c
index 990703b..a6bda2a 100644
--- a/net/core/flow.c
+++ b/net/core/flow.c
@@ -172,29 +172,26 @@ static void flow_new_hash_rnd(struct flow_cache *fc,
static u32 flow_hash_code(struct flow_cache *fc,
struct flow_cache_percpu *fcp,
- const struct flowi *key)
+ const struct flowi *key,
+ size_t keysize)
{
const u32 *k = (const u32 *) key;
+ const u32 length = keysize * sizeof(flow_compare_t) / sizeof(u32);
- return jhash2(k, (sizeof(*key) / sizeof(u32)), fcp->hash_rnd)
+ return jhash2(k, length, fcp->hash_rnd)
& (flow_cache_hash_size(fc) - 1);
}
-typedef unsigned long flow_compare_t;
-
/* I hear what you're saying, use memcmp. But memcmp cannot make
- * important assumptions that we can here, such as alignment and
- * constant size.
+ * important assumptions that we can here, such as alignment.
*/
-static int flow_key_compare(const struct flowi *key1, const struct flowi *key2)
+static int flow_key_compare(const struct flowi *key1, const struct flowi *key2,
+ size_t keysize)
{
const flow_compare_t *k1, *k1_lim, *k2;
- const int n_elem = sizeof(struct flowi) / sizeof(flow_compare_t);
-
- BUILD_BUG_ON(sizeof(struct flowi) % sizeof(flow_compare_t));
k1 = (const flow_compare_t *) key1;
- k1_lim = k1 + n_elem;
+ k1_lim = k1 + keysize;
k2 = (const flow_compare_t *) key2;
@@ -215,6 +212,7 @@ flow_cache_lookup(struct net *net, const struct flowi *key, u16 family, u8 dir,
struct flow_cache_entry *fle, *tfle;
struct hlist_node *entry;
struct flow_cache_object *flo;
+ size_t keysize;
unsigned int hash;
local_bh_disable();
@@ -222,6 +220,11 @@ flow_cache_lookup(struct net *net, const struct flowi *key, u16 family, u8 dir,
fle = NULL;
flo = NULL;
+
+ keysize = flow_key_size(family);
+ if (!keysize)
+ goto nocache;
+
/* Packet really early in init? Making flow_cache_init a
* pre-smp initcall would solve this. --RR */
if (!fcp->hash_table)
@@ -230,11 +233,11 @@ flow_cache_lookup(struct net *net, const struct flowi *key, u16 family, u8 dir,
if (fcp->hash_rnd_recalc)
flow_new_hash_rnd(fc, fcp);
- hash = flow_hash_code(fc, fcp, key);
+ hash = flow_hash_code(fc, fcp, key, keysize);
hlist_for_each_entry(tfle, entry, &fcp->hash_table[hash], u.hlist) {
if (tfle->family == family &&
tfle->dir == dir &&
- flow_key_compare(key, &tfle->key) == 0) {
+ flow_key_compare(key, &tfle->key, keysize) == 0) {
fle = tfle;
break;
}
@@ -248,7 +251,7 @@ flow_cache_lookup(struct net *net, const struct flowi *key, u16 family, u8 dir,
if (fle) {
fle->family = family;
fle->dir = dir;
- memcpy(&fle->key, key, sizeof(*key));
+ memcpy(&fle->key, key, keysize * sizeof(flow_compare_t));
fle->object = NULL;
hlist_add_head(&fle->u.hlist, &fcp->hash_table[hash]);
fcp->hash_count++;
--
1.7.7.2
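
A userspace sketch of why the key size has to follow the address family: once the key is a union of per-family structs, a shorter key only occupies the leading words, and whatever follows it should not take part in hashing or comparison. The types and names below (cmp_word_t, struct key4, struct key6, key_words(), key_compare()) are hypothetical and merely mirror the shape of flow_key_size() and flow_key_compare() in the patch.

```c
#include <stddef.h>

typedef unsigned long cmp_word_t;

/* Hypothetical per-family keys, each a whole number of words. */
struct key4 { cmp_word_t w[2]; };
struct key6 { cmp_word_t w[5]; };

union flow_key {
	struct key4 v4;
	struct key6 v6;
};

enum { FAM_V4 = 4, FAM_V6 = 6 };

/* Key length in words for a family, or 0 if the family is unknown. */
static size_t key_words(int family)
{
	switch (family) {
	case FAM_V4:
		return sizeof(struct key4) / sizeof(cmp_word_t);
	case FAM_V6:
		return sizeof(struct key6) / sizeof(cmp_word_t);
	}
	return 0;
}

/* Compare only the first nwords words; returns 0 when they match. */
static int key_compare(const union flow_key *a, const union flow_key *b,
		       size_t nwords)
{
	const cmp_word_t *k1 = (const cmp_word_t *)a;
	const cmp_word_t *k2 = (const cmp_word_t *)b;
	size_t i;

	for (i = 0; i < nwords; i++)
		if (k1[i] != k2[i])
			return 1;
	return 0;
}
```

Two keys that agree in their first two words compare equal under the v4 length even if the rest of the union differs, and an unknown family yields a length of 0, which the caller can treat as "do not cache".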
^ permalink raw reply related
* Re: [PATCH net 0/4] ipv4: various pmtu discovery fixes
From: David Miller @ 2011-11-08 19:41 UTC (permalink / raw)
To: steffen.klassert; +Cc: netdev
In-Reply-To: <20111011.155451.870811156460684722.davem@davemloft.net>
Patch 4/4 looked fine so I applied that one as-is, thanks!
^ permalink raw reply
* Re: [PATCH 3/4] ipv4: Fix inetpeer expiration handling
From: David Miller @ 2011-11-08 19:38 UTC (permalink / raw)
To: gaofeng; +Cc: steffen.klassert, netdev
In-Reply-To: <4E9FBEA1.2050407@cn.fujitsu.com>
From: Gao feng <gaofeng@cn.fujitsu.com>
Date: Thu, 20 Oct 2011 14:24:33 +0800
> There are several problems.
> 1: rt->peer may be NULL; we should call rt_bind_peer just like the code below.
> 2: rt->peer_pmtu_orig is null. If we haven't sent a packet before, check_peer_pmtu() hasn't been called,
> so peer->pmtu_orig is null.
> 3: When rt->rt_peer_genid != rt_peer_genid(), there is no need to call dst_metric_set(dst, RTAX_MTU, rt->peer->pmtu_orig),
> because check_peer_pmtu() will do this.
>
> what about this patch?
In the case where no peer was created and we use default metrics and
other settings (can be common for UDP) I don't think it's wise to make
an inetpeer lookup every time we check the dst.
^ permalink raw reply
* Re: [PATCH 2/5] net/sunrpc: use kstrtoul, etc
From: Alexey Dobriyan @ 2011-11-08 19:38 UTC (permalink / raw)
To: Julia Lawall
Cc: J. Bruce Fields, kernel-janitors-u79uwXL29TY76Z2rM5mHXA,
Neil Brown, Trond Myklebust, David S. Miller,
linux-nfs-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1320586010-21931-3-git-send-email-julia-dAYI7NvHqcQ@public.gmane.org>
On Sun, Nov 06, 2011 at 02:26:47PM +0100, Julia Lawall wrote:
> @@
> expression a,b;
> {int,long} *c;
> @@
>
> -strict_strtoul
> +kstrtoul
No, no, no!
In every case, look at the type of the real data and use the appropriate function.
kstrtou8() for ports.
This program creates lots of bogus patches in this case.
> --- a/net/sunrpc/addr.c
> +++ b/net/sunrpc/addr.c
> @@ -322,7 +322,7 @@ size_t rpc_uaddr2sockaddr(const char *ua
> c = strrchr(buf, '.');
> if (unlikely(c == NULL))
> return 0;
> - if (unlikely(strict_strtoul(c + 1, 10, &portlo) != 0))
> + if (unlikely(kstrtoul(c + 1, 10, &portlo) != 0))
> return 0;
> if (unlikely(portlo > 255))
> return 0;
> @@ -331,7 +331,7 @@ size_t rpc_uaddr2sockaddr(const char *ua
> c = strrchr(buf, '.');
> if (unlikely(c == NULL))
> return 0;
> - if (unlikely(strict_strtoul(c + 1, 10, &porthi) != 0))
> + if (unlikely(kstrtoul(c + 1, 10, &porthi) != 0))
> return 0;
> if (unlikely(porthi > 255))
> return 0;
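The review comment above asks for a parser whose result type matches the data, so that the range check is part of the parse instead of a separate "> 255" test. The following is a minimal userspace sketch of that idea, not kernel code: parse_port_octet() is a hypothetical helper written here for illustration, built on the standard strtoul(), and it only approximates what a u8-typed kernel parser such as kstrtou8() would do (for instance, it does not accept a trailing newline).

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * parse_port_octet() is a hypothetical, illustration-only helper.
 * It parses a base-10 string into a uint8_t and fails on empty input,
 * a leading sign or space, trailing garbage, or a value above 255.
 */
static int parse_port_octet(const char *s, uint8_t *res)
{
	char *end;
	unsigned long val;

	/* strtoul() would accept leading spaces and a sign; refuse them. */
	if (*s < '0' || *s > '9')
		return -EINVAL;

	errno = 0;
	val = strtoul(s, &end, 10);
	if (errno == ERANGE || val > UINT8_MAX)
		return -ERANGE;	/* does not fit in 8 bits */
	if (*end != '\0')
		return -EINVAL;	/* trailing characters after the digits */

	*res = (uint8_t)val;
	return 0;
}
```

With a helper of this shape, a caller would declare portlo and porthi as 8-bit values, and the separate comparison against 255 becomes unnecessary because an out-of-range octet already fails the parse.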
^ permalink raw reply
* Re: [PATCH 2/4] ipv4: Update pmtu informations on inetpeer only for output routes
From: David Miller @ 2011-11-08 19:36 UTC (permalink / raw)
To: steffen.klassert; +Cc: netdev
In-Reply-To: <20111012.170805.2172804476308993385.davem@davemloft.net>
From: David Miller <davem@davemloft.net>
Date: Wed, 12 Oct 2011 17:08:05 -0400 (EDT)
> From: Steffen Klassert <steffen.klassert@secunet.com>
> Date: Tue, 11 Oct 2011 13:10:27 +0200
>
>> @@ -1817,9 +1819,14 @@ static void rt_init_metrics(struct rtable *rt, const struct flowi4 *fl4,
>> if (inet_metrics_new(peer))
>> memcpy(peer->metrics, fi->fib_metrics,
>> sizeof(u32) * RTAX_MAX);
>> - dst_init_metrics(&rt->dst, peer->metrics, false);
>>
>> - check_peer_pmtu(&rt->dst, peer);
>> + dst_init_metrics(dst, peer->metrics, false);
>> + check_peer_pmtu(dst, peer);
>> +
>> + if (rt_is_input_route(rt))
>> + dst_metric_set(dst, RTAX_MTU,
>> + dst->ops->default_mtu(dst));
>> +
>
> You really can't do this, it's going to kill all of the memory savings from
> storing metrics in the inetpeer cache.
>
> Every input route is going to have its metrics COW'd with this change.
>
> The whole idea is to use defaults as heavily as possible, and that's
> the entire reason why the dst->ops->default_mtu() method exists, so
> that we can just leave the values alone and have read-only copies 99%
> of the time.
>
> Please rearrange your fix so that these goals are still achieved.
What I think you can do to solve this problem is explicitly use
dst->ops->default_mtu() in ip_forward() instead of dst_mtu().
That way you won't use the cached PMTU for input routes.
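The suggestion can be illustrated with a small userspace model. This is only a sketch under stated assumptions: struct toy_dst, struct toy_dst_ops and the toy_* functions are hypothetical stand-ins invented here, not the kernel's struct dst_entry or struct dst_ops, and the model assumes that a dst_mtu()-style lookup prefers a cached MTU metric and falls back to the ops default when none is set.

```c
#include <stdint.h>

/* Toy stand-ins; these are NOT the kernel's dst_entry / dst_ops types. */
struct toy_dst;

struct toy_dst_ops {
	uint32_t (*default_mtu)(const struct toy_dst *dst);
};

struct toy_dst {
	const struct toy_dst_ops *ops;
	uint32_t dev_mtu;	/* MTU of the outgoing device */
	uint32_t cached_pmtu;	/* learned path MTU, 0 if none is cached */
};

static uint32_t toy_default_mtu(const struct toy_dst *dst)
{
	return dst->dev_mtu;
}

static const struct toy_dst_ops toy_ops = {
	.default_mtu = toy_default_mtu,
};

/* Models a dst_mtu()-style lookup: use the cached metric if present. */
static uint32_t toy_dst_mtu(const struct toy_dst *dst)
{
	return dst->cached_pmtu ? dst->cached_pmtu : dst->ops->default_mtu(dst);
}

/* Models the suggested forwarding check: always ask for the default. */
static uint32_t toy_forward_mtu(const struct toy_dst *dst)
{
	return dst->ops->default_mtu(dst);
}
```

In this model the forwarding path never reads the cached value, so nothing has to be written into the shared metrics for input routes and they can stay read-only.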
^ permalink raw reply
* Re: [PATCH 1/4] ipv4: Fix pmtu propagating
From: David Miller @ 2011-11-08 19:33 UTC (permalink / raw)
To: gaofeng; +Cc: steffen.klassert, netdev
In-Reply-To: <20111108.141950.1576262997640523669.davem@davemloft.net>
From: David Miller <davem@davemloft.net>
Date: Tue, 08 Nov 2011 14:19:50 -0500 (EST)
> I suspect that your real problem has nothing to do with UDP or RAW,
> but rather the issue is that entries already in the routing cache
> with a NULL peer need to be refreshed with peer information created
> in another context.
So you want something like this patch:
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 155138d..2966631 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1677,12 +1677,8 @@ static int check_peer_redir(struct dst_entry *dst, struct inet_peer *peer)
return 0;
}
-static struct dst_entry *ipv4_dst_check(struct dst_entry *dst, u32 cookie)
+static struct rtable *ipv4_validate_peer(struct rtable *rt)
{
- struct rtable *rt = (struct rtable *) dst;
-
- if (rt_is_expired(rt))
- return NULL;
if (rt->rt_peer_genid != rt_peer_genid()) {
struct inet_peer *peer;
@@ -1691,17 +1687,27 @@ static struct dst_entry *ipv4_dst_check(struct dst_entry *dst, u32 cookie)
peer = rt->peer;
if (peer) {
- check_peer_pmtu(dst, peer);
+ check_peer_pmtu(&rt->dst, peer);
if (peer->redirect_learned.a4 &&
peer->redirect_learned.a4 != rt->rt_gateway) {
- if (check_peer_redir(dst, peer))
+ if (check_peer_redir(&rt->dst, peer))
return NULL;
}
}
rt->rt_peer_genid = rt_peer_genid();
}
+ return rt;
+}
+
+static struct dst_entry *ipv4_dst_check(struct dst_entry *dst, u32 cookie)
+{
+ struct rtable *rt = (struct rtable *) dst;
+
+ if (rt_is_expired(rt))
+ return NULL;
+ dst = (struct dst_entry *) ipv4_validate_peer(rt);
return dst;
}
@@ -2349,6 +2355,9 @@ int ip_route_input_common(struct sk_buff *skb, __be32 daddr, __be32 saddr,
rth->rt_mark == skb->mark &&
net_eq(dev_net(rth->dst.dev), net) &&
!rt_is_expired(rth)) {
+ rth = ipv4_validate_peer(rth);
+ if (!rth)
+ continue;
if (noref) {
dst_use_noref(&rth->dst, jiffies);
skb_dst_set_noref(skb, &rth->dst);
@@ -2724,6 +2733,9 @@ struct rtable *__ip_route_output_key(struct net *net, struct flowi4 *flp4)
(IPTOS_RT_MASK | RTO_ONLINK)) &&
net_eq(dev_net(rth->dst.dev), net) &&
!rt_is_expired(rth)) {
+ rth = ipv4_validate_peer(rth);
+ if (!rth)
+ continue;
dst_use(&rth->dst, jiffies);
RT_CACHE_STAT_INC(out_hit);
rcu_read_unlock_bh();
^ permalink raw reply related
* Re: [PATCH] net: min_pmtu default is 552
From: David Miller @ 2011-11-08 19:21 UTC (permalink / raw)
To: eric.dumazet; +Cc: netdev
In-Reply-To: <1320779428.2588.2.camel@edumazet-laptop>
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 08 Nov 2011 20:10:28 +0100
> Small fix in Documentation, since min_pmtu is 512 + 20 + 20 = 552
>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Applied, thanks.
^ permalink raw reply
* Re: [PATCH 1/4] ipv4: Fix pmtu propagating
From: David Miller @ 2011-11-08 19:19 UTC (permalink / raw)
To: gaofeng; +Cc: steffen.klassert, netdev
In-Reply-To: <20111019.153208.1743995025598236810.davem@davemloft.net>
Steffen, I looked at this specific patch again.
The peer should be found and loaded up, and the PMTU propagated, when
the route cache entry is created. Specifically rt_init_metrics() will
find any existing peer, and do the whole check_peer_pmtu() sequence
that ipv4_dst_check() does.
If we have some issue with UDP or RAW caching a route past the first
use after the route lookup, yes we have to introduce a dst_check()
call somewhere.
But unilaterally doing this on every CORK setup seems excessive at the
very least, because CORK setup happens even for single sends. One
such example is ip_make_skb().
ip_make_skb() is used by, for example, udp_sendmsg() when corkreq is false.
And in this case 'rt' is used immediately after being looked up
in udp_sendmsg().
And, as far as I can tell, routines like udp_sendmsg() in fact already
handle the "cached route across multiple sendmsg() calls" case too.
Specifically, udp_sendmsg() does this:
if (connected)
rt = (struct rtable *)sk_dst_check(sk, 0);
otherwise it makes a completely fresh route lookup.
RAW sendmsg unconditionally makes a fresh route lookup on every
sendmsg call. So it should be OK too.
So I really fail to see the problematic case. Therefore, if it exists
you'll have to give me an exact sequence of events that leads to the
problem.
I suspect that your real problem has nothing to do with UDP or RAW,
but rather the issue is that entries already in the routing cache
with a NULL peer need to be refreshed with peer information created
in another context.
^ permalink raw reply