From mboxrd@z Thu Jan 1 00:00:00 1970
From: "Zhang, Yanmin"
Subject: Re: IPF Montvale machine panic when running a network-relevant testing
Date: Wed, 18 Jun 2008 11:27:43 +0800
Message-ID: <1213759663.25608.33.camel@ymzhang>
References: <1213345160.25608.3.camel@ymzhang>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: QUOTED-PRINTABLE
Cc: LKML, Linux-IA64
To: netdev@vger.kernel.org, David Miller
Return-path:
In-Reply-To: <1213345160.25608.3.camel@ymzhang>
Sender: linux-ia64-owner@vger.kernel.org
List-Id: netdev.vger.kernel.org

On Fri, 2008-06-13 at 16:19 +0800, Zhang, Yanmin wrote:
> With kernel 2.6.26-rc5 and a git kernel just between rc4 and rc5, my
> kernel panicked on my Montvale machine when I ran an initial specweb2005
> test between 2 machines.
>
> Below is the log.
>
> LOGIN: Unable to handle kernel NULL pointer dereference (address 0000000000000000)
> Thread-7266[13494]: Oops 8804682956800 [1]
> Modules linked in:
>
> Pid: 13494, CPU 0, comm: Thread-7266
> psr : 0000101008026018 ifs : 800000000000050e ip : [] Not tainted (2.6.26-rc4git)
> ip is at tcp_rcv_established+0x1450/0x16e0
> unat: 0000000000000000 pfs : 000000000000050e rsc : 0000000000000003
> rnat: 0000000000000000 bsps: 0000000000000000 pr  : 000000000059656b
> ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a70033f
> csd : 0000000000000000 ssd : 0000000000000000
> b0  : a00000010087a410 b6  : a0000001004c7ac0 b7  : a0000001004c64e0
> f6  : 000000000000000000000 f7  : 1003e0000000000000b80
> f8  : 10000821f080500000000 f9  : 1003efffffffffffffa58
> f10 : 1003edbb7db5f6be58df8 f11 : 1003e0000000000000015
> r1  : a0000001010cce90 r2  : e0000003d4530c40 r3  : 0000000000000105
> r8  : e000000402533d68 r9  : e000000402533a80 r10 : e000000402533bfc
> r11 : 0000000000000004 r12 : e0000003d4537df0 r13 : e0000003d4530000
> r14 : 0000000000000000 r15 : e000000401fca180 r16 : e0000003d4530c68
> r17 : e000000402572238 r18 : 00000000000000ff r19 : a0000001012c6630
> r20 : e0000003d4530c68 r21 : e000000401fca480 r22 : e000000402572658
> r23 : e000000402572240 r24 : a0000001012c4e04 r25 : 0000000000000003
> r26 : e000000401fca4a8 r27 : e000000402572660 r28 : e00000040a2d2a00
> r29 : e00000040a6f83a8 r30 : e00000040a6f8300 r31 : 000000000000000a
>
> Call Trace:
>  [] show_stack+0x40/0xa0
>                                 sp=e0000003d45379c0 bsp=e0000003d4531440
>  [] show_regs+0x850/0x8a0
>                                 sp=e0000003d4537b90 bsp=e0000003d45313e0
>  [] die+0x230/0x360
>                                 sp=e0000003d4537b90 bsp=e0000003d4531398
>  [] ia64_do_page_fault+0x8e0/0xa40
>                                 sp=e0000003d4537b90 bsp=e0000003d4531348
>  [] ia64_leave_kernel+0x0/0x280
>                                 sp=e0000003d4537c20 bsp=e0000003d4531348
>  [] tcp_rcv_established+0x1450/0x16e0
>                                 sp=e0000003d4537df0 bsp=e0000003d45312d8
>  [] tcp_v4_do_rcv+0x70/0x500
>                                 sp=e0000003d4537df0 bsp=e0000003d4531298
>  [] tcp_v4_rcv+0xfb0/0x1060
>                                 sp=e0000003d4537e00 bsp=e0000003d4531248
>
>
> As a matter of fact, the kernel panicked at the statement
> "queue->rskq_accept_tail->dl_next = req" in function reqsk_queue_add,
> because queue->rskq_accept_tail is NULL. The call chain is:
> tcp_rcv_established => inet_csk_reqsk_queue_add => reqsk_queue_add.

Correction: the full call chain is tcp_rcv_established => tcp_defer_accept_check
=> inet_csk_reqsk_queue_add => reqsk_queue_add.

> As I was running an initial specweb2005 (configured with 3500 sessions) test
> between 2 machines, there were lots of failures, and many network connections
> were reestablished during the testing.
>
> In function tcp_v4_rcv, bh_lock_sock_nested(sk) (a spinlock) is used to
> avoid races. But inet_csk_accept uses lock_sock(sk) (a sleeping lock).
> Although lock_sock also accesses sk->sk_lock.slock, it looks like there is
> a race.

This issue is caused by TCP deferred accept. Normally, process context calls
lock_sock to take a sleeping lock.
BH (softirq) context calls bh_lock_sock(_nested) to take just
sk->sk_lock.slock without sleeping, and then acts depending on whether
sk->sk_lock.owned == 0. That works well when process context and BH context
operate on the same sk at the same time. But with TCP deferred accept it
doesn't, because process context (for example, in inet_csk_accept) locks the
listen sk, while BH context (in tcp_v4_rcv, for example) locks the child sk and
then calls tcp_defer_accept_check => inet_csk_reqsk_queue_add =>
reqsk_queue_add, so the listen sock's accept queue is accessed with no common
lock held.

The patch below, against 2.6.26-rc6, fixes the issue.

Signed-off-by: Zhang Yanmin

---

--- linux-2.6.26-rc6/net/ipv4/inet_connection_sock.c	2008-06-17 12:26:50.000000000 +0800
+++ linux-2.6.26-rc6_tcp/net/ipv4/inet_connection_sock.c	2008-06-17 16:41:07.000000000 +0800
@@ -257,7 +257,10 @@ struct sock *inet_csk_accept(struct sock
 		goto out_err;
 	}
 
+	lock_sock_bh(sk);
 	newsk = reqsk_queue_get_child(&icsk->icsk_accept_queue, sk);
+	unlock_sock_bh(sk);
+
 	BUG_TRAP(newsk->sk_state != TCP_SYN_RECV);
 out:
 	release_sock(sk);
@@ -602,7 +605,9 @@ void inet_csk_listen_stop(struct sock *s
 	inet_csk_delete_keepalive_timer(sk);
 
 	/* make all the listen_opt local to us */
+	lock_sock_bh(sk);
 	acc_req = reqsk_queue_yank_acceptq(&icsk->icsk_accept_queue);
+	unlock_sock_bh(sk);
 
 	/* Following specs, it would be better either to send FIN
 	 * (and enter FIN-WAIT-1, it is normal close)
--- linux-2.6.26-rc6/net/ipv4/tcp_input.c	2008-06-17 12:26:50.000000000 +0800
+++ linux-2.6.26-rc6_tcp/net/ipv4/tcp_input.c	2008-06-17 16:43:35.000000000 +0800
@@ -4554,6 +4554,8 @@ static int tcp_defer_accept_check(struct
 	if (queued_data && hasfin)
 		queued_data--;
 
+	bh_lock_sock_nested(tp->defer_tcp_accept.listen_sk);
+
 	if (queued_data &&
 	    tp->defer_tcp_accept.listen_sk->sk_state == TCP_LISTEN) {
 		if (sock_flag(sk, SOCK_KEEPOPEN)) {
@@ -4568,6 +4570,8 @@ static int tcp_defer_accept_check(struct
 					tp->defer_tcp_accept.request, sk);
 
+			bh_unlock_sock(tp->defer_tcp_accept.listen_sk);
+
 			tp->defer_tcp_accept.listen_sk->sk_data_ready(
 					tp->defer_tcp_accept.listen_sk, 0);
 
@@ -4577,6 +4581,7 @@ static int tcp_defer_accept_check(struct
 		tp->defer_tcp_accept.request = NULL;
 	} else if (hasfin ||
 		   tp->defer_tcp_accept.listen_sk->sk_state != TCP_LISTEN) {
+		bh_unlock_sock(tp->defer_tcp_accept.listen_sk);
 		tcp_reset(sk);
 		return -1;
 	}
--- linux-2.6.26-rc6/include/net/sock.h	2008-06-17 12:26:50.000000000 +0800
+++ linux-2.6.26-rc6_tcp/include/net/sock.h	2008-06-17 16:43:58.000000000 +0800
@@ -827,6 +827,10 @@ static inline void lock_sock(struct sock
 
 extern void release_sock(struct sock *sk);
 
+/* Process context needs the two interfaces below. */
+#define lock_sock_bh(__sk)	spin_lock_bh(&((__sk)->sk_lock.slock))
+#define unlock_sock_bh(__sk)	spin_unlock_bh(&((__sk)->sk_lock.slock))
+
 /* BH context may only use the following locking interface. */
#define bh_lock_sock(__sk)	spin_lock(&((__sk)->sk_lock.slock))
 #define bh_lock_sock_nested(__sk)	\
--
To unsubscribe from this list: send the line "unsubscribe linux-ia64" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html