From mboxrd@z Thu Jan 1 00:00:00 1970
From: Joanna Rutkowska
Subject: The strange case of xen_netback not returning ARP replies
Date: Wed, 16 May 2012 14:18:27 +0200
Message-ID: <4FB39B13.70707@invisiblethingslab.com>
To: "xen-devel@lists.xensource.com"
Cc: Marek Marczykowski
List-Id: xen-devel@lists.xenproject.org

Hello,

I'm facing a rather strange problem with the netback interface.

My setup involves a netvm, which has some physical network interfaces assigned, and a client VM where a netfront is running (exposed as eth0) and which is connected to that netvm (via the vif42.0 interface, as seen in the netvm in the dumps below).

Now, the netvm has two physical network interfaces assigned:

1) A standard Intel AGN (iwlwifi module, interface wlan0) -- this is just an assigned PCI device.

2) A USB 3G modem (cdc_ncm module, interface usb0) -- this has been made available to the netvm by assigning the whole USB controller to which the 3G modem is connected. This works fine.

We do NAT in the netvm for the traffic coming in on vif* and send it out through the default outgoing interface, e.g. wlan0.

Now, as long as I use wlan0 for networking, all works great. I've been using this setup for years, no problem here.

However, when I switch to usb0 as the default outgoing interface in the netvm, something strange happens.
The networking works fine via usb0 for some time (a few minutes typically), yet suddenly, after enough packets have been exchanged, the networking stops working. When I run tcpdump on the vif* interface, I can see that suddenly there is nobody (in the netvm) to reply to the ARP requests from the client VM (the client VM has Xen ID = 42 in this dump, IP = .5, and gateway = .1):

[root@netvm user]# tcpdump -ni vif42.0 arp
tcpdump: WARNING: vif42.0: no IPv4 address assigned
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on vif42.0, link-type EN10MB (Ethernet), capture size 65535 bytes
13:41:55.031819 ARP, Request who-has 10.137.1.1 tell 10.137.1.5, length 28
13:41:56.031860 ARP, Request who-has 10.137.1.1 tell 10.137.1.5, length 28
13:41:57.031794 ARP, Request who-has 10.137.1.1 tell 10.137.1.5, length 28
13:41:59.287308 ARP, Request who-has 10.137.1.1 tell 10.137.1.5, length 28
13:42:00.283853 ARP, Request who-has 10.137.1.1 tell 10.137.1.5, length 28
13:42:01.283816 ARP, Request who-has 10.137.1.1 tell 10.137.1.5, length 28
13:42:03.231324 ARP, Request who-has 10.137.1.1 tell 10.137.1.5, length 28

... and this now continues without end.
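A small sketch of how a capture like this could be checked automatically. The helper `unanswered_requests` below is hypothetical (it is not part of tcpdump or of the setup described here): it walks tcpdump ARP lines and reports every who-has request that is followed by another request for the same address, or by the end of the capture, without a reply in between. The pairing rule is an assumption made for illustration only.

```python
import re

# Patterns for the two tcpdump ARP line shapes used in this message, e.g.
#   "13:39:00.215883 ARP, Request who-has 10.137.1.1 tell 10.137.1.5, length 28"
#   "13:39:00.215911 ARP, Reply 10.137.1.1 is-at fe:ff:ff:ff:ff:ff, length 28"
REQUEST = re.compile(r"^(\S+) ARP, Request who-has (\S+) tell (\S+?),")
REPLY = re.compile(r"^(\S+) ARP, Reply (\S+) is-at (\S+?),")


def unanswered_requests(lines):
    """Return (timestamp, target_ip) for each request that never got a reply.

    Assumption: a request counts as unanswered if a new request for the
    same target IP, or the end of the capture, arrives before any reply.
    """
    pending = {}      # target IP -> timestamp of the request still waiting
    unanswered = []
    for line in lines:
        m = REQUEST.match(line)
        if m:
            ts, target, _sender = m.groups()
            if target in pending:
                # The previous request for this IP was never answered.
                unanswered.append((pending[target], target))
            pending[target] = ts
            continue
        m = REPLY.match(line)
        if m:
            _ts, ip, _mac = m.groups()
            pending.pop(ip, None)  # the reply settles the waiting request
    # Anything still pending at the end of the capture is unanswered too.
    unanswered.extend((ts, ip) for ip, ts in pending.items())
    return unanswered
```

Fed the lines printed by `tcpdump -ni vif42.0 arp`, this would return an empty list while replies keep arriving and a growing list once the netvm stops answering.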
For comparison, this is how it looks when I use networking via wlan0:

[root@netvm user]# tcpdump -ni vif42.0 arp
tcpdump: WARNING: vif42.0: no IPv4 address assigned
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on vif42.0, link-type EN10MB (Ethernet), capture size 65535 bytes
13:39:00.215883 ARP, Request who-has 10.137.1.1 tell 10.137.1.5, length 28
13:39:00.215911 ARP, Reply 10.137.1.1 is-at fe:ff:ff:ff:ff:ff, length 28
13:39:21.799844 ARP, Request who-has 10.137.1.1 tell 10.137.1.5, length 28
13:39:21.799869 ARP, Reply 10.137.1.1 is-at fe:ff:ff:ff:ff:ff, length 28

We can see that every once in a while an ARP request for 10.137.1.1 appears (the gateway for the client VM, so the netvm), yet it is immediately answered (by the netvm, as I understand).

Now, this behavior seems really strange, because:

1) AFAIU, the ARP replies are/should be generated by the netback interface in the netvm (vif*).

2) It shouldn't matter to the netback code how the packets are later routed (via wlan0 vs. usb0) in order to provide this (dummy) ARP response?

3) ...yet, for some reason, when packets are later routed through usb0, the netback is not willing to generate the ARP response?

Or am I misunderstanding this, and it is somebody else who is generating the ARP responses? The final NIC?

Some additional notes:

1) We make sure to set /proc/sys/net/ipv4/conf/vif*/proxy_arp to 1.

2) When this "ARP hang" happens, the networking (via usb0) is still working fine in the netvm (i.e. I can do ping google.com from the netvm).

3) This has been tested on various VM kernels (in the netvm): 3.0.4, 3.2.7, and 3.3.5 -- all exhibit the same behavior.
4) Nothing spectacular in the logs of the netvm; however, I can often see this warning in the *client* VM:

[ 1257.228761] ------------[ cut here ]------------
[ 1257.228767] WARNING: at /home/user/qubes-src/kernel/kernel-3.3.5/linux-3.3.5/fs/sysfs/file.c:498 sysfs_attr_ns+0x93/0xa0()
[ 1257.228776] sysfs: kobject eth0 without dirent
[ 1257.228780] Modules linked in: iptable_raw bnep bluetooth rfkill ipt_MASQUERADE ipt_REJECT xt_state xt_tcpudp xen_netback iptable_filter iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip_tables x_tables xen_netfront microcode pcspkr u2mfn(O) xen_blkback xen_evtchn autofs4 ext4 jbd2 crc16 dm_snapshot xen_blkfront [last unloaded: scsi_wait_scan]
[ 1257.228819] Pid: 11, comm: xenwatch Tainted: G W O 3.3.5-1.pvops.qubes.x86_64 #1
[ 1257.228825] Call Trace:
[ 1257.228830] [] warn_slowpath_common+0x7a/0xb0
[ 1257.228836] [] warn_slowpath_fmt+0x41/0x50
[ 1257.228842] [] ? lock_timer_base+0x37/0x70
[ 1257.228850] [] sysfs_attr_ns+0x93/0xa0
[ 1257.228856] [] sysfs_remove_file+0x1f/0x40
[ 1257.228862] [] device_remove_file+0x12/0x20
[ 1257.228870] [] xennet_remove+0x84/0xac [xen_netfront]
[ 1257.228875] [] xenbus_dev_remove+0x42/0xa0
[ 1257.228881] [] __device_release_driver+0x77/0xd0
[ 1257.228887] [] device_release_driver+0x28/0x40
[ 1257.228895] [] bus_remove_device+0x10f/0x180
[ 1257.228901] [] device_del+0x118/0x1c0
[ 1257.228906] [] device_unregister+0x1d/0x60
[ 1257.228914] [] xenbus_dev_changed+0x96/0x1b0
[ 1257.228920] [] frontend_changed+0x24/0x50
[ 1257.228926] [] xenwatch_thread+0xb1/0x170
[ 1257.228933] [] ? wake_up_bit+0x40/0x40
[ 1257.228939] [] ? xenbus_thread+0x40/0x40
[ 1257.228944] [] kthread+0x96/0xa0
[ 1257.228951] [] kernel_thread_helper+0x4/0x10
[ 1257.228959] [] ? retint_restore_args+0x5/0x6
[ 1257.228964] [] ? gs_change+0x13/0x13
[ 1257.228968] ---[ end trace 75286ef58ce0391f ]---

But this seems rather irrelevant, as it looks like it is the netvm that is failing here, i.e.
it doesn't generate ARP responses?

I would appreciate any help with this issue!

Thanks,
joanna.
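A minimal sketch for re-checking the proxy_arp setting from the notes at the moment the hang occurs. It only reads the /proc/sys/net/ipv4/conf/vif*/proxy_arp files mentioned in this message; the function name `proxy_arp_status` and the `conf_root` parameter are hypothetical, introduced here so the check can also be pointed at a test directory.

```python
import glob
import os


def proxy_arp_status(conf_root="/proc/sys/net/ipv4/conf"):
    """Map each vif* interface to True if its proxy_arp flag reads "1".

    conf_root is a hypothetical parameter (not part of any tool); it
    defaults to the sysctl directory named in this message.
    """
    status = {}
    pattern = os.path.join(conf_root, "vif*", "proxy_arp")
    for path in sorted(glob.glob(pattern)):
        # The interface name is the directory that holds the proxy_arp file.
        iface = os.path.basename(os.path.dirname(path))
        with open(path) as f:
            status[iface] = f.read().strip() == "1"
    return status


if __name__ == "__main__":
    for iface, enabled in proxy_arp_status().items():
        print(f"{iface}: proxy_arp {'on' if enabled else 'OFF'}")
```

Run in the netvm, this would show whether the flag is still set on vif42.0 once the ARP replies stop, which separates a lost setting from a problem elsewhere in the reply path.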