From: Travis Stratman
Subject: data received but not detected
Date: Tue, 17 Jun 2008 17:08:58 -0500
To: netdev@vger.kernel.org
Message-ID: <1213740538.5771.192.camel@localhost.localdomain>

Hello,

(I sent this earlier today but it doesn't look like it made it to the list; I apologize if it comes through multiple times.)

I am working on an application that uses a fairly simple UDP protocol to send data between two embedded devices. In an initial test I'm seeing an issue where datagrams are received by the interface but are not seen by the recvfrom() call until more data arrives after them. The test case currently implements no lost-packet protection or other flow control, which is what makes the issue so noticeable.

The target for this code is a board using the Atmel AT91SAM9260 ARM processor. I have tested with 2.6.20 and 2.6.25 on this board.

The test consists of two applications with the following pseudo code (msg_size = 127; 9003/9005 are the UDP ports used; a compilable sketch of the client loop is included below):

"client app"
while (1) {
        sendto(9003, &msg_size, 4 bytes);
        sendto(9003, buffer, msg_size);
        recvfrom(9005, &msg_size, 4 bytes);
        recvfrom(9005, buffer, msg_size);
}

"server app"
while (1) {
        recvfrom(9003, &msg_size, 4 bytes);
        recvfrom(9003, buffer, msg_size);
        sendto(9005, &msg_size, 4 bytes);
        sendto(9005, buffer, msg_size);
}

As long as the server is started first and no packets are lost or reordered, the client and server should continue indefinitely.

When run between two boards on a local gigabit switch, the applications run smoothly most of the time, but I periodically see delays of 30 seconds or more where one side is waiting for the second datagram to arrive before it sends the next pair. Wireshark shows that the second datagram was sent very shortly after the first, and no packets are ever lost; ifconfig reports no collisions, overruns, or errors.

When I run the applications between two identical boards over a cross-over cable, data is transferred for a few seconds, after which everything freezes until I send a ping between the two boards in the background. That forces communication to start up again for a few seconds before it hangs again.

If I insert a delay between the sendto() calls with usleep(1) (CONFIG_HZ is 100, so this could be up to 10 ms), everything seems to work. Using a busy loop I was able to determine that a delay of approximately 500 us is required to "fix" the issue, but even then I saw one hang in several hours of testing.

At first I thought this was the "rotting packet" case described in the NAPI documentation, where an Rx IRQ is missed, so I rewrote the poll function in the macb driver to try to address it, but I didn't see any noticeable difference. If I enable debugging in the macb driver, it slows things down enough to make everything work.

Next, I tested on a Cirrus ep93xx based board (with 2.6.20) and a 133 MHz x86 board (with 2.6.14.7) and saw the same issue when running between the target and my PC. When run between my 2.6.23 2 GHz PC and another similar PC (both with Intel NICs), the issue does not show up. I also tested over the local loopback, and everything worked as expected.
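For reference, here is a minimal compilable version of the client loop. It is a simplified sketch rather than the exact test code: the server address is taken from the command line, the payload is a fixed pattern, and most error handling on sendto()/recvfrom() is trimmed, but it shows the structure of the test.

/*
 * Simplified sketch of the client loop above (not the exact test code):
 * the server IP comes from argv[1], the payload is a fixed pattern,
 * and error handling on the send/receive calls is omitted.
 */
#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

#define SERVER_PORT 9003        /* server receives on this port    */
#define CLIENT_PORT 9005        /* client receives replies here    */
#define MSG_SIZE    127

int main(int argc, char **argv)
{
        struct sockaddr_in local, server;
        uint32_t msg_size = MSG_SIZE;
        char buffer[MSG_SIZE];
        int sock;

        if (argc < 2) {
                fprintf(stderr, "usage: %s <server-ip>\n", argv[0]);
                return 1;
        }

        sock = socket(AF_INET, SOCK_DGRAM, 0);
        if (sock < 0) {
                perror("socket");
                return 1;
        }

        /* Bind to port 9005 so the server's replies reach us. */
        memset(&local, 0, sizeof(local));
        local.sin_family = AF_INET;
        local.sin_port = htons(CLIENT_PORT);
        local.sin_addr.s_addr = htonl(INADDR_ANY);
        if (bind(sock, (struct sockaddr *)&local, sizeof(local)) < 0) {
                perror("bind");
                return 1;
        }

        memset(&server, 0, sizeof(server));
        server.sin_family = AF_INET;
        server.sin_port = htons(SERVER_PORT);
        if (inet_pton(AF_INET, argv[1], &server.sin_addr) != 1) {
                fprintf(stderr, "bad address: %s\n", argv[1]);
                return 1;
        }

        memset(buffer, 'x', sizeof(buffer));

        while (1) {
                /* send the 4-byte length, then the payload */
                sendto(sock, &msg_size, sizeof(msg_size), 0,
                       (struct sockaddr *)&server, sizeof(server));
                sendto(sock, buffer, msg_size, 0,
                       (struct sockaddr *)&server, sizeof(server));

                /* wait for the echoed length, then the echoed payload */
                recvfrom(sock, &msg_size, sizeof(msg_size), 0, NULL, NULL);
                recvfrom(sock, buffer, sizeof(buffer), 0, NULL, NULL);
        }

        return 0;
}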
I would very much appreciate any suggestions that anyone could give to point me in the right direction.

Thanks in advance,

Travis