From mboxrd@z Thu Jan 1 00:00:00 1970 From: Stephen Warren Date: Mon, 09 Feb 2015 10:38:52 -0700 Subject: [U-Boot] netconsole: USB Ethernet connection dropping with ping or tftpboot In-Reply-To: <1423430724.2889.41.camel@posteo.de> References: <1422999858.2688.14.camel@posteo.de> <1423135265.26058.11.camel@posteo.de> <54D38D2E.50102@wwwdotorg.org> <1423174204.1232.25.camel@posteo.de> <54D3ED62.7030208@wwwdotorg.org> <1423184798.1232.63.camel@posteo.de> <54D502BF.7070401@wwwdotorg.org> <1423430724.2889.41.camel@posteo.de> Message-ID: <54D8F0AC.20103@wwwdotorg.org> List-Id: MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit To: u-boot@lists.denx.de On 02/08/2015 02:25 PM, J?rg Krause wrote: > On Fr, 2015-02-06 at 11:06 -0700, Stephen Warren wrote: >> On 02/05/2015 06:06 PM, J?rg Krause wrote: >>> On Do, 2015-02-05 at 15:23 -0700, Stephen Warren wrote: >>>> >>>> b) In ci_bounce(), the bounce buffer is only allocated if the >>>> user-buffer is already aligned, and if a large-enough bounce buffer >>>> wasn't previously allocated. If ci_req->b_buf was uninitialized it could >>>> be non-zero (thus preventing the expected aligned allocation) yet not >>>> actually aligned enough. >>> >>> I can reproduce this issue now. After some "timeout sending packets to >>> usb ethernet" messages, the bounce buffer somehow gets corrupted. >>> ci_bounce() is called with an unaligned input buffer length >>> 'req->length=66', but the bounce buffer length >>> 'ci_req->b_len=1140305940' or in hex 'ci_req->b_len=0x43f7b014'. This >>> bounce buffer length is obviously an address, as the following >>> misaligned error message shows: "CACHE: Misaligned operation at range >>> [43f7b010, 43f7b070]". >> >> Ah, I hadn't realized that was [start, length] rather than [start, end]. >> >> The question is: How is ci_req->b_len getting corrupted? Is it simply >> never initialized, or does something trash that value later? >> >> ci_ep_alloc_request() appears to calloc() the whole struct ci_req, so I >> imagine an initialization/allocating error isn't happening. >> >> The only issue there might be some code somehow creating its own struct >> usb_request instead of calling into the controller's ->alloc_request() >> function. I vaguely recall fixing some of those, but might have missed >> some in protocols that I didn't test (i.e. anything other than USB Mass >> Storage or DFU, although I might have very briefly tested netconsole once?). >> >> I would suggest adding a whole ton of printfs() to catch where ci_reqs >> are being allocated, and where ci_req->b_len is getting written in which >> ci_req objects, and then mapping that back to the ci_req that the cache >> alignment error message complains about. Sorry, this will be a bit painful. >> >> If the ci_req is always at the same address on different boots of the >> code, that will make it easier, especially if you have a debugger with a >> data watchpoint, or can write some code to use any data watchpoint >> self-hosted debug capability in your CPU. > > I think I found the answer. > > I used a lot of debug messages and tried to understand the involved > drivers. I will try to sum up my investigations. > > NetLoop is entered first for the ping protocol and then for the > netconsole protocol: > --- NetLoop Entry (PING) > --- NetLoop Entry (NETCONSOLE) > > ci_udc is probed and ci_ep_alloc_request() is called for EP0: > ci_ep_alloc_request usb_ep:0x43fd0028 ci_req:0x43b83220 > > In eth_bind() ci_ep_alloc_request() is called for status EP: > ci_ep_alloc_request usb_ep:0x43fd00a0 ci_req:0x43b7b4a8 > > Everything is fine: > using ci_udc, OUT ep- IN ep- STATUS ep- > MAC 00:19:b8:00:00:02 > HOST MAC 00:19:b8:00:00:01 > > In eth_set_config() ci_ep_alloc_request() is invoked again for tx_req > and rx_req: > ci_ep_alloc_request usb_ep:0x43fd0050 ci_req:0x43b7b568 > ci_ep_alloc_request usb_ep:0x43fd0078 ci_req:0x43b7b5b0 > > Everything is fine: > high speed config #1: 2 mA, Ethernet Gadget, using CDC Ethernet > USB network up! > > Now the NetLoop is available for netconsole: > --- NetState set to 0 (CONTINUE) > --- NetLoop Init (NETCONSOLE) > --- NetLoop ARP handler set (43fa3b14) > [..] > Got ARP REPLY, set eth addr (00:19:b8:00:00:01) > --- NetState set to 2 (SUCCESS) > --- NetLoop UDP handler set (00000000) > --- NetLoop ARP handler set (00000000) > --- NetLoop timeout handler cancelled > --- NetLoop Success! (NETCONSOLE) > > Now it's pings turn in the NetLoop. The if (eth_is_on_demand_init()) > branch is executed. eth_halt() and later eth_disconnect() and > eth_reset_config() are invoked. There the in and out EPs are freed: > ci_ep_free_request ci_req:0x43b7b568 > ci_ep_free_request ci_req:0x43b7b5b0 That sounds like a problem. I would assume that once netconsole is enabled, the Ethernet connection is no longer on-demand, but should be initialized once, and then never taken down. Take a look at the definition of eth_is_on_demand_init() in include/net.h. It has a special case for CONFIG_NETCONSOLE, based on variable net_loop_last_protocol. Perhaps that isn't getting set correctly, or perhaps the variable is getting set to something other than netconsole, thus causing the connection to get taken down? > However, netconsole tries to send every printf() as an UDP packet to the > host. > sending UDP to 10.0.0.1/00:19:b8:00:00:01 > > But the usb_request [1] pointer in usb_eth_send is NULL after the free > request: > usb_eth_send usb_request: 00000000 length: 82 > > usb_eth_send() calls ci_ep_queue() which calls ci_bounce() and because > of the corrupted ci_req pointer I get a misaligned cache. > > So, as far as I understand, the main problem is that netconsole does not > knows that the connection is disconnected by the NetLoop? > > What are your suggestions to this issue? Should we add an addition check > for an initialized usb_request pointer in usb_eth_send()? Or somewhere > else? It might be a useful debugging/validation aid to add a check somewhere for valid non-free struct usb_requests. However, it sounds like that's just a symptom; the underlying problem is that the Ethernet connection is taken down at all, while netconsole is active.