From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <4CD14B1E.4000707@domain.hid>
Date: Wed, 03 Nov 2010 12:44:30 +0100
From: Anders Blomdell
MIME-Version: 1.0
References: <4CC82C8D.3080808@domain.hid> <4CC84327.9070202@domain.hid>
 <4CC92786.3030509@domain.hid> <4CC92902.4040904@domain.hid>
 <4CC943A2.9020806@domain.hid> <4CC94E0B.9070106@domain.hid>
 <4CCEF104.7050409@domain.hid> <4CD11AB1.8090407@domain.hid>
 <4CD13A70.8040702@domain.hid>
In-Reply-To: <4CD13A70.8040702@domain.hid>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Subject: Re: [Xenomai-core] Potential problem with rt_eepro100
List-Id: Xenomai life and development
To: Jan Kiszka
Cc: xenomai@xenomai.org

Anders Blomdell wrote:
> Jan Kiszka wrote:
>> On 01.11.2010 17:55, Anders Blomdell wrote:
>>> Jan Kiszka wrote:
>>>> On 28.10.2010 11:34, Anders Blomdell wrote:
>>>>> Jan Kiszka wrote:
>>>>>> On 28.10.2010 09:34, Anders Blomdell wrote:
>>>>>>> Anders Blomdell wrote:
>>>>>>>> Anders Blomdell wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I'm trying to use rt_eepro100 for sending raw ethernet
>>>>>>>>> packets, but I'm occasionally experiencing weird behaviour.
>>>>>>>>>
>>>>>>>>> Versions of things:
>>>>>>>>>
>>>>>>>>> linux-2.6.34.5
>>>>>>>>> xenomai-2.5.5.2
>>>>>>>>> rtnet-39f7fcf
>>>>>>>>>
>>>>>>>>> The test program runs on two computers with an "Intel
>>>>>>>>> Corporation 82557/8/9/0/1 Ethernet Pro 100 (rev 08)"
>>>>>>>>> controller, where one computer acts as a mirror sending back
>>>>>>>>> packets received from the ethernet (only those two computers
>>>>>>>>> are on the network), and the other sends packets and
>>>>>>>>> measures roundtrip time.
>>>>>>>>> Most packets come back in approximately 100 us, but
>>>>>>>>> occasionally the reception times out (once in about 100000
>>>>>>>>> packets or more), yet the packets get received immediately
>>>>>>>>> when reception is retried, which might indicate a race
>>>>>>>>> between rt_dev_recvmsg and the interrupt, but I might be
>>>>>>>>> missing something obvious.
>>>>>>>> Changing one of the ethernet cards to an "Intel Corporation
>>>>>>>> 82541PI Gigabit Ethernet Controller (rev 05)", while keeping
>>>>>>>> everything else constant, changes the behavior somewhat: after
>>>>>>>> receiving a few 100000 packets, reception stops entirely
>>>>>>>> (-EAGAIN is returned), while transmission proceeds as it
>>>>>>>> should (and the mirror returns packets).
>>>>>>>>
>>>>>>>> Any suggestions on what to try?
>>>>>>> Since the problem disappears with 'maxcpus=1', I suspect I have
>>>>>>> an SMP issue (the machine is a Core2 Quad), so I'll move to
>>>>>>> xenomai-core.
>>>>>>> (original message can be found at
>>>>>>> http://sourceforge.net/mailarchive/message.php?msg_name=4CC82C8D.3080808%40control.lth.se
>>>>>>> )
>>>>>>>
>>>>>>> Xenomai-core gurus: which is the correct way to debug SMP
>>>>>>> issues? Can I run I-pipe-tracer and expect to be able to save
>>>>>>> at least 150 us of traces for all cpus? Any
>>>>>>> hints/suggestions/insights are welcome...
>>>>>> The i-pipe tracer unfortunately only saves traces for the CPU
>>>>>> that triggered the freeze. To have a full picture, you may want
>>>>>> to try my ftrace port I posted recently for 2.6.35.
>>>>> 2.6.35.7 ?
>>>>>
>>>> Exactly.
>>> Finally managed to get the ftrace to work
>>> (one possible bug: had to manually copy
>>> include/xenomai/trace/xn_nucleus.h to
>>> include/xenomai/trace/events/xn_nucleus.h), and it looks like it can
>>> be very useful...
>>>
>>> But I don't think it will give much info at the moment, since no
>>> xenomai/ipipe interrupt activity shows up, and adding that is far
>>> above my league :-(
>>
>> You could use the function tracer, provided you are able to stop the
>> trace quickly enough on error.
>>
>>> My current theory is that the problem occurs when something like
>>> this takes place:
>>>
>>> CPU-i CPU-j CPU-k CPU-l
>>>
>>> rt_dev_sendmsg
>>> xmit_irq
>>> rt_dev_recvmsg recv_irq
>>
>> Can't follow. What races here, and what will go wrong then?
> That's the good question. Find attached:
>
> 1. .config (so you can check for stupid mistakes)
> 2. console log
> 3. latest version of test program
> 4. tail of ftrace dump
>
> These are the xenomai tasks running when the test program is active:
>
> CPU  PID    CLASS  PRI  TIMEOUT  TIMEBASE  STAT  NAME
>   0  0      idle    -1  -        master    R     ROOT/0
>   1  0      idle    -1  -        master    R     ROOT/1
>   2  0      idle    -1  -        master    R     ROOT/2
>   3  0      idle    -1  -        master    R     ROOT/3
>   0  0      rt      98  -        master    W     rtnet-stack
>   0  0      rt       0  -        master    W     rtnet-rtpc
>   0  29901  rt      50  -        master          raw_test
>   0  29906  rt       0  -        master    X     reporter
>
> The lines of interest from the trace are probably:
>
> [003] 2061.347855: xn_nucleus_thread_resume: thread=f9bf7b00 thread_name=rtnet-stack mask=2
> [003] 2061.347862: xn_nucleus_sched: status=2000000
> [000] 2061.347866: xn_nucleus_sched_remote: status=0
>
> since this is the only place where a packet gets delayed, and the only
> place in the trace where sched_remote reports a status=0

The cpu that has rtnet-stack, and hence should be resumed, is doing
heavy I/O at the time of the fault. Could it be that
send_ipi/schedule_handler need barriers to make sure that decisions are
made on the right status?

/Anders
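
To make the barrier question concrete, here is a minimal user-space sketch of the ordering involved. It uses plain C11 atomics and pthreads and is not Xenomai nucleus code: resched_status, ipi_pending, RESCHED_BIT, send_resched and handle_resched are hypothetical names that only stand in for the per-CPU status word, the IPI, and the send_ipi/schedule_handler pair. The assumption illustrated is that the status update must become visible before the remote side is notified, otherwise the handler could act on a stale status (as the status=0 in the trace might suggest).

```c
/* Sketch only: C11 atomics + pthreads, NOT Xenomai nucleus code.
 * Every identifier below is hypothetical and chosen for illustration. */
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

#define RESCHED_BIT 0x1u            /* arbitrary "please reschedule" flag */

static atomic_uint resched_status;  /* stands in for a per-CPU status word */
static atomic_bool ipi_pending;     /* stands in for the IPI itself */

/* Sender side: record the request first, then raise the "IPI". */
static void send_resched(void)
{
    atomic_fetch_or_explicit(&resched_status, RESCHED_BIT,
                             memory_order_relaxed);
    /* Release store: the status update above cannot be reordered after
     * the notification. */
    atomic_store_explicit(&ipi_pending, true, memory_order_release);
}

/* Handler side: wait for the "IPI", then fetch and clear the status. */
static unsigned int handle_resched(void)
{
    /* Acquire load pairs with the release store in send_resched(). */
    while (!atomic_load_explicit(&ipi_pending, memory_order_acquire))
        ;   /* spin until notified */
    return atomic_exchange_explicit(&resched_status, 0u,
                                    memory_order_relaxed);
}

static void *handler_thread(void *arg)
{
    *(unsigned int *)arg = handle_resched();
    return NULL;
}

/* Runs one sender/handler round and returns the status the handler saw. */
unsigned int run_once(void)
{
    pthread_t t;
    unsigned int seen = 0;
    int rc;

    atomic_store(&resched_status, 0u);
    atomic_store(&ipi_pending, false);

    rc = pthread_create(&t, NULL, handler_thread, &seen);
    assert(rc == 0);
    send_resched();
    rc = pthread_join(t, NULL);
    assert(rc == 0);
    return seen;
}
```

With the release/acquire pair in place, the handler always observes RESCHED_BIT. If both operations on ipi_pending were relaxed, the C11 memory model would no longer guarantee that, which is the kind of window the question above is asking about. Whether the real send_ipi/schedule_handler path has such a window can only be settled against the nucleus source.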