* dual HCAs with upstream kernel
@ 2010-08-12 15:42 Hefty, Sean
[not found] ` <CF9C39F99A89134C9CF9C4CCB68B8DDF25A9687B2B-osO9UTpF0USkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
0 siblings, 1 reply; 6+ messages in thread
From: Hefty, Sean @ 2010-08-12 15:42 UTC (permalink / raw)
To: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Does anyone have a system with multiple HCAs that's running a recent upstream kernel?
Oracle has reported a bug connecting between two HCAs in the same system using the rdma_cm against OFED-1.5, and I'm looking for someone who might be able to quickly test whether the problem is also present upstream.
- Sean
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 6+ messages in thread
[parent not found: <CF9C39F99A89134C9CF9C4CCB68B8DDF25A9687B2B-osO9UTpF0USkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>]
* Re: using same IP subnet on multiple interfaces (was: dual HCAs with upstream kernel)
[not found] ` <CF9C39F99A89134C9CF9C4CCB68B8DDF25A9687B2B-osO9UTpF0USkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2010-08-15 7:50 ` Or Gerlitz
[not found] ` <4C679C39.8060709-hKgKHo2Ms0FWk0Htik3J/w@public.gmane.org>
0 siblings, 1 reply; 6+ messages in thread
From: Or Gerlitz @ 2010-08-15 7:50 UTC (permalink / raw)
To: Hefty, Sean
Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Jason Gunthorpe, Patrick McHardy

Hefty, Sean wrote:
> Does anyone have a system with multiple HCAs that's running a recent upstream kernel?
> Oracle has reported a bug connecting between two HCAs in the same system using the rdma_cm

Sean,

With 2.6.35, I was hitting the reported failure (address error event, status -ETIMEDOUT) with a simpler configuration of two ports belonging to the same HCA. I used ucmatose and not rping, as the former allows specifying a local binding whereas the latter doesn't (see below).

Next, I realized that a similar test with ping(8) doesn't work either: the ARP request was transmitted through one interface (ib0) and received on the other (ib1), but no reply was generated. At this point, I thought that maybe one of the ARP-related sysctls could affect that, and I got an initial hit... following commit 8153a10, once I set net.ipv4.conf.ib1.accept_local to 1 I could # ping -I ib0 to ib1's address, where before that I couldn't. ucmatose started working as well, no problem.

> commit 8153a10c08f1312af563bb92532002e46d3f504a
> Author: Patrick McHardy <kaber-dcUjhNyLwpNeoWH0uzbU5w@public.gmane.org>
> Date: Thu Dec 3 01:25:58 2009 +0000
[...]
> Change fib_validate_source() to accept packets with a local source address when
> the "accept_local" sysctl is set for the incoming inet device. Combined with the
> previous patches, this allows to communicate between multiple local interfaces over the wire.

> # ip r s
> 192.168.20.0/24 dev ib0 proto kernel scope link src 192.168.20.1
> 192.168.20.0/24 dev ib1 proto kernel scope link src 192.168.20.100

before net.ipv4.conf.ib1.accept_local was set to 1, ping isn't working

> # ping -I ib0 192.168.20.100 -q &
> # PING 192.168.20.100 (192.168.20.100) from 192.168.20.1 ib0: 56(84) bytes of data.

> # tcpdump -ni ib0
> 10:12:14.679101 ARP, Request who-has 192.168.20.100 tell 192.168.20.1, length 56
> 10:12:15.679337 ARP, Request who-has 192.168.20.100 tell 192.168.20.1, length 56

> # tcpdump -ni ib1
> 10:13:35.798332 ARP, Request who-has 192.168.20.100 tell 192.168.20.1, length 56
> 10:13:36.798569 ARP, Request who-has 192.168.20.100 tell 192.168.20.1, length 56

> # ip n s
> 192.168.20.100 dev ib0 INCOMPLETE

after setting net.ipv4.conf.ib1.accept_local to 1, ping (and ucmatose) work, but

> # tcpdump -ni ib0
> 10:29:32.196866 ARP, Request who-has 192.168.20.100 tell 192.168.20.1, length 56
> 10:29:32.197047 ARP, Reply 192.168.20.100 is-at 80:00:00:49:fe:80:00:00:00:00:00:00:00:02:c9:03:00:02:6b:e8, length 56
> 10:29:32.197058 IP 192.168.20.1 > 192.168.20.100: ICMP echo request, id 33038, seq 1, length 64
> 10:29:32.197125 IP 192.168.20.1 > 192.168.20.100: ICMP echo request, id 33038, seq 1, length 64
> 10:29:33.197013 IP 192.168.20.1 > 192.168.20.100: ICMP echo request, id 33038, seq 2, length 64

> # tcpdump -ni ib1
> 10:29:32.196920 ARP, Request who-has 192.168.20.100 tell 192.168.20.1, length 56
> 10:29:32.196944 ARP, Reply 192.168.20.100 is-at 80:00:00:49:fe:80:00:00:00:00:00:00:00:02:c9:03:00:02:6b:e8, length 56
> 10:29:32.197029 ARP, Reply 192.168.20.100 is-at 80:00:00:49:fe:80:00:00:00:00:00:00:00:02:c9:03:00:02:6b:e8, length 56
> 10:29:32.197136 IP 192.168.20.1 > 192.168.20.100: ICMP echo request, id 33038, seq 1, length 64
> 10:29:33.197023 IP 192.168.20.1 > 192.168.20.100: ICMP echo request, id 33038, seq 2, length 64
> 10:29:34.197357 IP 192.168.20.1 > 192.168.20.100: ICMP echo request, id 33038, seq 3, length 64

the echo requests go on the wire, the replies not, probably (...) internally, Patrick?

I noted that the neighbour on the NIC which is replying quickly gets stale and is later aged out

> # ip n s
> 192.168.20.100 dev ib0 lladdr 80:00:00:49:fe:80:00:00:00:00:00:00:00:02:c9:03:00:02:6b:e8 REACHABLE
> 192.168.20.1 dev ib1 lladdr 80:00:00:48:fe:80:00:00:00:00:00:00:00:02:c9:03:00:02:6b:e7 STALE

Or.

This is my related configuration. I tried changing rp_filter to 0 but it didn't change things either

> # sysctl -a | grep accept_local | grep ib[0,1]
> net.ipv4.conf.ib0.accept_local = 1
> net.ipv4.conf.ib1.accept_local = 1

> # sysctl -a | grep rp_ | grep ib[0,1]
> net.ipv4.conf.ib0.rp_filter = 1
> net.ipv4.conf.ib0.arp_filter = 0
> net.ipv4.conf.ib0.arp_announce = 0
> net.ipv4.conf.ib0.arp_ignore = 1
> net.ipv4.conf.ib0.arp_accept = 0
> net.ipv4.conf.ib0.arp_notify = 0
> net.ipv4.conf.ib0.proxy_arp_pvlan = 0
> net.ipv4.conf.ib1.rp_filter = 1
> net.ipv4.conf.ib1.arp_filter = 0
> net.ipv4.conf.ib1.arp_announce = 0
> net.ipv4.conf.ib1.arp_ignore = 1
> net.ipv4.conf.ib1.arp_accept = 0
> net.ipv4.conf.ib1.arp_notify = 0
> net.ipv4.conf.ib1.proxy_arp_pvlan = 0

^ permalink raw reply [flat|nested] 6+ messages in thread
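The INCOMPLETE, REACHABLE and STALE neighbour states reported in the message above can also be read from a script. What follows is only a minimal sketch, assuming that "ip neigh show" prints one entry per line with the state as the last word; neigh_state is a hypothetical helper name, not an iproute2 command.

```shell
# Hypothetical helper (not part of iproute2).
# Reads "ip neigh show"-style lines on stdin and prints the state of the
# entry matching IP address $1 on device $2.
# Assumption: the state is the last word of each line.
neigh_state() {
    awk -v ip="$1" -v dev="$2" '
        $1 == ip {
            # look for the "dev NAME" pair that follows the address
            for (i = 2; i < NF; i++)
                if ($i == "dev" && $(i + 1) == dev) { print $NF; exit }
        }
    '
}
```

Possible usage on the sender's side: ip neigh show | neigh_state 192.168.20.100 ib0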
[parent not found: <4C679C39.8060709-hKgKHo2Ms0FWk0Htik3J/w@public.gmane.org>]
* Re: using same IP subnet on multiple interfaces (was: dual HCAs with upstream kernel)
[not found] ` <4C679C39.8060709-hKgKHo2Ms0FWk0Htik3J/w@public.gmane.org>
@ 2010-08-15 16:59 ` Jason Gunthorpe
[not found] ` <20100815165946.GA2861-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
0 siblings, 1 reply; 6+ messages in thread
From: Jason Gunthorpe @ 2010-08-15 16:59 UTC (permalink / raw)
To: Or Gerlitz
Cc: Hefty, Sean, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Patrick McHardy

On Sun, Aug 15, 2010 at 10:50:17AM +0300, Or Gerlitz wrote:
> the echo requests go on the wire, the replies not, probably (...)
> internally, Patrick?

What all these settings do is let a socket that is bound to a device resolve the local host's address through ARP. The socket that is bound to a device will then use its device for sending, but other sockets not bound to devices will do route lookups and use the lo device.

Do:

ip route get 192.168.20.100 dev ib0
ip route get 192.168.20.1 src 192.168.20.100

To see the difference in each side.

To really effect a full external loopback you need to have both sides bound to their respective devices. Note that binding to a device and binding to a source IP are not the same thing in Linux.

In the RDMA CM case the listening side doesn't do any IP routing operations at all so a device bind isn't necessary.

Jason

^ permalink raw reply [flat|nested] 6+ messages in thread
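Jason's point that a socket not bound to a device ends up on lo can be observed by pulling the output device out of "ip route get". The following is a minimal sketch, assuming the output carries a "dev NAME" pair; route_dev is a hypothetical helper name, and the sample line in the comment is illustrative rather than captured from the system discussed in this thread.

```shell
# Hypothetical helper (not part of iproute2).
# Reads "ip route get" output on stdin and prints the word that follows
# the first "dev" keyword, i.e. the device the route lookup selected.
# Illustrative input: "local 192.168.20.100 dev lo src 192.168.20.100"
route_dev() {
    awk '{
        for (i = 1; i < NF; i++)
            if ($i == "dev") { print $(i + 1); exit }
    }'
}
```

Possible usage: ip route get 192.168.20.100 | route_dev
If the lookup for the local address prints lo, replies from an unbound socket would stay inside the host, which matches the explanation in the message above.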
[parent not found: <20100815165946.GA2861-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>]
* Re: using same IP subnet on multiple interfaces
[not found] ` <20100815165946.GA2861-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2010-08-16 15:30 ` Or Gerlitz
[not found] ` <4C69597C.2040008-hKgKHo2Ms0FWk0Htik3J/w@public.gmane.org>
0 siblings, 1 reply; 6+ messages in thread
From: Or Gerlitz @ 2010-08-16 15:30 UTC (permalink / raw)
To: Jason Gunthorpe, Hefty, Sean
Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Patrick McHardy

Jason Gunthorpe wrote:
> [...] The socket that is bound to a device will then use its device for sending,
> but other sockets not bound to devices will do route lookups and use the lo device.
> Do: [...] To see the difference in each side.

sure, makes sense, the ping-reply code does a route lookup and will use the loopback device.

I took a second look at ping w.r.t. the various sysctl states, and when rp_filter is set to its default

> # sysctl -a | grep -wE "accept_local|rp_filter|arp_ignore" | grep ib
> net.ipv4.conf.ib0.rp_filter = 1
> net.ipv4.conf.ib0.accept_local = 1
> net.ipv4.conf.ib0.arp_ignore = 1
> net.ipv4.conf.ib1.rp_filter = 1
> net.ipv4.conf.ib1.accept_local = 1
> net.ipv4.conf.ib1.arp_ignore = 1

ping isn't working since there's no ARP reply

> # ping -I ib0 192.168.20.100
> PING 192.168.20.100 (192.168.20.100) from 192.168.20.1 ib0: 56(84) bytes of data.
> From 192.168.20.1 icmp_seq=2 Destination Host Unreachable
> From 192.168.20.1 icmp_seq=3 Destination Host Unreachable
> From 192.168.20.1 icmp_seq=4 Destination Host Unreachable

> # tcpdump -ni ib0
> 18:04:39.492306 ARP, Request who-has 192.168.20.100 tell 192.168.20.1, length 56
> 18:04:40.492541 ARP, Request who-has 192.168.20.100 tell 192.168.20.1, length 56

> # tcpdump -ni ib1
> 18:04:42.497039 ARP, Request who-has 192.168.20.100 tell 192.168.20.1, length 56
> 18:04:43.497268 ARP, Request who-has 192.168.20.100 tell 192.168.20.1, length 56

Once I set net.ipv4.conf.ib1.rp_filter=0, ARP replies are generated and ping works as you explained: echo-request externally, echo-reply internally

> # tcpdump -ni ib1
> 18:06:33.103248 ARP, Request who-has 192.168.20.100 tell 192.168.20.1, length 56
> 18:06:33.103281 ARP, Reply 192.168.20.100 is-at 80:00:00:49:fe:80:00:00:00:00:00:00:00:02:c9:03:00:02:6b:e8, length 56
> 18:06:33.103369 ARP, Reply 192.168.20.100 is-at 80:00:00:49:fe:80:00:00:00:00:00:00:00:02:c9:03:00:02:6b:e8, length 56
> 18:06:33.103461 IP 192.168.20.1 > 192.168.20.100: ICMP echo request, id 26906, seq 1, length 64
> 18:06:34.107465 IP 192.168.20.1 > 192.168.20.100: ICMP echo request, id 26906, seq 2, length 64

Now, if I return rp_filter to 1, ping keeps working using the neighbour previously created. ping even keeps working when I set net.ipv4.conf.ib1.accept_local to 0, which is a bit weird unless this sysctl is made to act at the neighbour level (i.e. control ARP replies and not any packet xmit).

> To really effect a full external loopback you need to have both sides
> bound to their respective devices. Note that binding to a device and
> binding to a source IP are not the same thing in Linux.

Even without being fully into the details of what binding to a source IP actually translates to, I understand there's a difference.

> In the RDMA CM case the listening side doesn't do any IP
> routing operations at all so a device bind isn't necessary.

Yes, indeed. As for the active side, the RDMA CM doesn't have a BINDTODEVICE equivalent.

As for the original issue we were discussing here, Sean - the conclusion is that with upstream 2.6.35 bits, for the rdma connection to go from hca1 port1 to hca1 port2 (or from hca1 port1 to hca2 port1), the rdma-cm needs a neighbour, similarly to a ping -I ib0 to ib1's address.

A neighbour isn't created unless the responding NIC (ib1 in my example) has both rp_filter set to 0 and accept_local set to 1. Jason, does this make sense?

Or.

^ permalink raw reply [flat|nested] 6+ messages in thread
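The condition stated at the end of the message above (the responding interface needs rp_filter set to 0 and accept_local set to 1 before a neighbour gets created) can be checked against "sysctl -a" output. This is a minimal sketch under that stated conclusion only; can_answer_local_arp is a hypothetical helper name, and it looks at the per-interface keys alone, ignoring any net.ipv4.conf.all.* settings.

```shell
# Hypothetical helper, encoding only the condition reported in this thread.
# Reads "sysctl -a"-style lines ("key = value") on stdin and succeeds
# (exit status 0) when interface $1 has rp_filter = 0 and accept_local = 1.
# Assumption: net.ipv4.conf.all.* values are not taken into account here.
can_answer_local_arp() {
    awk -v ifn="$1" '
        $1 == ("net.ipv4.conf." ifn ".rp_filter")    { rp = $3 }
        $1 == ("net.ipv4.conf." ifn ".accept_local") { al = $3 }
        END { exit !(rp == "0" && al == "1") }
    '
}
```

Possible usage: sysctl -a 2>/dev/null | can_answer_local_arp ib1 && echo "ib1 should answer ARP for local sources"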
[parent not found: <4C69597C.2040008-hKgKHo2Ms0FWk0Htik3J/w@public.gmane.org>]
* Re: using same IP subnet on multiple interfaces
[not found] ` <4C69597C.2040008-hKgKHo2Ms0FWk0Htik3J/w@public.gmane.org>
@ 2010-08-17 3:19 ` Jason Gunthorpe
[not found] ` <20100817031945.GA5251-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
0 siblings, 1 reply; 6+ messages in thread
From: Jason Gunthorpe @ 2010-08-17 3:19 UTC (permalink / raw)
To: Or Gerlitz
Cc: Hefty, Sean, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Patrick McHardy

On Mon, Aug 16, 2010 at 06:30:04PM +0300, Or Gerlitz wrote:
> As for the original issue we were discussing here, Sean - the
> conclusion is that with upstream 2.6.35 bits for the rdma connection
> to go from hca1 port1 to hca1 port2 (or from hca1 port1 to hca2
> port1), the rdma-cm needs a neighbour, similarly to a ping -I ib0 to
> ib1 address.
>
> A neighbour isn't created unless the responding NIC (ib1 in my
> example) has both rp_filter set to 0 and accept_local set to 1,
> Jason, does this make sense?

This description seemed reasonable to me.

It is pretty confusing what binding means in RDMA CM: it is different than sockets, and is some combination of SO_BINDTODEVICE and bind to address.

Also, you might find that the fixes that were done lately for IPv6 tidied up some of the general routing and device selection stuff that becomes noticeable when you start doing funny routing things like this.

Jason

^ permalink raw reply [flat|nested] 6+ messages in thread
[parent not found: <20100817031945.GA5251-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>]
* Re: using same IP subnet on multiple interfaces
[not found] ` <20100817031945.GA5251-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2010-08-18 6:02 ` Or Gerlitz
0 siblings, 0 replies; 6+ messages in thread
From: Or Gerlitz @ 2010-08-18 6:02 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Hefty, Sean, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

Jason Gunthorpe wrote:
>> As for the original issue we were discussing here, the conclusion is that with upstream 2.6.35 bits for the rdma connection to go from hca1 port1 to hca1 port2 (or from hca1 port1 to hca2 port1), the rdma-cm needs a neighbour, similarly to a ping -I ib0 to ib1 address. A neighbour isn't created unless the responding NIC (ib1 in my example) has both rp_filter set to 0 and accept_local set to 1,
>> does this make sense?
> This description seemed reasonable to me. It is pretty confusing what binding means in RDMA CM: it is different than sockets, and is some combination of SO_BINDTODEVICE and bind to address.

I was thinking that one of the things taken care of by the patch set to addr.c/cma.c that you, David and Sean did last year was to make binding in rdma-cm a by-the-book bind to address. In what respect is it different now?

Or.

^ permalink raw reply [flat|nested] 6+ messages in thread