From mboxrd@z Thu Jan 1 00:00:00 1970
From: Yevgeny Kliteynik
Subject: Re: [opensm] RFC: new routing options
Date: Tue, 12 Oct 2010 09:59:22 +0200
Message-ID: <4CB4155A.3030904@dev.mellanox.co.il>
References: <1286559644.12737.38.camel@auk31.llnl.gov>
Reply-To: kliteyn-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=windows-1255
Content-Transfer-Encoding: 7bit
Return-path:
In-Reply-To: <1286559644.12737.38.camel-X2zTWyBD0EhliZ7u+bvwcg@public.gmane.org>
Sender: linux-rdma-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
To: Albert Chu
Cc: Sasha Khapyorsky, "linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org", Yevgeny Kliteynik
List-Id: linux-rdma@vger.kernel.org

Hi Al,

This looks really great!
One question: have you tried benchmarking the BW with up/down routing
using the guid_routing_order_file option, w/o your new features?

-- YK

On 08-Oct-10 7:40 PM, Albert Chu wrote:
> Hey Sasha,
>
> We recently got a new cluster, and I've been experimenting with some
> routing changes to improve the average bandwidth of the cluster. They
> are attached as patches, with a description of the routing goals below.
>
> We're using mpiGraph (http://sourceforge.net/projects/mpigraph/) to
> measure min, peak, and average send/recv bandwidth across the cluster.
> With the original updn routing, we found an average of around 420 MB/s
> send bandwidth and 508 MB/s recv bandwidth. The following two patches
> were able to get the average send bandwidth up to 1045 MB/s and the
> recv bandwidth up to 1228 MB/s.
>
> I'm sure this is only round 1 of the patches, and I'm looking for
> comments. Many areas could be cleaned up w/ some rearchitecture or
> struct changes, but I went with the most non-invasive implementation
> first. I'm also open to name changes for the options.
>
> BTW, b/c of the old management tree on the git server, the following
> patches were developed on an internal LLNL tree. I'll rebase after the
> up-to-date tree is on the openfabrics server.
>
> 1) Port Shifting
>
> This is similar to what was done with some of the LMC > 0 code.
> Congestion would occur due to "alignment" of routes w/ common traffic
> patterns. However, we found that it is also necessary for LMC=0, and
> only for the ports actually used. For example, let's say there are 4
> ports (called A, B, C, D) and we are routing lids 1-9 through them.
> Suppose lids 1-9 can be reached only through A, B, and C.
>
> The LFT would normally be:
>
> A: 1 4 7
> B: 2 5 8
> C: 3 6 9
> D:
>
> Port shifting would make this:
>
> A: 1 6 8
> B: 2 4 9
> C: 3 5 7
> D:
>
> This option by itself improved the mpiGraph average send/recv bandwidth
> from 420 MB/s and 508 MB/s to 991 MB/s and 1172 MB/s.
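>
> The shift rule implied by the tables above amounts to something like
> this (a hypothetical sketch, not the actual patch; the function name
> and parameters are made up for illustration):
>
>   /* Pick a port index for the i-th lid (0-based) routed through this
>    * switch.  Each full pass over the used ports starts one port
>    * further along, so consecutive passes no longer line up on the
>    * same ports.
>    */
>   static unsigned shifted_port_index(unsigned lid_offset,
>                                      unsigned num_used_ports)
>   {
>           unsigned pass = lid_offset / num_used_ports;
>           unsigned slot = lid_offset % num_used_ports;
>
>           return (slot + pass) % num_used_ports;
>   }
>
> With 3 used ports (A, B, C) and lids 1-9 (lid_offset 0-8), this
> reproduces the shifted table: lid 4 moves from A to B, lid 8 from B
> to A, and so on.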
>
> 2) Remote Guid Sorting
>
> Most core/spine switches we've seen have had line boards connected to
> spine boards in a consistent pattern. However, we recently got some
> Qlogic switches that connect line/leaf boards to spine boards in a (to
> the casual observer) random pattern. I'm sure there was a good
> electrical/board reason for this design, but it hurts routing b/c some
> of the opensm routing algorithms implicitly assume a consistent
> pattern. Here's an output from iblinkinfo as an example.
>
> Switch 0x00066a00ec0029b8 ibcore1 L123:
> 180  1[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==>  254 19[ ] "ibsw55" ( )
> 180  2[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==>  253 19[ ] "ibsw56" ( )
> 180  3[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==>  258 19[ ] "ibsw57" ( )
> 180  4[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==>  257 19[ ] "ibsw58" ( )
> 180  5[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==>  256 19[ ] "ibsw59" ( )
> 180  6[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==>  255 19[ ] "ibsw60" ( )
> 180  7[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==>  261 19[ ] "ibsw61" ( )
> 180  8[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==>  262 19[ ] "ibsw62" ( )
> 180  9[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==>  260 19[ ] "ibsw63" ( )
> 180 10[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==>  259 19[ ] "ibsw64" ( )
> 180 11[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==>  284 19[ ] "ibsw65" ( )
> 180 12[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==>  285 19[ ] "ibsw66" ( )
> 180 13[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 2227 19[ ] "ibsw67" ( )
> 180 14[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==>  283 19[ ] "ibsw68" ( )
> 180 15[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==>  267 19[ ] "ibsw69" ( )
> 180 16[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==>  270 19[ ] "ibsw70" ( )
> 180 17[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==>  269 19[ ] "ibsw71" ( )
> 180 18[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==>  268 19[ ] "ibsw72" ( )
> 180 19[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==>  222 17[ ] "ibcore1 S117B" ( )
> 180 20[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==>  209 19[ ] "ibcore1 S211B" ( )
> 180 21[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==>  218 21[ ] "ibcore1 S117A" ( )
> 180 22[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==>  192 23[ ] "ibcore1 S215B" ( )
> 180 23[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==>   85 15[ ] "ibcore1 S209A" ( )
> 180 24[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==>  182 13[ ] "ibcore1 S215A" ( )
> 180 25[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==>  200 11[ ] "ibcore1 S115B" ( )
> 180 26[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==>  129 25[ ] "ibcore1 S209B" ( )
> 180 27[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==>  213 27[ ] "ibcore1 S115A" ( )
> 180 28[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==>  197 29[ ] "ibcore1 S213B" ( )
> 180 29[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==>  178 28[ ] "ibcore1 S111A" ( )
> 180 30[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==>  215  7[ ] "ibcore1 S213A" ( )
> 180 31[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==>  207  5[ ] "ibcore1 S113B" ( )
> 180 32[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==>  212  6[ ] "ibcore1 S211A" ( )
> 180 33[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==>  154 33[ ] "ibcore1 S113A" ( )
> 180 34[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==>  194 35[ ] "ibcore1 S217B" ( )
> 180 35[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==>  191  3[ ] "ibcore1 S111B" ( )
> 180 36[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==>  219  1[ ] "ibcore1 S217A" ( )
>
> This is a line board that connects up to spine boards (the ibcore1 S*
> switches) and down to leaf/edge switches (ibsw*). As you can see, the
> line board connects to the ports on the spine switches in a random
> fashion (to the casual observer).
>
> The "remote_guid_sorting" option slightly tweaks routing so that,
> instead of finding a port to route through by searching ports 1 to N,
> it (effectively) sorts the ports by the guid of the remotely connected
> node, then picks a port searching from lowest guid to highest guid.
> That way the routing calculations across each line/leaf board and
> spine switch will be consistent.
>
> This patch (on top of the port_shifting one above) improved the
> mpiGraph average send/recv bandwidth from 991 MB/s and 1172 MB/s to
> 1045 MB/s and 1228 MB/s.
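>
> In code terms, the sort amounts to something like the following sketch
> (the struct and helper names are made up for illustration; they are
> not the actual opensm data structures):
>
>   #include <stdint.h>
>   #include <stdlib.h>
>
>   struct port_link {
>           unsigned port_num;      /* local port number on this switch */
>           uint64_t remote_guid;   /* node guid on the far end of the link */
>   };
>
>   static int cmp_remote_guid(const void *a, const void *b)
>   {
>           const struct port_link *pa = a, *pb = b;
>
>           if (pa->remote_guid < pb->remote_guid)
>                   return -1;
>           if (pa->remote_guid > pb->remote_guid)
>                   return 1;
>           return 0;
>   }
>
>   /* Visit candidate ports in remote-guid order rather than physical
>    * port order, so every line board walks the spines in the same
>    * sequence no matter how the boards were cabled.
>    */
>   static void sort_ports_by_remote_guid(struct port_link *links, size_t n)
>   {
>           qsort(links, n, sizeof(*links), cmp_remote_guid);
>   }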
>
> Al