public inbox for linux-rdma@vger.kernel.org
From: Yevgeny Kliteynik <kliteyn-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
To: Albert Chu <chu11-i2BcT+NCU+M@public.gmane.org>
Cc: Sasha Khapyorsky <sashak-smomgflXvOZWk0Htik3J/w@public.gmane.org>,
	"linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org"
	<linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>,
	Yevgeny Kliteynik
	<kliteyn-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
Subject: Re: [opensm] RFC: new routing options
Date: Tue, 12 Oct 2010 09:59:22 +0200	[thread overview]
Message-ID: <4CB4155A.3030904@dev.mellanox.co.il> (raw)
In-Reply-To: <1286559644.12737.38.camel-X2zTWyBD0EhliZ7u+bvwcg@public.gmane.org>

Hi Al,

This looks really great!
One question: have you tried benchmarking the BW with up/down
routing using the guid_routing_order_file option w/o your new
features?

-- YK

On 08-Oct-10 7:40 PM, Albert Chu wrote:
> Hey Sasha,
> 
> We recently got a new cluster and I've been experimenting with some
> routing changes to improve the average bandwidth of the cluster.  They
> are attached as patches with description of the routing goals below.
> 
> We're using mpiGraph (http://sourceforge.net/projects/mpigraph/) to
> measure min, peak, and average send/recv bandwidth across the cluster.
> What we found with the original updn routing was an average of around
> 420 MB/s send bandwidth and 508 MB/s recv bandwidth.  The following two
> patches were able to get the average send bandwidth up to 1045 MB/s and
> recv bandwidth up to 1228 MB/s.
> 
> I'm sure this is only round 1 of the patches and I'm looking for
> comments.  Many areas could be cleaned up w/ some rearchitecture or
> struct changes, but I simply went with the most non-invasive
> implementation first.  I'm also open to name changes on the options.
> 
> BTW, b/c of the old management tree on the git server, the following
> patches were developed on an internal LLNL tree.  I'll rebase after the
> up-to-date tree is on the openfabrics server.
> 
> 1) Port Shifting
> 
> This is similar to what was done with some of the LMC > 0 code.
> Congestion would occur due to "alignment" of routes w/ common traffic
> patterns.  However, we found that it was also necessary for LMC=0 and
> only for used ports.  For example, let's say there are 4 ports (called A,
> B, C, D) and we are routing lids 1-9 through them.  Suppose only routing
> through A, B, and C will reach lids 1-9.
> 
> The LFT would normally be:
> 
> A: 1 4 7
> B: 2 5 8
> C: 3 6 9
> D:
> 
> The Port Shifting would make this:
> 
> A: 1 6 8
> B: 2 4 9
> C: 3 5 7
> D:
> 
> This option by itself improved the mpiGraph average send/recv bandwidth
> from 420 MB/s and 508 MB/s to 991 MB/s and 1172 MB/s.
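[The shifted LFT above is consistent with a round-robin that starts one
port further along each time the assignment wraps around the port list.
A minimal Python sketch of that rule, as I read the example; this is my
own reconstruction for illustration, not code from the patch:]

```python
def build_lft(lids, ports, shift=True):
    """Assign each lid to a port round-robin; with shift=True, each
    wrap around the port list starts one port further along."""
    n = len(ports)
    table = {p: [] for p in ports}
    for i, lid in enumerate(lids):
        rnd = i // n                       # full rounds completed so far
        idx = (i + rnd) % n if shift else i % n
        table[ports[idx]].append(lid)
    return table

# Reproduces the tables from the example above:
# build_lft(range(1, 10), "ABC", shift=False) -> A: 1 4 7, B: 2 5 8, C: 3 6 9
# build_lft(range(1, 10), "ABC", shift=True)  -> A: 1 6 8, B: 2 4 9, C: 3 5 7
```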
> 
> 2) Remote Guid Sorting
> 
> Most core/spine switches we've seen have had line boards connected to
> spine boards in a consistent pattern.  However, we recently got some
> Qlogic switches that connect from line/leaf boards to spine boards in a
> (to the casual observer) random pattern.  I'm sure there was a good
> electrical/board reason for this design, but it does hurt routing b/c
> some of the opensm routing algorithms assume a consistent connection
> pattern.  Here's an output from iblinkinfo as an example.
> 
> Switch 0x00066a00ec0029b8 ibcore1 L123:
>           180    1[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      254   19[  ] "ibsw55" ( )
>           180    2[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      253   19[  ] "ibsw56" ( )
>           180    3[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      258   19[  ] "ibsw57" ( )
>           180    4[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      257   19[  ] "ibsw58" ( )
>           180    5[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      256   19[  ] "ibsw59" ( )
>           180    6[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      255   19[  ] "ibsw60" ( )
>           180    7[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      261   19[  ] "ibsw61" ( )
>           180    8[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      262   19[  ] "ibsw62" ( )
>           180    9[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      260   19[  ] "ibsw63" ( )
>           180   10[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      259   19[  ] "ibsw64" ( )
>           180   11[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      284   19[  ] "ibsw65" ( )
>           180   12[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      285   19[  ] "ibsw66" ( )
>           180   13[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>     2227   19[  ] "ibsw67" ( )
>           180   14[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      283   19[  ] "ibsw68" ( )
>           180   15[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      267   19[  ] "ibsw69" ( )
>           180   16[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      270   19[  ] "ibsw70" ( )
>           180   17[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      269   19[  ] "ibsw71" ( )
>           180   18[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      268   19[  ] "ibsw72" ( )
>           180   19[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      222   17[  ] "ibcore1 S117B" ( )
>           180   20[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      209   19[  ] "ibcore1 S211B" ( )
>           180   21[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      218   21[  ] "ibcore1 S117A" ( )
>           180   22[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      192   23[  ] "ibcore1 S215B" ( )
>           180   23[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>       85   15[  ] "ibcore1 S209A" ( )
>           180   24[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      182   13[  ] "ibcore1 S215A" ( )
>           180   25[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      200   11[  ] "ibcore1 S115B" ( )
>           180   26[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      129   25[  ] "ibcore1 S209B" ( )
>           180   27[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      213   27[  ] "ibcore1 S115A" ( )
>           180   28[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      197   29[  ] "ibcore1 S213B" ( )
>           180   29[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      178   28[  ] "ibcore1 S111A" ( )
>           180   30[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      215    7[  ] "ibcore1 S213A" ( )
>           180   31[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      207    5[  ] "ibcore1 S113B" ( )
>           180   32[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      212    6[  ] "ibcore1 S211A" ( )
>           180   33[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      154   33[  ] "ibcore1 S113A" ( )
>           180   34[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      194   35[  ] "ibcore1 S217B" ( )
>           180   35[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      191    3[  ] "ibcore1 S111B" ( )
>           180   36[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      219    1[  ] "ibcore1 S217A" ( )
> 
> This is a line board that connects up to spine boards (ibcore1 S*
> switches) and down to leaf/edge switches (ibsw*).  As you can see, the
> line board connects to the ports on the spine switches in a random
> fashion (to the casual observer).
> 
> The "remote_guid_sorting" option slightly tweaks routing: instead of
> finding a port to route through by searching ports 1 to N, it
> (effectively) sorts the ports based on the guid of the remote connected
> node, then picks a port searching from lowest guid to highest guid.
> That way the routing calculations across each line/leaf board and spine
> switch will be consistent.
> 
> This patch (on top of the port_shifting one above) improved the mpiGraph
> average send/recv bandwidth from 991 MB/s and 1172 MB/s to 1045 MB/s and
> 1228 MB/s.
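[The sorting described above amounts to ordering candidate ports by the
node GUID on the far end of the link before the usual scan, so two
boards wired to the same spines in different physical orders still scan
them in the same order.  A hedged sketch; the field names are
illustrative, not opensm's actual structures:]

```python
def ports_in_guid_order(ports):
    """Scan candidate ports ordered by the GUID of the remote node
    they connect to, instead of by physical port number (1..N)."""
    return sorted(ports, key=lambda p: p["remote_node_guid"])

# Two line boards reach the same two spines through different
# physical ports, yet both scan the spines in the same GUID order.
board1 = [{"num": 1, "remote_node_guid": 0x30},
          {"num": 2, "remote_node_guid": 0x10}]
board2 = [{"num": 1, "remote_node_guid": 0x10},
          {"num": 2, "remote_node_guid": 0x30}]
```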
> 
> Al
> 


Thread overview: 3+ messages
2010-10-08 17:40 [opensm] RFC: new routing options Albert Chu
2010-10-12  7:59 ` Yevgeny Kliteynik [this message]
2010-10-12 16:43   ` Albert Chu
