From: Yevgeny Kliteynik <kliteyn-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
To: Albert Chu <chu11-i2BcT+NCU+M@public.gmane.org>
Cc: Sasha Khapyorsky <sashak-smomgflXvOZWk0Htik3J/w@public.gmane.org>,
"linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org"
<linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>,
Yevgeny Kliteynik
<kliteyn-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
Subject: Re: [opensm] RFC: new routing options
Date: Tue, 12 Oct 2010 09:59:22 +0200
Message-ID: <4CB4155A.3030904@dev.mellanox.co.il>
In-Reply-To: <1286559644.12737.38.camel-X2zTWyBD0EhliZ7u+bvwcg@public.gmane.org>
Hi Al,
This looks really great!
One question: have you tried benchmarking the BW with up/down
routing using the guid_routing_order_file option w/o your new
features?
-- YK
On 08-Oct-10 7:40 PM, Albert Chu wrote:
> Hey Sasha,
>
> We recently got a new cluster and I've been experimenting with some
> routing changes to improve the average bandwidth of the cluster. They
> are attached as patches with description of the routing goals below.
>
> We're using mpiGraph (http://sourceforge.net/projects/mpigraph/) to
> measure min, peak, and average send/recv bandwidth across the cluster.
> What we found with the original updn routing was an average of around
> 420 MB/s send bandwidth and 508 MB/s recv bandwidth. The following two
> patches were able to get the average send bandwidth up to 1045 MB/s and
> recv bandwidth up to 1228 MB/s.
>
> I'm sure this is only round 1 of the patches and I'm looking for
> comments. Many areas could be cleaned up w/ some rearchitecture or
> struct changes, but I simply implemented the most non-invasive
> implementation first. I'm also open to name changes on the options.
>
> BTW, b/c of the old management tree on the git server, the following
> patches were developed on an internal LLNL tree. I'll rebase after the
> up2date tree is on the openfabrics server.
>
> 1) Port Shifting
>
> This is similar to what was done with some of the LMC > 0 code.
> Congestion would occur due to "alignment" of routes w/ common traffic
> patterns. However, we found that it was also necessary for LMC=0 and
> only for used ports. For example, let's say there are 4 ports (called A,
> B, C, D) and we are routing lids 1-9 through them. Suppose only routing
> through A, B, and C will reach lids 1-9.
>
> The LFT would normally be:
>
> A: 1 4 7
> B: 2 5 8
> C: 3 6 9
> D:
>
> The Port Shifting would make this:
>
> A: 1 6 8
> B: 2 4 9
> C: 3 5 7
> D:
>
> This option by itself improved the mpiGraph average send/recv bandwidth
> from 420 MB/s and 508 MB/s to 991 MB/s and 1172 MB/s.
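
For illustration, here is one simple rule that reproduces the two tables
above: rotate the starting port by one on every pass over the used ports.
This is only a sketch inferred from the A/B/C/D example, not code from the
patch; the function names round_robin and port_shifted are hypothetical.

```python
def round_robin(lids, used_ports):
    """Plain assignment: lid i goes to used port i mod n."""
    n = len(used_ports)
    table = {port: [] for port in used_ports}
    for i, lid in enumerate(lids):
        table[used_ports[i % n]].append(lid)
    return table


def port_shifted(lids, used_ports):
    """Shifted assignment: each full pass over the used ports starts
    one port later than the previous pass (assumed rule, see lead-in)."""
    n = len(used_ports)
    table = {port: [] for port in used_ports}
    for i, lid in enumerate(lids):
        shift = i // n  # number of completed passes so far
        table[used_ports[(i + shift) % n]].append(lid)
    return table


# Port D is left out because, as in the example, it cannot reach lids 1-9.
lids = list(range(1, 10))
normal = round_robin(lids, ["A", "B", "C"])
shifted = port_shifted(lids, ["A", "B", "C"])
```

With three used ports, every third lid no longer lands on the same port,
which is the "alignment" the option is meant to break up.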
>
> 2) Remote Guid Sorting
>
> Most core/spine switches we've seen have had line boards connected to
> spine boards in a consistent pattern. However, we recently got some
> Qlogic switches that connect from line/leaf boards to spine boards in a
> (to the casual observer) random pattern. I'm sure there was a good
> electrical/board reason for this design, but it does hurt routing b/c
> some of the opensm routing algorithms assume a consistent connection
> pattern. Here's some output from iblinkinfo as an example.
>
> Switch 0x00066a00ec0029b8 ibcore1 L123:
> 180 1[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 254 19[ ] "ibsw55" ( )
> 180 2[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 253 19[ ] "ibsw56" ( )
> 180 3[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 258 19[ ] "ibsw57" ( )
> 180 4[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 257 19[ ] "ibsw58" ( )
> 180 5[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 256 19[ ] "ibsw59" ( )
> 180 6[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 255 19[ ] "ibsw60" ( )
> 180 7[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 261 19[ ] "ibsw61" ( )
> 180 8[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 262 19[ ] "ibsw62" ( )
> 180 9[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 260 19[ ] "ibsw63" ( )
> 180 10[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 259 19[ ] "ibsw64" ( )
> 180 11[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 284 19[ ] "ibsw65" ( )
> 180 12[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 285 19[ ] "ibsw66" ( )
> 180 13[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 2227 19[ ] "ibsw67" ( )
> 180 14[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 283 19[ ] "ibsw68" ( )
> 180 15[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 267 19[ ] "ibsw69" ( )
> 180 16[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 270 19[ ] "ibsw70" ( )
> 180 17[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 269 19[ ] "ibsw71" ( )
> 180 18[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 268 19[ ] "ibsw72" ( )
> 180 19[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 222 17[ ] "ibcore1 S117B" ( )
> 180 20[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 209 19[ ] "ibcore1 S211B" ( )
> 180 21[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 218 21[ ] "ibcore1 S117A" ( )
> 180 22[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 192 23[ ] "ibcore1 S215B" ( )
> 180 23[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 85 15[ ] "ibcore1 S209A" ( )
> 180 24[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 182 13[ ] "ibcore1 S215A" ( )
> 180 25[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 200 11[ ] "ibcore1 S115B" ( )
> 180 26[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 129 25[ ] "ibcore1 S209B" ( )
> 180 27[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 213 27[ ] "ibcore1 S115A" ( )
> 180 28[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 197 29[ ] "ibcore1 S213B" ( )
> 180 29[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 178 28[ ] "ibcore1 S111A" ( )
> 180 30[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 215 7[ ] "ibcore1 S213A" ( )
> 180 31[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 207 5[ ] "ibcore1 S113B" ( )
> 180 32[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 212 6[ ] "ibcore1 S211A" ( )
> 180 33[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 154 33[ ] "ibcore1 S113A" ( )
> 180 34[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 194 35[ ] "ibcore1 S217B" ( )
> 180 35[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 191 3[ ] "ibcore1 S111B" ( )
> 180 36[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 219 1[ ] "ibcore1 S217A" ( )
>
> This is a line board that connects up to spine boards (ibcore1 S*
> switches) and down to leaf/edge switches (ibsw*). As you can see the
> line board connects to the ports on the spine switches in a random
> fashion (to the casual observer).
>
> The "remote_guid_sorting" option will slightly tweak routing so that
> instead of finding a port to route through by searching ports 1 to N, it
> will (effectively) sort the ports based on remote connected node guid,
> then pick a port searching from lowest guid to highest guid. That way
> the routing calculations across each line/leaf board and spine switch
> will be consistent.
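
The effect of the sort can be sketched as follows. This is a minimal
illustration, not the patch itself: search_order is a hypothetical helper,
and the port numbers and GUIDs below are made up for the example.

```python
def search_order(port_to_remote_guid, by_remote_guid=False):
    """Return local port numbers in the order they would be searched.

    Default: ascending port number (ports 1 to N).
    by_remote_guid=True: ascending GUID of the node on the far end.
    """
    ports = sorted(port_to_remote_guid)
    if by_remote_guid:
        ports.sort(key=lambda port: port_to_remote_guid[port])
    return ports


# Two line boards cabled to the same three spines, but in a different
# port order (hypothetical port numbers and GUIDs).
board_a = {19: 0x30, 20: 0x10, 21: 0x20}
board_b = {19: 0x10, 20: 0x20, 21: 0x30}

# Spine GUIDs in the order each board would consider them.
by_port_a = [board_a[p] for p in search_order(board_a)]
by_port_b = [board_b[p] for p in search_order(board_b)]
by_guid_a = [board_a[p] for p in search_order(board_a, by_remote_guid=True)]
by_guid_b = [board_b[p] for p in search_order(board_b, by_remote_guid=True)]
```

Searched by port number, the two boards visit the spines in different
orders; searched by remote GUID, both visit them in the same order, which
is what makes the per-board routing decisions line up.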
>
> This patch (on top of the port_shifting one above) improved the mpiGraph
> average send/recv bandwidth from 991 MB/s and 1172 MB/s to 1045 MB/s and
> 1228 MB/s.
>
> Al
>