From mboxrd@z Thu Jan  1 00:00:00 1970
From: Yevgeny Kliteynik <kliteyn-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
Subject: Re: QoS settings not mapped correctly per pkey ?
Date: Wed, 25 Nov 2009 16:37:02 +0200
Message-ID: <4B0D410E.2010903@dev.mellanox.co.il>
References: <4B0D0DB2.6080802@bull.net> <4B0D1F36.1090007@dev.mellanox.co.il> <4B0D38C7.3080505@bull.net>
Reply-To: kliteyn-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-15; format=flowed
Content-Transfer-Encoding: 7bit
Return-path: <linux-rdma-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>
In-Reply-To: <4B0D38C7.3080505-6ktuUTfB/bM@public.gmane.org>
Sender: linux-rdma-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
To: Vincent Ficet <jean-vincent.ficet-6ktuUTfB/bM@public.gmane.org>
Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, BOURDE CELINE <Celine.Bourde-6ktuUTfB/bM@public.gmane.org>
List-Id: linux-rdma@vger.kernel.org

Vincent Ficet wrote:
> Yevgeny,
> 
>> Hi Vincent,
>>
>> Vincent Ficet wrote:
>>> Hello,
>>>
>>> Following the QoS experiments I carried out yesterday, I wanted to set
>>> up 3 IP networks, each one bound to a particular pkey, in order to
>>> achieve QoS for each network.
>>> Unfortunately, it seems that something is not mapped properly in the ULP
>>> layers (vlarb tables are fine).
>>>
>>> The settings are as follows:
>>>
>>> opensm.conf:
>>> ------------
>>>
>>> qos_max_vls    8
>>> qos_high_limit 1
>>> qos_vlarb_high 0:0,1:0,2:0,3:0,4:0,5:0
>>> qos_vlarb_low  0:8,1:1,2:1,3:4,4:0,5:0
>>> qos_sl2vl      0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
>> Please check section 7 of the QoS_management_in_OpenSM.txt
>> doc. It explains what exactly is the meaning of the values
>> in the VLArb table. It also has explanation of the problem
>> that you're seeing. Quoting from there:
>>
>> "Keep in mind that ports usually transmit packets of
>>  size equal to MTU. For instance, for 4KB MTU a single
>>  packet will require 64 credits, so in order to achieve
>>  effective VL arbitration for packets of 4KB MTU, the
>>  weighting values for each VL should be multiples of 64."
>>
> OK, I see the point.
> 
> To check that it works as you said, we changed the IPoIB MTU from 2044
> to 2000 in order to make sure that it fits into the IB MTU, which is set
> to 2K on our cluster.
> In theory, such a 2K packet would require 32 credits of 64 bytes each.
> 
> We changed the vlarb tables with increments of 32 (for VL 1,2,3):
> 
> qos_max_vls    8
> qos_high_limit 1
> qos_vlarb_high 0:0,1:0,2:0,3:0,4:0,5:0
> qos_vlarb_low  0:8,1:32,2:64,3:96,4:0,5:0
> qos_sl2vl      0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
> 
> and we also tried increments of 64:
> 
> qos_max_vls    8
> qos_high_limit 1
> qos_vlarb_high 0:0,1:0,2:0,3:0,4:0,5:0
> qos_vlarb_low  0:8,1:64,2:128,3:192,4:0,5:0
> qos_sl2vl      0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
> 
> But still, it does not make any difference:
> 
> [root@pichu22 ~]# while test -e keep_going; do iperf -c pichu16-ic0 -t 20 2>&1; done | grep Gbits/sec
> [  3]  0.0-20.0 sec  13.0 GBytes  5.57 Gbits/sec
> [  3]  0.0-20.0 sec  12.9 GBytes  5.53 Gbits/sec
> [  3]  0.0-20.0 sec  12.0 GBytes  5.17 Gbits/sec
> 
> [root@pichu22 ~]# while test -e keep_going; do iperf -c pichu16-backbone -t 20 2>&1; done | grep Gbits/sec
> [  3]  0.0-20.0 sec  13.1 GBytes  5.61 Gbits/sec
> [  3]  0.0-20.0 sec  11.9 GBytes  5.09 Gbits/sec
> [  3]  0.0-20.0 sec  9.43 GBytes  4.05 Gbits/sec
> 
> [root@pichu22 ~]# while test -e keep_going; do iperf -c pichu16-admin -t 20 2>&1; done | grep Gbits/sec
> [  3]  0.0-20.0 sec  10.5 GBytes  4.50 Gbits/sec
> [  3]  0.0-20.0 sec  12.3 GBytes  5.28 Gbits/sec
> [  3]  0.0-20.0 sec  12.0 GBytes  5.15 Gbits/sec
> 
> Any other idea ?
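
The credit arithmetic above can be checked mechanically. What follows
is only a minimal sketch, not part of OpenSM: the helper names are
made up for illustration, and the 64-byte credit size is taken from
the QoS_management_in_OpenSM.txt passage quoted earlier. It reports
which VLs in a qos_vlarb_low string have a non-zero weight that is
not a whole multiple of one MTU-sized packet.

```python
# Sketch only: helper names are hypothetical, not an OpenSM API.
# Assumption (from the quoted doc): VLArb weights count 64-byte credits.
CREDIT_BYTES = 64


def credits_per_packet(mtu_bytes, credit_bytes=CREDIT_BYTES):
    """Credits consumed by one MTU-sized packet, rounded up."""
    return -(-mtu_bytes // credit_bytes)


def parse_vlarb(entry):
    """Turn '0:8,1:32' into [(0, 8), (1, 32)], i.e. (VL, weight) pairs."""
    pairs = []
    for item in entry.split(","):
        vl, weight = item.split(":")
        pairs.append((int(vl), int(weight)))
    return pairs


def misaligned_vls(entry, mtu_bytes):
    """VLs whose non-zero weight is not a multiple of one packet's credits."""
    per_packet = credits_per_packet(mtu_bytes)
    return [vl for vl, w in parse_vlarb(entry) if w and w % per_packet]


if __name__ == "__main__":
    # Second table from this thread, checked against the 2K IB MTU.
    print(misaligned_vls("0:8,1:32,2:64,3:96,4:0,5:0", 2048))
```

With the 2K MTU, only VL0 (weight 8) is flagged in the two newer
tables, while every non-zero weight in the original table is flagged.
This is a check of the arithmetic only; it says nothing about whether
arbitration actually kicks in on the wire.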

OK, so there are three possible reasons that I can think of:
1. Something is wrong in the configuration.
2. The application does not saturate the link, thus QoS
   and the whole VL arbitration thing doesn't kick in.
3. There's some bug, somewhere.

Let's start with reason no. 1.
Please shut off the SLs one at a time, and make
sure that the application gets zero BW on the SL
that is shut off. You can do it by mapping the SL
to VL15 (data packets on such an SL are dropped):

 qos_sl2vl      0,15,2,3,4,5,6,7,8,9,10,11,12,13,14,15

and then 

 qos_sl2vl      0,1,15,3,4,5,6,7,8,9,10,11,12,13,14,15

and then 

 qos_sl2vl      0,1,2,15,4,5,6,7,8,9,10,11,12,13,14,15

If this part works well, then we will continue to
reason no. 2.
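
To avoid typos when editing opensm.conf by hand, the three lines
above could be generated. This is just a sketch under the assumption
of 16 SLs with an identity SL-to-VL mapping, as in the configuration
in this thread; the function name is hypothetical and not part of
OpenSM.

```python
# Sketch only: builds a qos_sl2vl line with one SL mapped to VL15.
# Assumes the identity SL->VL mapping used in this thread as a baseline.
def sl2vl_with_sl_disabled(sl, num_sls=16):
    """Return a qos_sl2vl line where `sl` maps to VL15 and the rest map 1:1."""
    if not 0 <= sl < num_sls:
        raise ValueError("SL out of range")
    vls = list(range(num_sls))
    vls[sl] = 15  # SL mapped to VL15: its data traffic should get zero BW
    return "qos_sl2vl      " + ",".join(str(v) for v in vls)


if __name__ == "__main__":
    # SL 1, 2 and 3 carry the three IPoIB partitions in qos-policy.conf.
    for sl in (1, 2, 3):
        print(sl2vl_with_sl_disabled(sl))
```

Apply one line at a time, restart or re-sweep OpenSM as you normally
do after a config change, and rerun the matching iperf client.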

-- Yevgeny

 
> Thanks for your help.
> 
> Vincent
> 
>> -- Yevgeny
>>
>>
>>> The corresponding VLArb tables are fine on both the server (pichu16) and
>>> the client (pichu22):
>>>
>>> [root@pichu22 network-scripts]# smpquery vlarb -D 0
>>> # VLArbitration tables: DR path slid 65535; dlid 65535; 0 port 0 LowCap 8 HighCap 8
>>> # Low priority VL Arbitration Table:
>>> VL    : |0x0 |0x1 |0x2 |0x3 |0x4 |0x5 |0x0 |0x0 |
>>> WEIGHT: |0x8 |0x1 |0x1 |0x4 |0x0 |0x0 |0x0 |0x0 |
>>> # High priority VL Arbitration Table:
>>> VL    : |0x0 |0x1 |0x2 |0x3 |0x4 |0x5 |0x0 |0x0 |
>>> WEIGHT: |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |
>>>
>>> [root@pichu16 ~]# smpquery vlarb -D 0
>>> # VLArbitration tables: DR path slid 65535; dlid 65535; 0 port 0 LowCap 8 HighCap 8
>>> # Low priority VL Arbitration Table:
>>> VL    : |0x0 |0x1 |0x2 |0x3 |0x4 |0x5 |0x0 |0x0 |
>>> WEIGHT: |0x8 |0x1 |0x1 |0x4 |0x0 |0x0 |0x0 |0x0 |
>>> # High priority VL Arbitration Table:
>>> VL    : |0x0 |0x1 |0x2 |0x3 |0x4 |0x5 |0x0 |0x0 |
>>> WEIGHT: |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |
>>>
>>> partitions.conf:
>>> ---------------
>>>
>>> default=0x7fff,ipoib            : ALL=full;
>>> ip_backbone=0x0001,ipoib        : ALL=full;
>>> ip_admin=0x0002,ipoib            : ALL=full;
>>>
>>> qos-policy.conf:
>>> ---------------
>>>
>>> qos-ulps
>>>     default                : 0 # default SL
>>>     ipoib, pkey 0x7FFF     : 1 # IP with default pkey 0x7FFF
>>>     ipoib, pkey 0x1        : 2 # backbone IP with pkey 0x1
>>>     ipoib, pkey 0x2        : 3 # admin IP with pkey 0x2
>>> end-qos-ulps
>>>
>>> Assigned IP addresses (in /etc/hosts):
>>> -------------------------------------
>>>
>>> 10.12.1.4       pichu16-ic0             # default IPoIB network, pkey 0x7FFF
>>> 10.13.1.4       pichu16-backbone        # IPoIB backbone network, pkey 0x1
>>> 10.14.1.4       pichu16-admin           # IPoIB admin network, pkey 0x2
>>> 10.12.1.10      pichu22-ic0             # default IPoIB network, pkey 0x7FFF
>>> 10.13.1.10      pichu22-backbone        # IPoIB backbone network, pkey 0x1
>>> 10.14.1.10      pichu22-admin           # IPoIB admin network, pkey 0x2
>>>
>>> Note that the netmask is /16, so the -ic0, -backbone and -admin networks
>>> cannot see each other.
>>>
>>> IPoIB settings on server side:
>>> ------------------------------
>>>
>>> [root@pichu16 ~]# tail -n 5 /etc/sysconfig/network-scripts/ifcfg-ib0*
>>> ==> /etc/sysconfig/network-scripts/ifcfg-ib0 <==
>>> BOOTPROTO=static
>>> IPADDR=10.12.1.4
>>> NETMASK=255.255.0.0
>>> ONBOOT=yes
>>> MTU=2044
>>>
>>> ==> /etc/sysconfig/network-scripts/ifcfg-ib0.8001 <==
>>> BOOTPROTO=static
>>> IPADDR=10.13.1.4
>>> NETMASK=255.255.0.0
>>> ONBOOT=yes
>>> MTU=2044
>>>
>>> ==> /etc/sysconfig/network-scripts/ifcfg-ib0.8002 <==
>>> BOOTPROTO=static
>>> IPADDR=10.14.1.4
>>> NETMASK=255.255.0.0
>>> ONBOOT=yes
>>> MTU=2044
>>>
>>> [root@pichu16 ~]# ip addr show ib0
>>> 4: ib0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc pfifo_fast state UP qlen 256
>>>     link/infiniband 80:00:00:48:fe:80:00:00:00:00:00:00:2c:90:00:10:0d:00:05:6d brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
>>>     inet 10.12.1.4/16 brd 10.12.255.255 scope global ib0
>>>     inet 10.13.1.4/16 brd 10.13.255.255 scope global ib0
>>>     inet 10.14.1.4/16 brd 10.14.255.255 scope global ib0
>>>     inet6 fe80::2e90:10:d00:56d/64 scope link
>>>        valid_lft forever preferred_lft forever
>>>
>>> IPoIB settings on client side:
>>> ------------------------------
>>>
>>> [root@pichu22 ~]# tail -n 5 /etc/sysconfig/network-scripts/ifcfg-ib0*
>>> ==> /etc/sysconfig/network-scripts/ifcfg-ib0 <==
>>> BOOTPROTO=static
>>> IPADDR=10.12.1.10
>>> NETMASK=255.255.0.0
>>> ONBOOT=yes
>>> MTU=2044
>>>
>>> ==> /etc/sysconfig/network-scripts/ifcfg-ib0.8001 <==
>>> BOOTPROTO=static
>>> IPADDR=10.13.1.10
>>> NETMASK=255.255.0.0
>>> ONBOOT=yes
>>> MTU=2044
>>>
>>> ==> /etc/sysconfig/network-scripts/ifcfg-ib0.8002 <==
>>> BOOTPROTO=static
>>> IPADDR=10.14.1.10
>>> NETMASK=255.255.0.0
>>> ONBOOT=yes
>>> MTU=2044
>>>
>>> [root@pichu22 ~]# ip addr show ib0
>>> 48: ib0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc pfifo_fast state UP qlen 256
>>>     link/infiniband 80:00:00:48:fe:80:00:00:00:00:00:00:2c:90:00:10:0d:00:06:79 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
>>>     inet 10.12.1.10/16 brd 10.12.255.255 scope global ib0
>>>     inet 10.13.1.10/16 brd 10.13.255.255 scope global ib0
>>>     inet 10.14.1.10/16 brd 10.14.255.255 scope global ib0
>>>     inet6 fe80::2e90:10:d00:679/64 scope link
>>>        valid_lft forever preferred_lft forever
>>>
>>> Iperf servers on server side:
>>> -----------------------------
>>>
>>> Quoting from iperf help:
>>>   -B, --bind      <host>   bind to <host>, an interface or multicast address
>>>   -s, --server             run in server mode
>>>
>>> Each iperf server is bound to a dedicated interface as follows:
>>>
>>> [root@pichu16 ~]# iperf -s -B pichu16-backbone
>>> [root@pichu16 ~]# iperf -s -B pichu16-admin
>>> [root@pichu16 ~]# iperf -s -B pichu16-ic0
>>>
>>> Iperf clients on client side:
>>> -----------------------------
>>>
>>> Quoting from iperf help:
>>>   -c, --client    <host>   run in client mode, connecting to <host>
>>>   -t, --time      #        time in seconds to transmit for (default 10 secs)
>>>
>>> And each iperf client talks to the corresponding iperf server:
>>>
>>> [root@pichu22 ~]# while test -e keep_going; do iperf -c pichu16-ic0 -t 100 2>&1; done | grep Gbits/sec
>>> [  3]  0.0-100.0 sec  64.6 GBytes  5.55 Gbits/sec
>>> [  3]  0.0-100.0 sec  64.5 GBytes  5.54 Gbits/sec
>>> [  3]  0.0-100.0 sec  60.5 GBytes  5.20 Gbits/sec
>>> [root@pichu22 ~]# while test -e keep_going; do iperf -c pichu16-backbone -t 100 2>&1; done | grep Gbits/sec
>>> [  3]  0.0-100.0 sec  64.8 GBytes  5.57 Gbits/sec
>>> [  3]  0.0-100.0 sec  56.7 GBytes  4.87 Gbits/sec
>>> [  3]  0.0-100.0 sec  59.7 GBytes  5.13 Gbits/sec
>>> [root@pichu22 ~]# while test -e keep_going; do iperf -c pichu16-admin -t 100 2>&1; done | grep Gbits/sec
>>> [  3]  0.0-100.0 sec  57.3 GBytes  4.92 Gbits/sec
>>> [  3]  0.0-100.0 sec  61.6 GBytes  5.29 Gbits/sec
>>> [  3]  0.0-100.0 sec  62.7 GBytes  5.38 Gbits/sec
>>>
>>> Given the VLarb weights assigned (1 for *-ic0 on VL1, 1 for *-backbone
>>> on VL2 and 4 for *-admin on VL3), we would expect different b/w figures
>>> for the *-admin network.
>>> As we can see, all iperf values are the same, showing that QoS is not
>>> enforced on a per pkey basis.
>>> It seems to me that something is not mapped properly in the ULP layers.
>>> Could anyone tell me if I'm wrong here ? If not, is that a known issue ?
>>>
>>> Thanks for your help,
>>>
>>> Vincent
>>>
>> -- 
>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>
> 
> 
