public inbox for linux-rdma@vger.kernel.org
* insight into a WARNING from softROCE
@ 2017-12-08 19:50 Olga Kornievskaia
       [not found] ` <CAN-5tyFENoK6f20zjeUEbXNQh-bZAaH-iBf4Q6G8uLjz0eHnqA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 6+ messages in thread
From: Olga Kornievskaia @ 2017-12-08 19:50 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA

Hi folks,

Can somebody give me some insight into the following WARNING (at the end
of the message) that I see logged in /var/log/messages while using
softROCE (NFSoRDMA)? It is typically associated with a hiccup (long
delays) in the communication I see happening over RDMA.

It's coming from the WARN here in rxe_comp.c:

                case COMPST_ERROR:
                        WARN_ON_ONCE(wqe->status == IB_WC_SUCCESS);
                        do_complete(qp, wqe);
                        rxe_qp_error(qp);

                        if (pkt) {
                                rxe_drop_ref(pkt->qp);

With a few printks I tracked it down:
COMPST_ERROR is coming from the "retry counter exceeded" case
(RXE_CNT_RETRY_EXCEEDED) in COMPST_ERROR_RETRY. COMPST_ERROR_RETRY is
coming from check_psn(). I see that the packet psn is greater than the wqe
psn. I have noticed that this can happen (but not always) after
update_wqe_psn() leaves the number of packets left to send at some value
larger than 1.
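
To make the "packet psn is greater than the wqe psn" comparison concrete,
here is a minimal standalone sketch of comparing two 24-bit PSNs with
wraparound. It is an illustration only and not the rxe code; the helper
name psn_cmp24 is made up for this example.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not from rxe): compare two 24-bit packet
 * sequence numbers, treating the PSN space as circular.
 * Returns 1 if a is after b, -1 if a is before b, 0 if equal.
 */
static int psn_cmp24(uint32_t a, uint32_t b)
{
	/* Distance from b to a, reduced to the 24-bit PSN space. */
	uint32_t d = (a - b) & 0xFFFFFFu;

	if (d == 0)
		return 0;

	/* Bit 23 set means a is more than half the space "ahead",
	 * which is interpreted as a being behind b. */
	return (d & 0x800000u) ? -1 : 1;
}
```

With this helper, psn_cmp24(pkt_psn, wqe_psn) > 0 would correspond to the
case described above, including when the PSN has wrapped past 0xFFFFFF.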

The goal is to figure out why the hiccups are happening, and I think this is a clue.

Thank you for any info.

Dec  5 16:42:16 localhost kernel: ------------[ cut here ]------------
Dec  5 16:42:16 localhost kernel: WARNING: CPU: 0 PID: 0 at drivers/infiniband/sw/rxe/rxe_comp.c:741 rxe_completer+0xd84/0xe30 [rdma_rxe]
Dec  5 16:42:16 localhost kernel: Modules linked in: rpcrdma ib_ucm
ib_umad rdma_rxe ip6_udp_tunnel udp_tunnel rdma_ucm rdma_cm iw_cm
ib_cm ib_uverbs ib_core rfcomm fuse ip6t_rpfilter ipt_REJECT
nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ebtable_nat
ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6
nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security
ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4
nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw
ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter bnep
snd_seq_midi snd_seq_midi_event coretemp crc32_pclmul ext4
ghash_clmulni_intel mbcache jbd2 aesni_intel snd_ens1371
snd_ac97_codec glue_helper ppdev lrw ac97_bus snd_seq gf128mul
uvcvideo ablk_helper cryptd vmw_balloon videobuf2_vmalloc
videobuf2_memops
Dec  5 16:42:16 localhost kernel: btusb snd_pcm videobuf2_core pcspkr
btrtl videodev btbcm btintel snd_timer snd_rawmidi bluetooth
snd_seq_device snd vmw_vmci rfkill shpchp i2c_piix4 soundcore
parport_pc parport nfsd auth_rpcgss nfs_acl lockd grace sunrpc
ip_tables xfs libcrc32c sr_mod cdrom vmwgfx sd_mod crc_t10dif
crct10dif_generic drm_kms_helper ata_generic syscopyarea sysfillrect
sysimgblt fb_sys_fops ttm drm pata_acpi crct10dif_pclmul ahci
crct10dif_common mptspi crc32c_intel libahci scsi_transport_spi
mptscsih serio_raw ata_piix libata mptbase e1000 i2c_core dm_mirror
dm_region_hash dm_log dm_mod
Dec  5 16:42:16 localhost kernel: CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.10.0 #2
Dec  5 16:42:16 localhost kernel: Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
Dec  5 16:42:16 localhost kernel: Call Trace:
Dec  5 16:42:16 localhost kernel: <IRQ>  [<ffffffff94cb9865>] dump_stack+0x19/0x1b
Dec  5 16:42:16 localhost kernel: [<ffffffff94686968>] __warn+0xd8/0x100
Dec  5 16:42:16 localhost kernel: [<ffffffff94686aad>] warn_slowpath_null+0x1d/0x20
Dec  5 16:42:16 localhost kernel: [<ffffffffc0a73734>] rxe_completer+0xd84/0xe30 [rdma_rxe]
Dec  5 16:42:16 localhost kernel: [<ffffffffc0a7f24f>] rxe_do_task+0x9f/0x110 [rdma_rxe]
Dec  5 16:42:16 localhost kernel: [<ffffffffc0a7f3b8>] rxe_run_task+0x18/0x40 [rdma_rxe]
Dec  5 16:42:16 localhost kernel: [<ffffffffc0a729a5>] rxe_comp_queue_pkt+0x45/0x50 [rdma_rxe]
Dec  5 16:42:16 localhost kernel: [<ffffffffc0a77bf8>] rxe_rcv+0x2a8/0x920 [rdma_rxe]
Dec  5 16:42:16 localhost kernel: [<ffffffffc03bcc5f>] ? ipt_do_table+0x31f/0x4f0 [ip_tables]
Dec  5 16:42:16 localhost kernel: [<ffffffffc0a7fa10>] ? net_to_rxe+0x80/0x80 [rdma_rxe]
Dec  5 16:42:16 localhost kernel: [<ffffffffc0a7fa73>] rxe_udp_encap_recv+0x63/0xa0 [rdma_rxe]
Dec  5 16:42:16 localhost kernel: [<ffffffffc0a7fa73>] ? rxe_udp_encap_recv+0x63/0xa0 [rdma_rxe]
Dec  5 16:42:16 localhost kernel: [<ffffffff94c148bb>] udp_queue_rcv_skb+0x1bb/0x4a0
Dec  5 16:42:16 localhost kernel: [<ffffffff94c15108>] __udp4_lib_rcv+0x568/0xb90
Dec  5 16:42:16 localhost kernel: [<ffffffffc09281de>] ? ipv4_confirm+0x4e/0x100 [nf_conntrack_ipv4]
Dec  5 16:42:16 localhost kernel: [<ffffffff94c15b9a>] udp_rcv+0x1a/0x20
Dec  5 16:42:16 localhost kernel: [<ffffffff94be30ce>] ip_local_deliver_finish+0x8e/0x1d0
Dec  5 16:42:16 localhost kernel: [<ffffffff94be33b9>] ip_local_deliver+0x59/0xd0
Dec  5 16:42:16 localhost kernel: [<ffffffff94be3040>] ? ip_rcv_finish+0x300/0x300
Dec  5 16:42:16 localhost kernel: [<ffffffff94be2db8>] ip_rcv_finish+0x78/0x300
Dec  5 16:42:16 localhost kernel: [<ffffffff94be36e6>] ip_rcv+0x2b6/0x410
Dec  5 16:42:16 localhost kernel: [<ffffffff94be2d40>] ? inet_del_offload+0x40/0x40
Dec  5 16:42:16 localhost kernel: [<ffffffff94b9f9d4>] __netif_receive_skb_core+0x2e4/0x820
Dec  5 16:42:16 localhost kernel: [<ffffffff94b9ff28>] __netif_receive_skb+0x18/0x60
Dec  5 16:42:16 localhost kernel: [<ffffffff94b9ffb0>] netif_receive_skb_internal+0x40/0xc0
Dec  5 16:42:16 localhost kernel: [<ffffffff94ba0b58>] napi_gro_receive+0xd8/0x100
Dec  5 16:42:16 localhost kernel: [<ffffffffc01f33e8>] e1000_clean_rx_irq+0x2b8/0x510 [e1000]
Dec  5 16:42:16 localhost kernel: [<ffffffffc01f4078>] e1000_clean+0x278/0x8d0 [e1000]
Dec  5 16:42:16 localhost kernel: [<ffffffff94ba0483>] net_rx_action+0x123/0x320
Dec  5 16:42:16 localhost kernel: [<ffffffff9468fb4f>] __do_softirq+0xef/0x280
Dec  5 16:42:16 localhost kernel: [<ffffffff94ccc51c>] call_softirq+0x1c/0x30
Dec  5 16:42:16 localhost kernel: [<ffffffff9462c4c5>] do_softirq+0x65/0xa0
Dec  5 16:42:16 localhost kernel: [<ffffffff9468fed5>] irq_exit+0x105/0x110
Dec  5 16:42:16 localhost kernel: [<ffffffff94ccd036>] do_IRQ+0x56/0xe0
Dec  5 16:42:16 localhost kernel: [<ffffffff94cc1b2d>] common_interrupt+0x6d/0x6d
Dec  5 16:42:16 localhost kernel: <EOI>  [<ffffffff94cc0dd6>] ? native_safe_halt+0x6/0x10
Dec  5 16:42:16 localhost kernel: [<ffffffff94cc0c6e>] ? default_idle+0x1e/0xc0
Dec  5 16:42:16 localhost kernel: [<ffffffff94633f86>] ? arch_cpu_idle+0x26/0x30
Dec  5 16:42:16 localhost kernel: [<ffffffff946e6efa>] ? cpu_startup_entry+0x14a/0x1c0
Dec  5 16:42:16 localhost kernel: [<ffffffff94ca8e17>] ? rest_init+0x77/0x80
Dec  5 16:42:16 localhost kernel: [<ffffffff9516c05a>] ? start_kernel+0x433/0x454
Dec  5 16:42:16 localhost kernel: [<ffffffff9516ba30>] ? repair_env_string+0x5c/0x5c
Dec  5 16:42:16 localhost kernel: [<ffffffff9516b120>] ? early_idt_handler_array+0x120/0x120
Dec  5 16:42:16 localhost kernel: [<ffffffff9516b5ef>] ? x86_64_start_reservations+0x24/0x26
Dec  5 16:42:16 localhost kernel: [<ffffffff9516b740>] ? x86_64_start_kernel+0x14f/0x172
Dec  5 16:42:16 localhost kernel: [<ffffffff946001a5>] ? start_cpu+0x5/0x14
Dec  5 16:42:16 localhost kernel: ---[ end trace c96ed928ed9503ca ]---
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: insight into a WARNING from softROCE
       [not found] ` <CAN-5tyFENoK6f20zjeUEbXNQh-bZAaH-iBf4Q6G8uLjz0eHnqA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2017-12-19 13:15   ` Leon Romanovsky
  2017-12-21  8:19   ` Moni Shoua
  1 sibling, 0 replies; 6+ messages in thread
From: Leon Romanovsky @ 2017-12-19 13:15 UTC (permalink / raw)
  To: Olga Kornievskaia, Moni Shoua, Yonatan Cohen
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA

Moni/Yonatan?




* Re: insight into a WARNING from softROCE
       [not found] ` <CAN-5tyFENoK6f20zjeUEbXNQh-bZAaH-iBf4Q6G8uLjz0eHnqA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2017-12-19 13:15   ` Leon Romanovsky
@ 2017-12-21  8:19   ` Moni Shoua
       [not found]     ` <CAG9sBKM2qTSrB2sZnrLsUxAhL+ADursUTyPojd8HY5PF4Nf9Zg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  1 sibling, 1 reply; 6+ messages in thread
From: Moni Shoua @ 2017-12-21  8:19 UTC (permalink / raw)
  To: Olga Kornievskaia; +Cc: linux-rdma

Hi Olga,
As far as I can tell, the warning at
drivers/infiniband/sw/rxe/rxe_comp.c:741 was reached through check_psn() ->
COMPST_ERROR_RETRY -> COMPST_ERROR. In that case wqe->status should
have been IB_WC_RETRY_EXC_ERR and not IB_WC_SUCCESS.
Can you please be more specific and explain how you reached this conclusion?
BTW, what was the test you were running?
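
The path described above can be sketched as a small standalone model. This
is a simplification for illustration, assuming the behaviour described in
this message; it is not the rxe source, and all names in it (model_qp,
model_wqe, retry_step, and the enum values) are hypothetical.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the completer states and WC status codes. */
enum model_state { ST_RETRANSMIT, ST_ERROR };
enum model_status { WC_SUCCESS, WC_RETRY_EXC_ERR };

struct model_wqe { enum model_status status; };
struct model_qp  { int retry_cnt; };

/*
 * Models the ERROR_RETRY step: while retries remain, consume one and
 * ask for a retransmit; once they are exhausted, mark the wqe as
 * "retry exceeded" before moving to the error state.
 */
static enum model_state retry_step(struct model_qp *qp, struct model_wqe *wqe)
{
	if (qp->retry_cnt > 0) {
		qp->retry_cnt--;
		return ST_RETRANSMIT;
	}
	wqe->status = WC_RETRY_EXC_ERR;
	return ST_ERROR;
}

/* The condition the WARN checks: an error state with a SUCCESS status
 * is inconsistent. */
static bool error_state_is_consistent(const struct model_wqe *wqe)
{
	return wqe->status != WC_SUCCESS;
}
```

In this model the error state is only entered after the status has been
set, so the check in error_state_is_consistent() always holds. A warning
firing would mean the error state was reached by some other route.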

Second, packet drops can lead to hiccups in performance. I'm not sure
whether you are reporting a bug that makes RXE assume that packets were
dropped, or whether actual drops happened.

thanks




* Re: insight into a WARNING from softROCE
       [not found]     ` <CAG9sBKM2qTSrB2sZnrLsUxAhL+ADursUTyPojd8HY5PF4Nf9Zg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2017-12-22 15:43       ` Olga Kornievskaia
       [not found]         ` <581B7086-DBD2-4121-8A71-03E10757B4A1-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 6+ messages in thread
From: Olga Kornievskaia @ 2017-12-22 15:43 UTC (permalink / raw)
  To: Moni Shoua; +Cc: Olga Kornievskaia, linux-rdma


Hi Moni, 

> On Dec 21, 2017, at 3:19 AM, Moni Shoua <monis-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote:
> 
> Hi Olga,
> As far as I can tell, the warning at
> drivers/infiniband/sw/rxe/rxe_comp.c:741 was reached through check_psn() ->
> COMPST_ERROR_RETRY -> COMPST_ERROR. In that case wqe->status should
> have been IB_WC_RETRY_EXC_ERR and not IB_WC_SUCCESS.

My conclusion came from trying to figure out why the warning appeared in
/var/log/messages, followed by an error that the retry limit was exceeded
and by the connection manager dropping and re-establishing the connection.
It sounds like my conclusion wasn't correct.

It seems this warning is meant to signal that something went wrong in the
code: a wqe status of success while the completer is in the error state is
unexpected.

I thought the developers would be interested in investigating, but maybe
it’s not an interesting condition, especially since it sounds like your
assessment is that packet loss causes the hiccup.

In that case, I wish there were a warning that reports packet loss to the
user. I understand the protocol assumes lossless communication, so it
shouldn’t have to deal with packet loss.

> Can you please be more specific and explain how you reached this conclusion?

What other specifics can I provide? I added printks to trace the WARN
message. Should I share a patch with the printks, along with the output?

> BTW, what was the test you were running?

I was running the NFS test suite (cthon) on a laptop running two VMs.

> Second, packet drops can lead to hiccups in performance. I'm not sure
> whether you are reporting a bug that makes RXE assume that packets were
> dropped, or whether actual drops happened.

I’m not sure whether the setup experienced packet drops. What could I check
to tell you whether RXE was experiencing packet loss? I don’t believe I saw
any packet loss in the Wireshark capture that was running at the time.

Thank you. 
>> cpu_startup_entry+0x14a/0x1c0
>> Dec  5 16:42:16 localhost kernel: [<ffffffff94ca8e17>] ? rest_init+0x77/0x80
>> Dec  5 16:42:16 localhost kernel: [<ffffffff9516c05a>] ?
>> start_kernel+0x433/0x454
>> Dec  5 16:42:16 localhost kernel: [<ffffffff9516ba30>] ?
>> repair_env_string+0x5c/0x5c
>> Dec  5 16:42:16 localhost kernel: [<ffffffff9516b120>] ?
>> early_idt_handler_array+0x120/0x120
>> Dec  5 16:42:16 localhost kernel: [<ffffffff9516b5ef>] ?
>> x86_64_start_reservations+0x24/0x26
>> Dec  5 16:42:16 localhost kernel: [<ffffffff9516b740>] ?
>> x86_64_start_kernel+0x14f/0x172
>> Dec  5 16:42:16 localhost kernel: [<ffffffff946001a5>] ? start_cpu+0x5/0x14
>> Dec  5 16:42:16 localhost kernel: ---[ end trace c96ed928ed9503ca ]---
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: insight into a WARNING from softROCE
       [not found]         ` <581B7086-DBD2-4121-8A71-03E10757B4A1-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2017-12-25 11:09           ` Moni Shoua
       [not found]             ` <CAG9sBKOQ-gn8int2TSjnEo147jR2=Pnv7xKJqgs5iV3S7qFTEQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 6+ messages in thread
From: Moni Shoua @ 2017-12-25 11:09 UTC (permalink / raw)
  To: Olga Kornievskaia; +Cc: Olga Kornievskaia, linux-rdma

On Fri, Dec 22, 2017 at 5:43 PM, Olga Kornievskaia
<olga.kornievskaia-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>
> Hi Moni,
>
>> On Dec 21, 2017, at 3:19 AM, Moni Shoua <monis-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote:
>>
>> Hi Olga
>> As far as I can tell the warning in
>> drivers/infiniband/sw/rxe/rxe_comp.c:741 went through check_psn() ->
>> COMPST_ERROR_RETRY -> COMPST_ERROR. In that case the wqe_status should
>> have been IB_WC_RETRY_EXC_ERR and not IB_WC_SUCCESS.
>
> My conclusion was from trying to figure out why the warning was seen in var log messages which then followed with error that retry limit exceeded and connection manager dropping and re establishing the connection. Sounds like my conclusion wasnt correct.
>
> It seems like this warning is meant to signal that something went wrong in the code and this state of wqe status success yet being in error state is unexpected.
>
> I thought the developers would be interested in investigating but maybe it’s not an interesting condition. Specially since it sounds like your assessment is that packet loss causes the hiccup.
>
> I wish then there was a warning that notes packet loss and warns the user. I understand the protocol assumes lossless communication so it shouldn’t be dealing w packet loss.
>
>> Can you please be more specific and explain how did you get to this conclusion?
>
> What other specifics can I provide? I added printks trying to trace the WARN message. Should I share a patch w printks w the output?
I wonder how we reach this point with the status still IB_WC_SUCCESS
if we got here through the retry exceeded error. If you traced it,
maybe you can explain.
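
To make the expectation concrete, here is a minimal user-space sketch of
the invariant the WARN checks. It is not the rxe code: every name with a
toy_ prefix is a hypothetical stand-in, and the retry logic is a
simplified assumption about the check_psn() -> COMPST_ERROR_RETRY ->
COMPST_ERROR path discussed above. The point is only that the failure
status should be recorded on the wqe before the error state is entered.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are NOT the kernel's definitions. */
enum toy_wc_status { TOY_WC_SUCCESS, TOY_WC_RETRY_EXC_ERR };
enum toy_state { TOY_RETRY_AGAIN, TOY_ERROR };

struct toy_wqe {
	enum toy_wc_status status;
};

/*
 * Model of one retry step: while retries remain, consume one and try
 * again; once they are exhausted, record the failure on the wqe
 * *before* moving to the error state.
 */
static enum toy_state toy_error_retry(struct toy_wqe *wqe, int *retry_cnt)
{
	assert(wqe != NULL && retry_cnt != NULL);

	if (*retry_cnt > 0) {
		(*retry_cnt)--;
		return TOY_RETRY_AGAIN;
	}

	wqe->status = TOY_WC_RETRY_EXC_ERR;
	return TOY_ERROR;
}

/* The condition the WARN flags: error state reached with a success status. */
static bool toy_error_state_is_inconsistent(const struct toy_wqe *wqe)
{
	return wqe->status == TOY_WC_SUCCESS;
}
```

If the real path behaved like this model, the WARN could not fire after
retries are exhausted, so a trace showing where the status is (or is
not) updated would help narrow down how the error state is reached.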
>
>> BTW, what was the test you were running?
>
> I was running NFS testsuite (cthon). This was done on a laptop running 2VMs.
I can't promise that we will run it immediately, but if you provide
a HOWTO for this test I will appreciate it.
>


* Re: insight into a WARNING from softROCE
       [not found]             ` <CAG9sBKOQ-gn8int2TSjnEo147jR2=Pnv7xKJqgs5iV3S7qFTEQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2018-01-04 21:59               ` Olga Kornievskaia
  0 siblings, 0 replies; 6+ messages in thread
From: Olga Kornievskaia @ 2018-01-04 21:59 UTC (permalink / raw)
  To: Moni Shoua; +Cc: linux-rdma

On Mon, Dec 25, 2017 at 6:09 AM, Moni Shoua <monis-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote:
> On Fri, Dec 22, 2017 at 5:43 PM, Olga Kornievskaia
> <olga.kornievskaia-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>>
>> Hi Moni,
>>
>>> On Dec 21, 2017, at 3:19 AM, Moni Shoua <monis-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote:
>>>
>>> Hi Olga
>>> As far as I can tell the warning in
>>> drivers/infiniband/sw/rxe/rxe_comp.c:741 went through check_psn() ->
>>> COMPST_ERROR_RETRY -> COMPST_ERROR. In that case the wqe_status should
>>> have been IB_WC_RETRY_EXC_ERR and not IB_WC_SUCCESS.
>>
>> My conclusion was from trying to figure out why the warning was seen in var log messages which then followed with error that retry limit exceeded and connection manager dropping and re establishing the connection. Sounds like my conclusion wasnt correct.
>>
>> It seems like this warning is meant to signal that something went wrong in the code and this state of wqe status success yet being in error state is unexpected.
>>
>> I thought the developers would be interested in investigating but maybe it’s not an interesting condition. Specially since it sounds like your assessment is that packet loss causes the hiccup.
>>
>> I wish then there was a warning that notes packet loss and warns the user. I understand the protocol assumes lossless communication so it shouldn’t be dealing w packet loss.
>>
>>> Can you please be more specific and explain how did you get to this conclusion?
>>
>> What other specifics can I provide? I added printks trying to trace the WARN message. Should I share a patch w printks w the output?
> I wonder how we reach this point with the status still IB_WC_SUCCESS
> if we got here through the retry exceeded error. If you traced it,
> maybe you can explain.

Sorry, I was pulled off to do testing over hardware RDMA and haven't
gotten back to softROCE yet. I will try to get back to it soon. I did
re-run this on real machines (instead of VMs) and saw the same WARNING
(on a 4.15-rc4 kernel). In this case, softROCE was run over 10G NICs
using ibg drivers.

>>> BTW, what was the test you were running?
>>
>> I was running NFS testsuite (cthon). This was done on a laptop running 2VMs.
> I can't promise that we will run it immediately, but if you provide
> a HOWTO for this test I will appreciate it.

git clone git://git.linux-nfs.org/projects/steved/cthon04.git
cd cthon04
make
mount -o vers=4.1,rdma,port=20049 <serverip>:<servermountpoint> /mnt
(If instructions are needed to set up NFS over RDMA, that really depends
on which distro you are using. Red Hat has a HOWTO for setting up NFSoRDMA.)
./runtest -a -f /mnt/data


Thread overview: 6+ messages
2017-12-08 19:50 insight into a WARNING from softROCE Olga Kornievskaia
     [not found] ` <CAN-5tyFENoK6f20zjeUEbXNQh-bZAaH-iBf4Q6G8uLjz0eHnqA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2017-12-19 13:15   ` Leon Romanovsky
2017-12-21  8:19   ` Moni Shoua
     [not found]     ` <CAG9sBKM2qTSrB2sZnrLsUxAhL+ADursUTyPojd8HY5PF4Nf9Zg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2017-12-22 15:43       ` Olga Kornievskaia
     [not found]         ` <581B7086-DBD2-4121-8A71-03E10757B4A1-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2017-12-25 11:09           ` Moni Shoua
     [not found]             ` <CAG9sBKOQ-gn8int2TSjnEo147jR2=Pnv7xKJqgs5iV3S7qFTEQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2018-01-04 21:59               ` Olga Kornievskaia
