* kitten - mlx4: Unhandled interrupt - owner bit
From: Fredrik Unger @ 2010-03-10 15:03 UTC
To: linux-rdma-u79uwXL29TY76Z2rM5mHXA
Hi,
I am new to this list, and if my question is misplaced
please suggest a better forum on or off-list.
We are using InfiniBand (core & mlx4 from OFED 1.4.1 + OFED kernel patches)
in a lightweight kernel named Kitten, partially derived from Linux.
http://code.google.com/p/kitten/
We see one or two unhandled interrupts when doing RDMA_READ
data transfers with mlx4 cards. (SEND and RDMA_WRITE work well.)
The problem appears only with larger messages, 1-4 MB.
Write-combining is turned off.
Below is the output of a pingpong test, 1000 iterations per message size,
for example:
<8>(init_task)    Size  Average  Stddev      Min   Median       Max
...
<8>(init_task)  524288   271.79    7.09   138.96   271.51    429.24
<4>irq_dispatch: Unhandled interrupt 74 (4a) [Owner]
<8>(init_task) 1048576   569.99  981.73   272.01   537.56  31581.67
<8>(init_task) 2097152  1070.57   28.95   537.88  1069.66   1779.97
<8>(init_task) 4194304  2135.99   52.86  1070.10  2134.70   3124.28
This error is random and appears in about one of three runs. Note the high max
value for the 1 MB message size; my guess is that this is the connection recovering.
When investigating the error it seems to stem from next_eqe_sw in drivers/net/mlx4/eq.c
called by the interrupt handler.
What happens is that (eqe->owner & 0x80) is true, causing the routine to return
NULL, which results in an unhandled interrupt (i.e. the interrupt routine returns 0).
My understanding is that when the interrupt gets flagged the card would
have given the eqe (event queue entry?) to the software, but it could very well be more complex.
The same message can be seen when starting the driver, but there it does not cause any problems:
<6>mlx4_ib: Mellanox ConnectX InfiniBand driver v1.0 (April 4, 2008)
<4>irq_dispatch: Unhandled interrupt 74 (4a) [Owner]
.... x 16
So far this problem could not be reproduced under Linux.
The Kitten interrupt handler is simple and just forwards the interrupt to the driver.
What does owner in the eqe struct mean? Does hardware or software own the entry?
Has this bug been seen in Linux, even though we were not able to reproduce it there?
Can I get more debug information from the card?
Any tips on what could go wrong in this context? Are we missing some setup?
Sincerely,
Fredrik Unger
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
* Re: kitten - mlx4: Unhandled interrupt - owner bit
From: Eli Cohen @ 2010-03-10 16:35 UTC
To: Fredrik Unger; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA
On Wed, Mar 10, 2010 at 04:03:26PM +0100, Fredrik Unger wrote:
>
> When investigating the error it seems to stem from next_eqe_sw in drivers/net/mlx4/eq.c
> called by the interrupt handler.
> What happens is that (eqe->owner & 0x80) is true causing the routine to return
> NULL resulting in an unhandled interrupt (eg the interrupt routine returns 0)
Please note that the condition is a bit more complicated. I quote the
whole function:
static struct mlx4_eqe *next_eqe_sw(struct mlx4_eq *eq)
{
        struct mlx4_eqe *eqe = get_eqe(eq, eq->cons_index);
        return !!(eqe->owner & 0x80) ^ !!(eq->cons_index & eq->nent) ? NULL : eqe;
}
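
To make the condition easier to read, the test can be pulled apart into two named booleans. The sketch below is not the driver code: `model_eqe`, `model_eq` and the helper names are hypothetical stand-ins, and indexing the ring by `cons_index & (nent - 1)` is an assumption about what get_eqe does for a power-of-two queue.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the driver structures; only the fields
 * used by the ownership test are modelled. */
struct model_eqe {
    uint8_t owner;              /* bit 7 (0x80) is the owner bit */
};

struct model_eq {
    struct model_eqe *ring;     /* nent entries */
    uint32_t nent;              /* entry count, assumed to be a power of two */
    uint32_t cons_index;        /* free-running consumer index */
};

/* First operand of the XOR: is the owner bit set in the entry? */
static bool eqe_owner_bit(const struct model_eqe *eqe)
{
    return (eqe->owner & 0x80) != 0;
}

/* Second operand: the bit of cons_index just above the ring index bits.
 * It flips every time cons_index advances by nent entries. */
static bool eq_pass_parity(const struct model_eq *eq)
{
    return (eq->cons_index & eq->nent) != 0;
}

/* Same decision as the quoted function: NULL when the two bits differ,
 * the entry when they are equal. */
static struct model_eqe *model_next_eqe_sw(struct model_eq *eq)
{
    assert(eq->nent != 0 && (eq->nent & (eq->nent - 1)) == 0);

    /* Assumed ring lookup: low bits of cons_index select the slot. */
    struct model_eqe *eqe = &eq->ring[eq->cons_index & (eq->nent - 1)];

    return eqe_owner_bit(eqe) != eq_pass_parity(eq) ? NULL : eqe;
}
```

Written this way, `!!a ^ !!b` reads as "a and b differ", so the function returns the entry only when the owner bit matches the pass bit of cons_index.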
* Re: kitten - mlx4: Unhandled interrupt - owner bit
From: Fredrik Unger @ 2010-03-10 19:39 UTC
To: Eli Cohen; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA
Eli Cohen wrote:
> On Wed, Mar 10, 2010 at 04:03:26PM +0100, Fredrik Unger wrote:
>> When investigating the error it seems to stem from next_eqe_sw in drivers/net/mlx4/eq.c
>> called by the interrupt handler.
>> What happens is that (eqe->owner & 0x80) is true causing the routine to return
>> NULL resulting in an unhandled interrupt (eg the interrupt routine returns 0)
>
> Please note that the condition is a bit more complicated. I quote the
> whole function:
>
> static struct mlx4_eqe *next_eqe_sw(struct mlx4_eq *eq)
> {
>         struct mlx4_eqe *eqe = get_eqe(eq, eq->cons_index);
>         return !!(eqe->owner & 0x80) ^ !!(eq->cons_index & eq->nent) ? NULL : eqe;
> }
Yes, you are correct.
To clarify, I checked each of the expressions separately, and from what I
could gather
(eqe->owner & 0x80) was true and
(eq->cons_index & eq->nent) was false.
But you are right: as I am not sure what each expression means,
I do not know whether both should be false or both true for the EQE to be
returned. I will check cons_index more closely.
Where could I find out more about owner and cons_index / nent?
Thank you,
Fredrik Unger
* Re: kitten - mlx4: Unhandled interrupt - owner bit
From: Roland Dreier @ 2010-03-10 20:00 UTC
To: Fredrik Unger; +Cc: Eli Cohen, linux-rdma-u79uwXL29TY76Z2rM5mHXA
> To clarify I checked each of the statements separately and from what I
> could gather
> (eqe->owner & 0x80) was true and
> (eq->cons_index & eq->nent) false.
> But true! As I am not sure what each statement hides,
> I do not know if both should be false or true for the eqe to be
> returned. Will try to check the cons_index closer.
As the '^' (XOR) implies, they should be the same for the EQE to be
returned.
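
A small simulation can illustrate why comparing the two bits works as an ownership test. It rests on a model that this thread does not confirm: the producer stamps each entry with the parity of its current pass over the ring, and every entry starts with the owner bit set. All `sim_` names and the queue size are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SIM_NENT 4u  /* hypothetical queue size, must be a power of two */

struct sim_queue {
    uint8_t owner[SIM_NENT]; /* owner byte per entry; bit 7 is the owner bit */
    uint32_t prod_index;     /* free-running producer ("hardware") index */
    uint32_t cons_index;     /* free-running consumer ("software") index */
};

/* Assumed initial state: every owner bit set, so nothing looks valid
 * while cons_index is still on its first pass. */
static void sim_init(struct sim_queue *q)
{
    for (uint32_t i = 0; i < SIM_NENT; i++)
        q->owner[i] = 0x80;
    q->prod_index = 0;
    q->cons_index = 0;
}

/* Model producer: writes one entry, stamping it with the parity of the
 * pass it is on (0 on even passes, 0x80 on odd passes). */
static void sim_produce(struct sim_queue *q)
{
    assert(q->prod_index - q->cons_index < SIM_NENT); /* no overrun */
    q->owner[q->prod_index & (SIM_NENT - 1)] =
        (q->prod_index & SIM_NENT) ? 0x80 : 0x00;
    q->prod_index++;
}

/* Consumer test, same shape as next_eqe_sw: the entry is ready when the
 * owner bit equals the pass bit of cons_index. */
static bool sim_entry_ready(const struct sim_queue *q)
{
    bool owner = (q->owner[q->cons_index & (SIM_NENT - 1)] & 0x80) != 0;
    bool parity = (q->cons_index & SIM_NENT) != 0;
    return owner == parity;
}

/* Consume one entry if it is ready; returns true when one was consumed. */
static bool sim_consume(struct sim_queue *q)
{
    if (!sim_entry_ready(q))
        return false;
    q->cons_index++;
    return true;
}
```

In this model, the combination reported above (owner bit set, `cons_index & nent` clear) is the state of an entry that the producer has not yet written on the current pass, which is why the consumer refuses it.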
> Where could I find out more about owner and cons_index / nent ?
The ConnectX programmer's reference manual is what you need.
--
Roland Dreier <rolandd-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org>
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html