* [lustre-devel] Channel Bonding Debug Information
@ 2015-09-28 19:30 Amir Shehata
2015-09-29 21:10 ` Ashish Purkar
` (2 more replies)
0 siblings, 3 replies; 4+ messages in thread
From: Amir Shehata @ 2015-09-28 19:30 UTC (permalink / raw)
To: lustre-devel
Hello,
As a follow-up to the discussion at the LAD developer summit, regarding
ensuring that there is enough debug information provided as part of the
Channel Bonding solution, I'm sending this email to ask for ideas on what
type of debug information you would like to see.
thanks
amir
* [lustre-devel] Channel Bonding Debug Information
2015-09-28 19:30 [lustre-devel] Channel Bonding Debug Information Amir Shehata
@ 2015-09-29 21:10 ` Ashish Purkar
2015-10-01 14:31 ` Olaf Weber
2015-10-02 15:03 ` DEGREMONT Aurelien
2 siblings, 0 replies; 4+ messages in thread
From: Ashish Purkar @ 2015-09-29 21:10 UTC (permalink / raw)
To: lustre-devel
It would be useful to have Rx and Tx stats counters available for the NIDs
participating in channel bonding, along with SG list usage (Rx, Tx,
command/status ring descriptors), to debug performance problems seen with
LACP bonding.
On Sep 29, 2015 1:00 AM, "Amir Shehata" <amir.shehata.whamcloud@gmail.com>
wrote:
> Hello,
>
> As a followup on the discussion in the LAD developer summit, regarding
> ensuring that there is enough debug information provided as part of the
> Channel Bonding solution, I'm sending this email to ask for ideas on what
> type of debug information you would like to see.
>
> thanks
> amir
* [lustre-devel] Channel Bonding Debug Information
2015-09-28 19:30 [lustre-devel] Channel Bonding Debug Information Amir Shehata
2015-09-29 21:10 ` Ashish Purkar
@ 2015-10-01 14:31 ` Olaf Weber
2015-10-02 15:03 ` DEGREMONT Aurelien
2 siblings, 0 replies; 4+ messages in thread
From: Olaf Weber @ 2015-10-01 14:31 UTC (permalink / raw)
To: lustre-devel
On 28-09-15 21:30, Amir Shehata wrote:
> Hello,
>
> As a followup on the discussion in the LAD developer summit, regarding
> ensuring that there is enough debug information provided as part of the
> Channel Bonding solution, I'm sending this email to ask for ideas on what
> type of debug information you would like to see.
>
> thanks
> amir
Hi Amir,
My random and disorganized thoughts.
Significant state changes and anything unexpected should of course be logged.
In addition I'd like interfaces that allow me to efficiently get the
status/stats of a specific network interface or a specific peer, as opposed
to only being able to get the information for all interfaces or peers and
then having to filter. That may imply an ioctl type interface instead of or
in addition to debugfs or sysfs (or procfs).
For the local interfaces, stats include TX/RX counters, credits, interface
state, and some measure of how busy the interface is. The latter can be
derived by watching the TX/RX counters over time, but it would be nice to
have it calculated. A variant on the "File Heat" idea presented at LAD might
work for this. (Think decaying sum over recent activity.) When interfaces
are associated with CPTs, the CPT number should be reported as well. This is
especially important if the kernel automatically associates an interface
with a CPT.
For the peers, a way to obtain the list of peers, and then to obtain the
interfaces for each peer. Stats per peer interface include TX/RX counters
and credits, perceived health, and maybe "heat". For a peer itself possibly
totals, and peer health as perceived by the current node.
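As a rough illustration of the peer and peer-interface bookkeeping described above, here is a minimal Python sketch. It is not LNet code: the class names, field names, and the rule that a peer counts as healthy when any one of its interfaces is healthy are all assumptions made for the example. The point is only that indexing by NID lets a tool fetch the stats for one peer interface directly instead of dumping everything and filtering.

```python
from dataclasses import dataclass, field


@dataclass
class InterfaceStats:
    """Counters for one peer interface (field names are illustrative)."""
    nid: str
    tx_msgs: int = 0
    rx_msgs: int = 0
    credits: int = 0
    healthy: bool = True


@dataclass
class Peer:
    """A peer and the interfaces through which it can be reached."""
    name: str
    interfaces: dict = field(default_factory=dict)  # NID -> InterfaceStats

    def totals(self):
        """Sum the TX/RX counters over all interfaces of this peer."""
        tx = sum(i.tx_msgs for i in self.interfaces.values())
        rx = sum(i.rx_msgs for i in self.interfaces.values())
        return tx, rx

    def is_healthy(self):
        """Assumed policy: the peer is healthy if any interface is healthy."""
        return any(i.healthy for i in self.interfaces.values())


class PeerTable:
    """Index peers by name and by NID so single entries can be fetched."""

    def __init__(self):
        self._by_name = {}
        self._by_nid = {}

    def add_interface(self, peer_name, nid, credits=0):
        if peer_name not in self._by_name:
            self._by_name[peer_name] = Peer(peer_name)
        stats = InterfaceStats(nid=nid, credits=credits)
        self._by_name[peer_name].interfaces[nid] = stats
        self._by_nid[nid] = stats
        return stats

    def peers(self):
        """List all known peers."""
        return sorted(self._by_name)

    def peer(self, peer_name):
        return self._by_name[peer_name]

    def interfaces_of(self, peer_name):
        """List the NIDs of one peer."""
        return sorted(self._by_name[peer_name].interfaces)

    def interface_stats(self, nid):
        """Fetch one peer interface without walking the whole table."""
        return self._by_nid[nid]
```

A monitoring tool would call `peers()`, then `interfaces_of()` for a peer of interest, and `interface_stats()` for a single NID; the NIDs used to populate the table are whatever the node has discovered.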
A note on calculating heat: the full list of peer interfaces becomes large
(on the servers of a large cluster) and you don't want to walk it without
needing to. If you store a timestamp for the last use, then heat can be
calculated when the TX/RX counters are updated or read, which is when the
relevant data structure is being accessed anyway.
For local interfaces the list is likely small enough that this kind of
approach isn't worth it. Moreover the list of local interfaces might be
regularly walked to check on health etc.
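The lazy heat calculation for peer interfaces described above could look roughly like the following Python sketch. It assumes an exponential decay with a configurable half-life, which is only one possible reading of "decaying sum over recent activity"; the class name, the `half_life` parameter, and the explicit `now` argument (used here so the example is deterministic) are all assumptions. The stored value is aged only when the entry is touched, so no periodic walk over the full peer list is needed.

```python
import math


class Heat:
    """Exponentially decaying activity sum, aged lazily on access."""

    def __init__(self, half_life=10.0):
        # Decay constant chosen so the value halves every half_life seconds.
        self.decay = math.log(2.0) / half_life
        self.value = 0.0
        self.last_used = 0.0  # timestamp of the last update

    def _decayed(self, now):
        """Return the stored value aged from last_used to now."""
        elapsed = max(0.0, now - self.last_used)
        return self.value * math.exp(-self.decay * elapsed)

    def add(self, amount, now):
        """Account for new activity, e.g. when a TX/RX counter is bumped."""
        self.value = self._decayed(now) + amount
        self.last_used = now

    def read(self, now):
        """Report the current heat without modifying the stored state."""
        return self._decayed(now)
```

With a half-life of 10 seconds, 100 units of activity recorded at time 0 read back as 50 at time 10. Real code would take `now` from a monotonic clock rather than from the caller.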
Olaf
--
Olaf Weber SGI Phone: +31(0)30-6696796
Veldzigt 2b Fax: +31(0)30-6696799
Sr Software Engineer 3454 PW de Meern Vnet: 955-6796
Storage Software The Netherlands Email: olaf at sgi.com
* [lustre-devel] Channel Bonding Debug Information
2015-09-28 19:30 [lustre-devel] Channel Bonding Debug Information Amir Shehata
2015-09-29 21:10 ` Ashish Purkar
2015-10-01 14:31 ` Olaf Weber
@ 2015-10-02 15:03 ` DEGREMONT Aurelien
2 siblings, 0 replies; 4+ messages in thread
From: DEGREMONT Aurelien @ 2015-10-02 15:03 UTC (permalink / raw)
To: lustre-devel
Hi
As discussed at the last Developer Summit, my concern is about transparent
interface switching that happens without the upper layers knowing about it.
I'm not going to go into interface details, since others have already covered
that. I am thinking about error messages and admins who are not
Lustre experts.
This is a typical timeout error message you can get on a Lustre
client. You can see a Lustre target (here MDT0000) and a NID, in particular
an IP address.
[4863147.960698] Lustre:
25163:0:(client.c:1939:ptlrpc_expire_one_request()) @@@ Request sent has
timed out for slow reply: [sent 1443794470/real 1443794470]
req@ffff880612a00c00 x1509752994606324/t0(0)
o38->lustre-MDT0000-mdc-ffff88062dea2000@10.2.10.13@o2ib:12/10 lens
400/544 e 0 to 1 dl 1443794476 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
If this error is due to LNET taking another link, on either the client side
or the server side, and this link is sick/flaky/buggy, ... *this should not
be silent*! Ideally the NID in this error message should be updated to
reflect the route change.
I do not have a strong opinion on how this error should be reported,
but I want to avoid the case where the network error is reported only in a
debug message and this error message is displayed as-is, without any
hint that LNET did some magic stuff that failed.
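To make the request concrete, here is a small Python sketch of a message builder that names the NID actually used whenever it differs from the one the upper layer asked for. This is purely illustrative: the function name, its parameters, and the message wording are hypothetical and do not reflect the existing Lustre log format or any LNet API.

```python
def describe_timeout(target, nominal_nid, used_nid, link_error=None):
    """Build a timeout message that does not hide an interface switch.

    nominal_nid is the NID the upper layer addressed; used_nid is the
    NID the message was really sent to (hypothetical inputs).
    """
    msg = f"Request to {target} timed out (NID {nominal_nid})"
    if used_nid != nominal_nid:
        # Make the switch visible to admins reading the console log.
        msg += f"; sent via {used_nid}"
    if link_error:
        msg += f"; link error: {link_error}"
    return msg
```

When both NIDs match and no link error is known, the message stays as short as it is today; the extra detail only shows up when a different link was involved.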
Aurélien
On 28/09/2015 21:30, Amir Shehata wrote:
> Hello,
>
> As a followup on the discussion in the LAD developer summit, regarding
> ensuring that there is enough debug information provided as part of
> the Channel Bonding solution, I'm sending this email to ask for ideas
> on what type of debug information you would like to see.
>
> thanks
> amir