* [Lustre-devel] lustre client goes wacky?
@ 2008-02-13 0:50 Ron
2008-02-13 16:41 ` Nathaniel Rutman
0 siblings, 1 reply; 2+ messages in thread
From: Ron @ 2008-02-13 0:50 UTC (permalink / raw)
To: lustre-devel
Hi,
I don't know if this is a bug, a misconfiguration, or something
else.
What I have is:
server = 1.6.4.1+vanilla 2.6.18.8 (mgs+2*ost+mdt all on a single
server)
clients = cvs.20080116+2.6.23.12
I mounted the server from several clients and several hours later
noticed the top display below. dmesg shows some Lustre errors (also
below). Can someone comment on what could be going on?
Thanks,
Ron
top - 18:28:09 up 5 days, 3:36, 1 user, load average: 12.00, 12.00, 11.94
Tasks: 168 total, 13 running, 136 sleeping, 0 stopped, 19 zombie
Cpu(s): 0.0% us, 37.5% sy, 0.0% ni, 62.5% id, 0.0% wa, 0.0% hi, 0.0% si
Mem: 16468196k total, 526828k used, 15941368k free, 42996k buffers
Swap: 4192924k total, 0k used, 4192924k free, 294916k cached

  PID USER PR NI VIRT RES SHR S %CPU %MEM     TIME+ COMMAND
 1533 root 20  0    0   0   0 R  100  0.0 308:54.05 ll_cfg_requeue
32071 root 20  0    0   0   0 R  100  0.0 308:15.95 socknal_reaper
32073 root 20  0    0   0   0 R  100  0.0 308:48.90 ptlrpcd
    1 root 20  0 4832 588 492 R    0  0.0   0:02.48 init
    2 root 15 -5    0   0   0 S    0  0.0   0:00.00 kthreadd
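
The three kernel threads pinned at 100% CPU can be picked out of the top rows mechanically. This is a minimal sketch, assuming the rows have been unwrapped to one line per task and follow the 12-column layout of the header shown above; `TOP_ROWS` and `busy_tasks` are illustrative names, not part of any Lustre or procps tool.

```python
# Task rows copied from the top output above, one task per line
# (header row omitted). Column 9 is %CPU, column 12 is COMMAND.
TOP_ROWS = """\
 1533 root 20  0    0   0   0 R  100  0.0 308:54.05 ll_cfg_requeue
32071 root 20  0    0   0   0 R  100  0.0 308:15.95 socknal_reaper
32073 root 20  0    0   0   0 R  100  0.0 308:48.90 ptlrpcd
    1 root 20  0 4832 588 492 R    0  0.0   0:02.48 init
    2 root 15 -5    0   0   0 S    0  0.0   0:00.00 kthreadd
"""


def busy_tasks(rows, threshold=100.0):
    """Return (pid, command, cpu) for rows whose %CPU is at or above threshold."""
    busy = []
    for line in rows.splitlines():
        fields = line.split()
        if len(fields) < 12:
            continue  # skip blank or truncated rows
        pid, cpu, command = int(fields[0]), float(fields[8]), fields[11]
        if cpu >= threshold:
            busy.append((pid, command, cpu))
    return busy


for pid, command, cpu in busy_tasks(TOP_ROWS):
    print(f"{pid:>6} {command:<16} {cpu:.0f}%")
```

On the rows above this lists ll_cfg_requeue, socknal_reaper and ptlrpcd. Note that PID 32073 (ptlrpcd) is also the thread reporting the ptlrpc_invalidate_import() errors in the dmesg output.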
Lustre: OBD class driver, info@clusterfs.com
Lustre Version: 1.6.4.50
Build Version: b1_6-20080210103536-CHANGED-.usr.src.linux-2.6.23.12-2.6.23.12
Lustre: Added LNI 192.168.241.42@tcp [8/256]
Lustre: Accept secure, port 988
Lustre: Lustre Client File System; info@clusterfs.com
Lustre: Binding irq 17 to CPU 0 with cmd: echo 1 > /proc/irq/17/smp_affinity
Lustre: MGC192.168.241.247@tcp: Reactivating import
Lustre: setting import datafs-OST0002_UUID INACTIVE by administrator request
Lustre: datafs-OST0002-osc-ffff810241ad7800.osc: set parameter active=0
LustreError: 32181:0:(lov_obd.c:230:lov_connect_obd()) not connecting OSC datafs-OST0002_UUID; administratively disabled
Lustre: Client datafs-client has started
Lustre: Request x7684 sent from MGC192.168.241.247@tcp to NID 192.168.241.247@tcp 15s ago has timed out (limit 15s).
LustreError: 166-1: MGC192.168.241.247@tcp: Connection to service MGS via nid 192.168.241.247@tcp was lost; in progress operations using this service will fail.
LustreError: 32073:0:(import.c:212:ptlrpc_invalidate_import()) MGS: rc = -110 waiting for callback (1 != 0)
LustreError: 32073:0:(import.c:216:ptlrpc_invalidate_import()) @@@ still on sending list req@ffff81040fa14600 x7684/t0 o400->MGS@192.168.241.247@tcp:26/25 lens 128/256 e 0 to 11 dl 1202843837 ref 1 fl Rpc:EXN/0/0 rc -4/0
Lustre: Request x7685 sent from datafs-MDT0000-mdc-ffff810241ad7800 to NID 192.168.241.247@tcp 115s ago has timed out (limit 15s).
Lustre: datafs-MDT0000-mdc-ffff810241ad7800: Connection to service datafs-MDT0000 via nid 192.168.241.247@tcp was lost; in progress operations using this service will wait for recovery to complete.
Lustre: MGC192.168.241.247@tcp: Reactivating import
Lustre: MGC192.168.241.247@tcp: Connection restored to service MGS using nid 192.168.241.247@tcp.
LustreError: 32059:0:(events.c:116:reply_in_callback()) ASSERTION(ev->mlength == lustre_msg_early_size()) failed
LustreError: 32059:0:(tracefile.c:432:libcfs_assertion_failed()) LBUG

Call Trace:
[<ffffffff88000b53>] :libcfs:lbug_with_loc+0x73/0xc0
[<ffffffff88007bd4>] :libcfs:libcfs_assertion_failed+0x54/0x60
[<ffffffff8815c746>] :ptlrpc:reply_in_callback+0x426/0x430
[<ffffffff88027f35>] :lnet:lnet_enq_event_locked+0xc5/0xf0
[<ffffffff88028475>] :lnet:lnet_finalize+0x1e5/0x270
[<ffffffff880625d9>] :ksocklnd:ksocknal_process_receive+0x469/0xab0
[<ffffffff88060350>] :ksocklnd:ksocknal_tx_done+0x80/0x1e0
[<ffffffff8806301c>] :ksocklnd:ksocknal_scheduler+0x12c/0x7e0
[<ffffffff8024e850>] autoremove_wake_function+0x0/0x30
[<ffffffff8024e850>] autoremove_wake_function+0x0/0x30
[<ffffffff8020c918>] child_rip+0xa/0x12
[<ffffffff88062ef0>] :ksocklnd:ksocknal_scheduler+0x0/0x7e0
[<ffffffff8020c90e>] child_rip+0x0/0x12

LustreError: dumping log to /tmp/lustre-log.1202843942.32059
Lustre: Request x7707 sent from MGC192.168.241.247@tcp to NID 192.168.241.247@tcp 15s ago has timed out (limit 15s).
Lustre: Skipped 2 previous similar messages
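
For anyone reading a trace like this one, two details in the messages above can be decoded directly. The `rc` values follow the kernel convention of negative errno numbers, and the dump file name appears to end in an epoch timestamp followed by the PID of the thread that hit the LBUG (32059 here). The sketch below is illustrative only: `LINUX_ERRNO` and `parse_dump_name` are hypothetical names, the table covers only the two codes seen in this log, and the `<epoch>.<pid>` naming is an assumption inferred from this one path.

```python
import re
from datetime import datetime, timezone

# Linux errno numbers for the two return codes seen in the log above.
LINUX_ERRNO = {
    4: ("EINTR", "Interrupted system call"),
    110: ("ETIMEDOUT", "Connection timed out"),
}

for rc in (-110, -4):
    name, text = LINUX_ERRNO[-rc]
    print(f"rc = {rc}: {name} ({text})")


def parse_dump_name(path):
    """Split a dump path assumed to end in .<epoch seconds>.<pid>."""
    match = re.search(r"\.(\d+)\.(\d+)$", path)
    if match is None:
        raise ValueError(f"unexpected dump file name: {path}")
    stamp = datetime.fromtimestamp(int(match.group(1)), tz=timezone.utc)
    return stamp, int(match.group(2))


stamp, pid = parse_dump_name("/tmp/lustre-log.1202843942.32059")
print(stamp.isoformat(), pid)
```

Under that assumption the dump was written at 2008-02-12 19:19:02 UTC. If the `dl` field in the x7684 request line (1202843837) is likewise an epoch deadline, it falls 105 seconds before the dump.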
From: Nathaniel Rutman @ 2008-02-13 16:41 UTC (permalink / raw)
To: lustre-devel
The clients you pulled from CVS include a feature called adaptive
timeouts, which apparently has a problem talking to your 1.6.4.1
servers. Eric, can you make sure our interoperability is working?

Moving this thread to lustre-discuss; devel is more for
architecture/coding stuff.