[Drbd-dev] Problem with DRBD0.7 on Debian Sarge.

* [Drbd-dev] Problem with DRBD0.7 on Debian Sarge.
@ 2005-12-20 14:49 Szymon Madej
  2005-12-20 15:43 ` Lars Ellenberg
  0 siblings, 1 reply; 4+ messages in thread
From: Szymon Madej @ 2005-12-20 14:49 UTC (permalink / raw)
  To: drbd-dev

Hello!

I ran into a strange situation at work today. I was rebooting the secondary
node of an HA HeartBeat cluster, which uses DRBD to replicate data, after
recompiling its kernel. The old kernel lacked High Memory Support. I
recompiled the kernel, installed it, then recompiled the DRBD module for
this kernel and installed that too. Then I ran lilo to write the new boot
sector and rebooted. Before the reboot, the primary node had consistent data
on both DRBD devices that I'm using: drbd0 and drbd1. After the secondary
rebooted into the new kernel, and DRBD was loaded and connected to the
primary node, I got these kernel messages (timestamps and machine name cut):

kernel: drbd: initialised. Version: 0.7.10 (api:77/proto:74)
kernel: drbd: SVN Revision: 1743 build by root@XXXXXXXX, 2005-09-07 15:31:27
kernel: drbd: registered as block device major 147
kernel: e1000: eth0: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex
kernel: e1000: eth1: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex
kernel: drbd0: resync bitmap: bits=2979411 words=93108
kernel: drbd0: size = 11 GB (11917644 KB)
kernel: drbd0: 0 KB marked out-of-sync by on disk bit-map.
kernel: drbd0: Found 3 transactions (5 active extents) in activity log.
kernel: drbd0: drbdsetup [668]: cstate Unconfigured --> StandAlone
kernel: drbd1: resync bitmap: bits=3180224 words=99382
kernel: drbd1: size = 12 GB (12720896 KB)
kernel: drbd1: 0 KB marked out-of-sync by on disk bit-map.
kernel: drbd1: Found 4 transactions (157 active extents) in activity log.
kernel: drbd1: drbdsetup [672]: cstate Unconfigured --> StandAlone
kernel: drbd0: drbdsetup [690]: cstate StandAlone --> Unconnected
kernel: drbd0: drbd0_receiver [691]: cstate Unconnected --> WFConnection
kernel: drbd1: drbdsetup [698]: cstate StandAlone --> Unconnected
kernel: drbd1: drbd1_receiver [699]: cstate Unconnected --> WFConnection
kernel: drbd0: drbd0_receiver [691]: cstate WFConnection --> WFReportParams
kernel: drbd0: Handshake successful: DRBD Network Protocol version 74
kernel: drbd0: Connection established.
kernel: drbd0: I am(S): 1:00000002:00000001:0000000c:00000001:01
kernel: drbd0: Peer(P): 1:00000002:00000001:0000000d:00000001:10
kernel: drbd0: drbd0_receiver [691]: cstate WFReportParams --> WFBitMapT
kernel: drbd0: Secondary/Unknown --> Secondary/Primary
kernel: drbd1: drbd1_receiver [699]: cstate WFConnection --> WFReportParams
kernel: drbd1: Handshake successful: DRBD Network Protocol version 74
kernel: drbd1: Connection established.
kernel: drbd1: I am(S): 1:00000002:00000001:0000000d:00000002:01
kernel: drbd1: Peer(P): 1:00000002:00000001:0000000e:00000002:10
kernel: drbd1: drbd1_receiver [699]: cstate WFReportParams --> WFBitMapT
kernel: drbd1: Secondary/Unknown --> Secondary/Primary
kernel: drbd1: drbd1_receiver [699]: cstate WFBitMapT --> SyncTarget
kernel: drbd1: Resync started as SyncTarget (need to sync 5268 KB [1317 bits set]).
kernel: drbd0: drbd0_receiver [691]: cstate WFBitMapT --> SyncTarget
kernel: drbd0: Resync started as SyncTarget (need to sync 0 KB [0 bits set]).
kernel: drbd0: Resync done (total 1 sec; paused 0 sec; 0 K/sec)
kernel: drbd1: sock_recvmsg returned -14
kernel: drbd1: drbd1_receiver [699]: cstate SyncTarget --> BrokenPipe
kernel: drbd1: short read receiving data block: read -14 expected 4096
kernel: drbd1: error receiving RSDataReply, l: 4112!
kernel: drbd1: ASSERT( mdev->resync_work.cb == w_resync_inactive ) in /usr/src/modules/drbd/drbd/drbd_receiver.c:1773
kernel: drbd1: worker terminated
kernel: drbd1: asender terminated
kernel: drbd0: drbd0_receiver [691]: cstate SyncTarget --> Connected
kernel: drbd1: drbd1_receiver [699]: cstate BrokenPipe --> Unconnected
kernel: drbd1: Connection lost.
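
A note for readers cross-checking the numbers in this log: they are
consistent with one resync bitmap bit covering 4 KB of the device
(2979411 bits for 11917644 KB, and 1317 set bits for 5268 KB). The sketch
below only redoes that arithmetic. The 4 KB granularity is inferred from
the log values, not taken from the DRBD source, and the function names are
made up for illustration.

```python
# Assumption inferred from the log lines above: one bitmap bit per 4 KB.
KB_PER_BIT = 4


def bitmap_bits(size_kb):
    """Bits needed to cover size_kb at one bit per 4 KB (rounding up is my choice)."""
    return -(-size_kb // KB_PER_BIT)


def out_of_sync_kb(bits_set):
    """Amount of data that the given number of set bits marks for resync."""
    return bits_set * KB_PER_BIT


if __name__ == "__main__":
    # Device sizes as reported in the log.
    for name, size_kb in (("drbd0", 11917644), ("drbd1", 12720896)):
        print(name, "bits =", bitmap_bits(size_kb))
    print("drbd1 to sync:", out_of_sync_kb(1317), "KB")
```

So the 5268 KB that drbd1 wanted to resync is a small amount of data; the
failure happened while receiving it, not because of its size.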

At the same moment, the logs on the primary node contain:

kernel: e1000: eth1: e1000_watchdog: NIC Link is Down
kernel: e1000: eth1: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex
kernel: drbd0: drbd0_receiver [884]: cstate WFConnection --> WFReportParams
kernel: drbd0: Handshake successful: DRBD Network Protocol version 74
kernel: drbd0: Connection established.
kernel: drbd0: I am(P): 1:00000002:00000001:0000000d:00000001:10
kernel: drbd0: Peer(S): 1:00000002:00000001:0000000c:00000001:01
kernel: drbd0: drbd0_receiver [884]: cstate WFReportParams --> WFBitMapS
kernel: drbd1: drbd1_receiver [892]: cstate WFConnection --> WFReportParams
kernel: drbd0: Primary/Unknown --> Primary/Secondary
kernel: drbd1: Handshake successful: DRBD Network Protocol version 74
kernel: drbd1: Connection established.
kernel: drbd1: I am(P): 1:00000002:00000001:0000000e:00000002:10
kernel: drbd1: Peer(S): 1:00000002:00000001:0000000d:00000002:01
kernel: drbd1: drbd1_receiver [892]: cstate WFReportParams --> WFBitMapS
kernel: drbd1: Primary/Unknown --> Primary/Secondary
kernel: drbd0: drbd0_receiver [884]: cstate WFBitMapS --> SyncSource
kernel: drbd0: Resync started as SyncSource (need to sync 0 KB [0 bits set]).
kernel: drbd0: Resync done (total 1 sec; paused 0 sec; 0 K/sec)
kernel: drbd0: drbd0_receiver [884]: cstate SyncSource --> Connected
kernel: drbd1: drbd1_receiver [892]: cstate WFBitMapS --> SyncSource
kernel: drbd1: Resync started as SyncSource (need to sync 5268 KB [1317 bits set]).
kernel: drbd1: meta connection shut down by peer.
kernel: drbd1: drbd1_asender [29409]: cstate SyncSource --> NetworkFailure
kernel: drbd1: asender terminated
kernel: drbd1: drbd1_receiver [892]: cstate NetworkFailure --> BrokenPipe
kernel: drbd1: _drbd_send_page: size=4096 len=2640 sent=-104
kernel: drbd1: drbd_send_block() failed
kernel: drbd1: short read expecting header on sock: r=-512
kernel: drbd1: worker terminated
kernel: drbd1: drbd1_receiver [892]: cstate BrokenPipe --> Unconnected
kernel: drbd1: Connection lost.
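
For reference, the negative numbers in both logs (-14, -104, -512) look
like negated Linux error codes. A minimal sketch for decoding them follows.
The table is hard-coded with the Linux values so it behaves the same on any
platform, and the decode() helper is a hypothetical name, not part of DRBD.
It names the codes only; it does not explain why the kernel returned them.

```python
# Linux error numbers that appear (negated) in the logs above.
LINUX_ERRNO = {
    14: ("EFAULT", "bad address"),
    104: ("ECONNRESET", "connection reset by peer"),
    # 512 is kernel-internal and never reaches user space.
    512: ("ERESTARTSYS", "interrupted by a signal, call should be restarted"),
}


def decode(ret):
    """Turn a kernel-style return value into a readable string."""
    if ret >= 0:
        return f"{ret}: not an error"
    name, text = LINUX_ERRNO.get(-ret, ("unknown", "not in this table"))
    return f"{ret}: {name} ({text})"


if __name__ == "__main__":
    for value in (-14, -104, -512):
        print(decode(value))
```

Read this way, the secondary saw EFAULT from sock_recvmsg while receiving a
resync block, and the primary then saw the connection being reset.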

After that, DRBD on both nodes went into an infinite loop, trying to sync.
Both nodes are identical machines running Debian Sarge with a 2.6.8
kernel. The DRBD module is compiled and installed from the Debian source
package, version 0.7.10. eth0 is the primary network device; the eth1
interfaces are connected to each other with a crossover cable and are used
only for DRBD synchronization and HeartBeat. Both eth0 and eth1 are Intel
gigabit cards using the e1000 driver. The only change I made to the kernel
was to turn on High Memory Support.

Any ideas what happened here? I'm worried about the consistency of my
data, because this cluster holds very important data for the company.

Thanks in advance
Szymon Madej

* Re: [Drbd-dev] Problem with DRBD0.7 on Debian Sarge.
  2005-12-20 14:49 [Drbd-dev] Problem with DRBD0.7 on Debian Sarge Szymon Madej
@ 2005-12-20 15:43 ` Lars Ellenberg
  2005-12-21  8:11   ` Szymon Madej
  0 siblings, 1 reply; 4+ messages in thread
From: Lars Ellenberg @ 2005-12-20 15:43 UTC (permalink / raw)
  To: drbd-dev

/ 2005-12-20 15:49:26 +0100
\ Szymon Madej:
> Hello!
> 
> I ran into a strange situation at work today. I was rebooting the secondary
> node of an HA HeartBeat cluster, which uses DRBD to replicate data, after
> recompiling its kernel. The old kernel lacked High Memory Support. I
> recompiled the kernel, installed it, then recompiled the DRBD module for
> this kernel and installed that too. Then I ran lilo to write the new boot
> sector and rebooted. Before the reboot, the primary node had consistent data
> on both DRBD devices that I'm using: drbd0 and drbd1. After the secondary
> rebooted into the new kernel, and DRBD was loaded and connected to the
> primary node, I got these kernel messages (timestamps and machine name cut):

> kernel: drbd0: Resync done (total 1 sec; paused 0 sec; 0 K/sec)
> kernel: drbd1: sock_recvmsg returned -14
> kernel: drbd1: drbd1_receiver [699]: cstate SyncTarget --> BrokenPipe
> kernel: drbd1: short read receiving data block: read -14 expected 4096
> kernel: drbd1: error receiving RSDataReply, l: 4112!

you probably hit the bug which was fixed in 0.7.12:
 * Fixed a connection flip-flop bug when the two peers used different
    user provided sizes.

to verify this, first, do "drbdadm disconnect <bad_resource>".
then "drbdsetup /dev/drbdX show", as well as "cat /proc/partitions",
on both nodes.  compare the results.
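
The comparison step can be sketched in a few lines, assuming the output of
"cat /proc/partitions" from each node has been saved as text. The function
names and the sample values below are hypothetical; only the four-column
layout (major, minor, #blocks, name) is taken from /proc/partitions itself.

```python
def parse_partitions(text):
    """Map device name -> size in 1 KB blocks from /proc/partitions text."""
    sizes = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows have exactly: major minor #blocks name. Skip the rest.
        if len(fields) == 4 and all(f.isdigit() for f in fields[:3]):
            sizes[fields[3]] = int(fields[2])
    return sizes


def size_mismatches(local_text, peer_text):
    """Return {name: (local_size, peer_size)} for devices whose sizes differ."""
    local, peer = parse_partitions(local_text), parse_partitions(peer_text)
    names = sorted(local.keys() | peer.keys())
    return {n: (local.get(n), peer.get(n))
            for n in names if local.get(n) != peer.get(n)}


if __name__ == "__main__":
    # Hypothetical sample data, not taken from any real node.
    node_a = "major minor  #blocks  name\n\n 147 0 1000 drbd0\n 147 1 2000 drbd1\n"
    node_b = "major minor  #blocks  name\n\n 147 0 1000 drbd0\n 147 1 2048 drbd1\n"
    print(size_mismatches(node_a, node_b))
```

An empty result means every device listed on both nodes has the same size;
a device missing on one side shows up with None for that side.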

the solution is probably to either make sure (using some --size
parameter if possible) that your devices are of the very same size,
or upgrade to 0.7.15, which should fix the problem.

-- 
: Lars Ellenberg                                  Tel +43-1-8178292-0  :
: LINBIT Information Technologies GmbH            Fax +43-1-8178292-82 :
: Schoenbrunner Str. 244, A-1120 Vienna/Europe   http://www.linbit.com :

* Re: [Drbd-dev] Problem with DRBD0.7 on Debian Sarge.
  2005-12-20 15:43 ` Lars Ellenberg
@ 2005-12-21  8:11   ` Szymon Madej
  2005-12-21  8:56     ` Lars Ellenberg
  0 siblings, 1 reply; 4+ messages in thread
From: Szymon Madej @ 2005-12-21  8:11 UTC (permalink / raw)
  To: drbd-dev

Thanks for the fast answer.

>>kernel: drbd0: Resync done (total 1 sec; paused 0 sec; 0 K/sec)
>>kernel: drbd1: sock_recvmsg returned -14
>>kernel: drbd1: drbd1_receiver [699]: cstate SyncTarget --> BrokenPipe
>>kernel: drbd1: short read receiving data block: read -14 expected 4096
>>kernel: drbd1: error receiving RSDataReply, l: 4112!
>>    
>>
>
>you probably hit the bug which was fixed in 0.7.12:
> * Fixed a connection flip-flop bug when the two peers used different
>    user provided sizes.
>
>to verify this, first, do "drbdadm disconnect <bad_resource>".
>then "drbdsetup /dev/drbdX show", as well as "cat /proc/partitions",
>on both nodes.  compare the results.
>
>  
>

And this is the second strange thing. The device sizes are identical on
both nodes:
primary_node# cat /proc/partitions
...
   8     8   12048718 sda8
   8     9   12851968 sda9
   8    10    1004031 sda10
 147     0   11917644 drbd0
 147     1   12720896 drbd1

secondary_node# cat /proc/partitions
...
   8     8   12048718 sda8
   8     9   12851968 sda9
   8    10    1004031 sda10
 147     0   11917644 drbd0
 147     1   12720896 drbd1

where drbd0 is built over sda8, drbd1 is built over sda9, sda10 is swap
and sda1-7 are system partitions (/ /usr /home etc.). Is there any
chance that this error could really happen?
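
A side note on the listing above: each DRBD device is about 128 MB smaller
than its backing partition. That would match a fixed 128 MB internal
meta-data area at the end of the backing device, which is how I understand
DRBD 0.7 to work; this thread does not state it, so treat it as an
assumption. The 4 KB round-down is likewise inferred from the numbers. A
sketch of the arithmetic, with made-up names:

```python
META_KB = 128 * 1024  # assumed size of the internal meta-data area, in KB
ALIGN_KB = 4          # assumed alignment, inferred from the listing


def expected_drbd_kb(backing_kb):
    """Usable size if the backing device is aligned down and meta-data removed."""
    aligned = backing_kb - backing_kb % ALIGN_KB
    return aligned - META_KB


if __name__ == "__main__":
    # Backing partition sizes from the listing (sda8 and sda9).
    for name, backing in (("sda8", 12048718), ("sda9", 12851968)):
        print(name, "->", expected_drbd_kb(backing), "KB")
```

Under those assumptions both devices come out at exactly the sizes shown
for drbd0 and drbd1, on both nodes.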

And another thing: when the secondary went into the infinite loop trying to
get drbd1 in sync (every attempt ended with NetworkFailure and BrokenPipe),
drbd1, mounted on the primary as /data, hung on a listing with "ls -la". The
quick and brutal solution was to disconnect the cross link between the
machines on eth1 (used by DRBD), reboot both nodes, and then reconnect
them... but this is not a good method for getting an HA cluster back into
action, is it? :-)

>the solution is probably to either make sure (using some --size
>parameter if possible) that your devices are of the very same size,
>or upgrade to 0.7.15, which should fix the problem.
>
>  
>
The company I work for sticks to the Debian stable tree (currently Sarge, but
some machines are still on Woody) very strictly. Packages from outside this
tree are treated as suspicious and require extensive testing. Sarge
provides DRBD version 0.7.10, and of course it never broke during testing,
so it was considered stable... until yesterday. But a change to 0.7.15 is
almost impossible :-(

Tha


* Re: [Drbd-dev] Problem with DRBD0.7 on Debian Sarge.
  2005-12-21  8:11   ` Szymon Madej
@ 2005-12-21  8:56     ` Lars Ellenberg
  0 siblings, 0 replies; 4+ messages in thread
From: Lars Ellenberg @ 2005-12-21  8:56 UTC (permalink / raw)
  To: drbd-dev

/ 2005-12-21 09:11:07 +0100
\ Szymon Madej:
> Thanks for the fast answer.
> 
> >>kernel: drbd0: Resync done (total 1 sec; paused 0 sec; 0 K/sec)
> >>kernel: drbd1: sock_recvmsg returned -14
> >>kernel: drbd1: drbd1_receiver [699]: cstate SyncTarget --> BrokenPipe
> >>kernel: drbd1: short read receiving data block: read -14 expected 4096
> >>kernel: drbd1: error receiving RSDataReply, l: 4112!
> >>    
> >>
> >
> >you probably hit the bug which was fixed in 0.7.12:
> > * Fixed a connection flip-flop bug when the two peers used different
> >    user provided sizes.
> >
> >to verify this, first, do "drbdadm disconnect <bad_resource>".
> >then "drbdsetup /dev/drbdX show", as well as "cat /proc/partitions",
> >on both nodes.  compare the results.
> >
> >  
> >
> 
> And this is the second strange thing. The device sizes are identical on
> both nodes:
> primary_node# cat /proc/partitions
> ...
>    8     8   12048718 sda8
>    8     9   12851968 sda9
>    8    10    1004031 sda10
>  147     0   11917644 drbd0
>  147     1   12720896 drbd1
> 
> secondary_node# cat /proc/partitions
> ...
>    8     8   12048718 sda8
>    8     9   12851968 sda9
>    8    10    1004031 sda10
>  147     0   11917644 drbd0
>  147     1   12720896 drbd1
> 
> where drbd0 is built over sda8, drbd1 is built over sda9, sda10 is swap
> and sda1-7 are system partitions (/ /usr /home etc.). Is there any
> chance that this error could really happen?

Then maybe you hit something else.  Not obvious from the logs, though,
and I am not aware of anything else with these symptoms.

> And another thing: when the secondary went into the infinite loop trying to
> get drbd1 in sync (every attempt ended with NetworkFailure and BrokenPipe),
> drbd1, mounted on the primary as /data, hung on a listing with "ls -la". The
> quick and brutal solution was to disconnect the cross link between the
> machines on eth1 (used by DRBD), reboot both nodes, and then reconnect
> them... but this is not a good method for getting an HA cluster back into
> action, is it? :-)

drbdadm disconnect rX ; drbdadm connect rX
should have had the same effect.
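
If the disconnect/connect pair needs to be scripted, a minimal sketch could
look like this. Only the two drbdadm subcommands come from the line above;
the function names, the dry-run default, and the resource name "r1" are
assumptions for illustration.

```python
import subprocess


def reconnect_commands(resource):
    """The two drbdadm invocations, in order, for one resource."""
    return [
        ["drbdadm", "disconnect", resource],
        ["drbdadm", "connect", resource],
    ]


def reconnect(resource, dry_run=True):
    """Print the commands, or run them when dry_run is False (needs root)."""
    for argv in reconnect_commands(resource):
        if dry_run:
            print(" ".join(argv))
        else:
            # check=True stops at the first failure, unlike the ';' form
            # above, which runs "connect" even if "disconnect" failed.
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    reconnect("r1")  # "r1" is a placeholder resource name
```

The dry run only prints the commands, so it is safe to try on a machine
without DRBD installed.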

-- 
: Lars Ellenberg                                  Tel +43-1-8178292-0  :
: LINBIT Information Technologies GmbH            Fax +43-1-8178292-82 :
: Schoenbrunner Str. 244, A-1120 Vienna/Europe   http://www.linbit.com :
