* [Cluster-devel] DLM + SCTP bug (was Re: [DRBD-user] kernel panic with DRBD: solved)
       [not found] ` <4E6E36C9.9070401@gmx.net>
@ 2011-09-12 17:18   ` Fabio M. Di Nitto
  2011-09-12 18:00     ` David Teigland
  0 siblings, 1 reply; 2+ messages in thread
From: Fabio M. Di Nitto @ 2011-09-12 17:18 UTC
  To: cluster-devel.redhat.com

On 9/12/2011 6:43 PM, Florian Haas wrote:
> On 2011-09-12 17:52, Roth, Steven (ESS Software) wrote:
>> I have attempted to debug the kernel panic that I reported on this list
>> last week, which has been reported by several others as well.  The panic
>> happens when DRBD is used in clusters based on corosync (either RHCS or
>> Pacemaker), but only when those clusters are configured with multiple
>> heartbeats (i.e., with "altname" specifications for the cluster nodes).
>> The panic appears to be caused by two defects, one in the distributed
>> lock manager (DLM, used by corosync) and one in the SCTP network
>> protocol (which is used in clusters with multiple heartbeats).  DRBD
>> code triggers the panic but appears to be blameless for it.
>>
>>  
>>
>> Disclaimer:  I am not a Linux kernel expert; all of my kernel debugging
>> expertise is on a different flavor of Unix.  My assumptions or
>> conclusions may be incorrect; I do not guarantee 100% accuracy of this
>> analysis.  Caveat lector.
>>
>>  
>>
>> Environment:  As will be clear from the analysis below, this defect can
>> manifest in many ways.  I debugged a particular manifestation that
>> occurred with DRBD 8.4.0 running on kernel 2.6.32-71.29.1.el6.x86_64
>> (i.e., RHEL/CentOS 6.0).  The manifestation I debugged was running a two
>> node cluster, shutting down node A and starting it back up.  Node B
>> panics as soon as Node A starts back up.  (See my previous mail for the
>> defect signature.)
>>
>>  
>>
>> When the cluster starts up, it creates a DLM "lockspace".  This causes
>> the DLM code to create a socket for communication with the other nodes.
>> Since we're configured for multiple heartbeats, it's an SCTP socket.
>> DLM also creates a bunch of new kernel threads, among which is the
>> dlm_recv thread, which listens for traffic on that socket.  (Actually I
>> see two of them, one per CPU.)  You can see this in a "ps" listing.
>>
>>  
>>
>> An important thing to note here is that all kernel threads are part of
>> the same pseudo-process, and as such, they all share the same set of
>> file descriptors.  However, kernel threads do not normally (ever?) use
>> file descriptors; they tend to work with file structures directly.  The
>> SCTP socket created above, for example, has the appropriate in-kernel
>> socket structure, file structure, and inode structure, but it does not
>> have a file descriptor.  That's as it should be.
>>
>>  
>>
>> When node A starts back up, the SCTP protocol notices this (as it's
>> supposed to), and delivers an SCTP_ASSOC_CHANGE / SCTP_RESTART
>> notification to the SCTP socket, telling the socket owner (the dlm_recv
>> thread) that the other node has restarted.  DLM responds by telling SCTP
>> to create a clone of the master socket, for use in communicating with
>> the newly restarted node.  (This is an SCTP_SOCKOPT_PEELOFF request.) 
>> And this is where things go wrong: the SCTP_SOCKOPT_PEELOFF request is
>> designed to be called from user space, not from a kernel thread, and so
>> it /does/ allocate a file descriptor for the new socket.  Since DLM is
>> calling it from a kernel thread, the kernel thread now has an open file
>> descriptor (#0) to that socket.  And since kernel threads share the same
>> file descriptor table, every kernel thread on the system has this open
>> descriptor.  So defect #1 is that DLM is calling an SCTP user-space
>> interface from a kernel thread, which results in pollution of the kernel
>> thread file descriptor table.
>>
>>  
>>
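A minimal user-space sketch of why the new socket would land in slot 0 (an illustration only, not DLM or SCTP code): POSIX hands out the lowest-numbered free descriptor, so a context with nothing open gets fd 0 from its first allocation. The helper name and the use of /dev/null as a stand-in for the peeled-off socket are assumptions made for this example.

```python
import subprocess
import sys

# Script run in a child process. Closing fd 0 leaves the lowest slot
# empty, loosely imitating a context that has no descriptors open.
CHILD = (
    "import os;"
    "os.close(0);"                              # free slot 0
    "fd = os.open(os.devnull, os.O_RDONLY);"    # next descriptor allocation
    "print(fd)"
)


def first_fd_after_closing_stdin():
    """Return the descriptor number the child receives for its next open()."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD],
        stdin=subprocess.DEVNULL,   # make sure the child starts with a valid fd 0
        stdout=subprocess.PIPE,
        text=True,
        check=True,
    )
    return int(result.stdout)


if __name__ == "__main__":
    print(first_fd_after_closing_stdin())
```

Any call that allocates a descriptor follows the same lowest-free rule, which is consistent with the peeled-off socket showing up as descriptor #0.
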
>> Meanwhile, DRBD has its own in-kernel code, running in a different
>> kernel thread.  And it detects (I didn't bother to track down how) that
>> its peer is back online.  DRBD allows the user to configure handlers for
>> events like that: user space programs that should be called when such an
>> event occurs.  So when DRBD notices that its peer is back, its kernel
>> thread uses call_usermodehelper() to start a user-space instance of drbdadm
>> to invoke any appropriate handlers.  This is the invocation of drbdadm
>> that we see in the panic report.  (drbdadm gets invoked this way in
>> response to a number of other possible events, as well, so this panic
>> can manifest itself in other ways.)
>>
>>  
>>
>> The key thing about this instance of drbdadm is that it was invoked by a
>> kernel thread.  Therefore it shouldn't have any open file descriptors,
>> but in this case it does: it inherits fd 0 pointing to the SCTP
>> socket.  One of the first things that drbdadm does, when starting up, is
>> call isatty(stdin) to find out how it should format its output.  If it
>> were called from user space, that would correctly check whether standard
>> input was interactive.  If it were called correctly from a kernel
>> thread, there would be no stdin and it would correctly return an error. 
>> But what actually happens is that it calls isatty on the SCTP socket
>> that is (incorrectly) in file descriptor 0.
>>
>>  
>>
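The inheritance described above can be reproduced harmlessly in user space. What follows is a sketch, assuming a POSIX system and an ordinary (healthy) socket pair rather than a peeled-off SCTP socket; the function name is made up for the example. The child is started with a socket as its fd 0 and then does what drbdadm does first: it asks isatty() about standard input. isatty() is a terminal ioctl underneath, so on a healthy socket it simply fails and reports "not a tty".

```python
import socket
import subprocess
import sys

# Script run in the child: report whether fd 0 is a socket and what
# isatty() says about it.
CHILD = (
    "import os, stat;"
    "mode = os.fstat(0).st_mode;"
    "print(stat.S_ISSOCK(mode), os.isatty(0))"
)


def child_view_of_stdin():
    """Spawn a child whose fd 0 is a socket; return (is_socket, is_tty)."""
    parent_end, child_end = socket.socketpair()
    try:
        out = subprocess.run(
            [sys.executable, "-c", CHILD],
            stdin=child_end,        # the child inherits this socket as fd 0
            stdout=subprocess.PIPE,
            text=True,
            check=True,
        ).stdout.split()
    finally:
        parent_end.close()
        child_end.close()
    return out[0] == "True", out[1] == "True"


if __name__ == "__main__":
    print(child_view_of_stdin())
```

With a healthy socket the ioctl path returns an error and nothing else happens, so the inherited descriptor on its own is not enough to bring the machine down.
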
>> When ioctl is called on a socket, the sock_ioctl() function dereferences
>> the socket data structure pointer (sk).  Defect #2 is that the offending
>> socket in this case has a null sk pointer.  (I did not track down why,
>> but presumably it's a problem with the SCTP peel-off code.)  So when
>> sock_ioctl() dereferences the pointer, the kernel panics.
>>
>>  
>>
>> So, to recap:  this panic occurs because (a) the drbdadm process is
>> erroneously given an SCTP socket as its standard input, and (b) that
>> socket's data pointer is null, so it panics when drbdadm (reasonably)
>> makes an ioctl call on its standard input.
>>
>>  
>>
>> If you need a workaround for this panic, the best I can offer is to
>> remove the "altname" specifications from the cluster configuration, set
>> <totem rrp_mode="none"> and <dlm protocol="tcp">, so that corosync uses
>> TCP sockets instead of SCTP sockets.
> 
> Wow. Pretty awesome analysis. This is something that the openais
> (Corosync) mailing list should know about, but it currently seems
> affected by the LF breach/outage. Thus, I'm CC'ing two key people here
> directly. Fabio and Steve, could you take a look into this please?

I am CC'ing David and cluster-devel. David maintains DLM in kernel and userland.

A few quick notes about using RRP/altname in a more general fashion.

RRP/altname is expected to be in Technology Preview state starting with
RHEL6.2 (the technology will be there for users to test/try but not
officially supported for production yet). We have not done a lot of
intensive testing on the overall RHCS stack yet (except corosync, which
btw does not use DLM), so there might be (== there are) bugs that we will
have to address. Packages in RHEL6.2/CentOS6.2 will have reasonable
defaults and are expected to work better (though far from bug-free)
than those in RHEL6.0/CentOS6.0.

It will take some time before the whole stack is fully
tested/supported in such an environment, but it is a work in progress.

This report is extremely useful and will surely speed things up a lot.

Thanks
Fabio




* [Cluster-devel] DLM + SCTP bug (was Re: [DRBD-user] kernel panic with DRBD: solved)
  2011-09-12 17:18   ` [Cluster-devel] DLM + SCTP bug (was Re: [DRBD-user] kernel panic with DRBD: solved) Fabio M. Di Nitto
@ 2011-09-12 18:00     ` David Teigland
  0 siblings, 0 replies; 2+ messages in thread
From: David Teigland @ 2011-09-12 18:00 UTC
  To: cluster-devel.redhat.com

> >> When node A starts back up, the SCTP protocol notices this (as it's
> >> supposed to), and delivers an SCTP_ASSOC_CHANGE / SCTP_RESTART
> >> notification to the SCTP socket, telling the socket owner (the dlm_recv
> >> thread) that the other node has restarted.  DLM responds by telling SCTP
> >> to create a clone of the master socket, for use in communicating with
> >> the newly restarted node.  (This is an SCTP_SOCKOPT_PEELOFF request.) 
> >> And this is where things go wrong: the SCTP_SOCKOPT_PEELOFF request is
> >> designed to be called from user space, not from a kernel thread, and so
> >> it /does/ allocate a file descriptor for the new socket.  Since DLM is
> >> calling it from a kernel thread, the kernel thread now has an open file
> >> descriptor (#0) to that socket.  And since kernel threads share the same
> >> file descriptor table, every kernel thread on the system has this open
> >> descriptor.  So defect #1 is that DLM is calling an SCTP user-space
> >> interface from a kernel thread, which results in pollution of the kernel
> >> thread file descriptor table.

Thanks for that analysis.  As you point out, SCTP is only ever really used
or tested from user space, not from the kernel like the dlm does.  So I'm
not surprised to hear about problems like this.  I don't know how
difficult it might be to fix that.  I'd also expect to find other problems
like it with dlm+sctp.  Some time and attention from someone experienced
is probably needed to move the dlm's sctp support beyond experimental.

Dave


