From mboxrd@z Thu Jan 1 00:00:00 1970
From: Lon Hohberger
Date: Fri, 13 Jul 2012 16:16:41 -0400
Subject: [Cluster-devel] [PATCH] qdiskd: Make multipath issues go away
In-Reply-To: <1341237304-4691-1-git-send-email-fdinitto@redhat.com>
References: <1341237304-4691-1-git-send-email-fdinitto@redhat.com>
Message-ID: <50008229.2050900@redhat.com>
List-Id:
To: cluster-devel.redhat.com
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit

On 07/02/2012 09:55 AM, Fabio M. Di Nitto wrote:
> From: Lon Hohberger
>
> Qdiskd historically has required significant tuning to work around
> delays which occur during multipath failover, overloaded I/O, and LUN
> trespasses in both device-mapper-multipath and EMC PowerPath
> environments.
>
> This patch goes a very long way towards eliminating false evictions
> when these conditions occur by making qdiskd whine to the other
> cluster members when it detects hung system calls. When a cluster
> member whines, it indicates the source of the problem (which system
> call is hung), and the act of receiving a whine from a host indicates
> that qdiskd is operational, but that I/O is hung. Hung I/O is different
> from losing storage entirely (where you get I/O errors).
>
> Possible problems:
>
> - Receive queue getting very full, causing messages to become blocked on
>   a node where I/O is hung. 1) that would take a very long time, and 2)
>   node should get evicted at that point anyway.
>
> Resolves: rhbz#782900
>
> This version of the patch is a backport of:
> e2937eb33f224f86904fead08499a6178868ca6a
> 34d2872fb7e60be1594158acaaeb8acd74f78d22
>
> There is a minor change vs original patch based on how qdiskd
> in RHEL5 handles cman connection. We add an extra call to cman_alive
> in main qdisk_loop to make sure data are not stalled on the
> cman port, and that the data_callback to qdiskd_whine is executed.
>
> Signed-off-by: Lon Hohberger
> Signed-off-by: Fabio M. Di Nitto

Re-ack :)
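
For readers skimming the archive, the whine idea in the quoted commit message can be sketched roughly as below. This is only an illustration in Python, not the qdiskd code (which is C). The names `SyscallWatch`, `Whine`, and `peer_state`, the thresholds, and the three-way "ok" / "io_hung" / "evict" result are all assumptions made up for this sketch. The idea is that a node notes which system call it has entered, a monitor notices when that call has been outstanding too long and produces a whine naming it, and peers treat a recent whine as evidence that the daemon is alive even though its disk updates have stopped.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Whine:
    """Hypothetical message a node sends to its peers when a call is stuck."""
    node_id: int
    syscall: str       # which system call appears hung
    stuck_for: float   # seconds the call has been outstanding


class SyscallWatch:
    """Tracks the single system call the I/O path is currently inside."""

    def __init__(self, node_id: int, threshold: float):
        self.node_id = node_id
        self.threshold = threshold  # seconds before a call counts as hung
        self._name: Optional[str] = None
        self._since = 0.0

    def enter(self, syscall: str, now: float) -> None:
        # Called just before issuing the system call.
        self._name = syscall
        self._since = now

    def leave(self) -> None:
        # Called when the system call returns.
        self._name = None

    def poll(self, now: float) -> Optional[Whine]:
        # Called periodically from a separate monitor; returns a whine
        # only if a call is in flight and has exceeded the threshold.
        if self._name is None:
            return None
        stuck = now - self._since
        if stuck < self.threshold:
            return None
        return Whine(self.node_id, self._name, stuck)


def peer_state(last_disk_update: float, last_whine: Optional[float],
               now: float, timeout: float) -> str:
    """How a peer might classify a node (assumed policy, not qdiskd's)."""
    if now - last_disk_update <= timeout:
        return "ok"        # disk heartbeat is current
    if last_whine is not None and now - last_whine <= timeout:
        return "io_hung"   # daemon is alive, I/O is stuck: hold off eviction
    return "evict"         # neither heartbeat nor whine: node is gone
```

The point of the split is the one made in the commit message: hung I/O and a dead node look the same on the quorum disk, so the whine has to travel over a separate channel (the cluster messaging layer) to tell them apart.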