Subject: Re: btrfs-progs 4.4 re-balance of RAID6 is very slow / limited to one cpu core?
From: Christian Rohmann
To: Chris Murphy
Cc: "Austin S. Hemmelgarn", linux-btrfs
Date: Tue, 9 Feb 2016 14:48:14 +0100
Message-ID: <56B9EE1E.2040000@netcologne.de>

On 02/01/2016 09:52 PM, Chris Murphy wrote:
>> Would some sort of stracing or profiling of the process help to narrow
>> down where the time is currently spent and why the balancing is only
>> running single-threaded?
> This can't be straced. Someone a lot more knowledgeable than I am
> might figure out where all the waits are with just a sysrq + t, if it
> is a hold up in say parity computations. Otherwise perf which is a
> rabbit hole but perf top is kinda cool to watch. That might give you
> an idea where most of the cpu cycles are going if you can isolate the
> workload to just the balance. Otherwise you may end up with noisy
> data.

My balance run has now been working away since the 19th of January:

"885 out of about 3492 chunks balanced (996 considered), 75% left"

So this will take several more WEEKS to finish. Is there really nothing
anyone here wants me to do or analyze to help find the root cause of
this? I mean with this kind of performance there is no way a RAID6 can
be used in production.
Not because the code is not stable or functioning, but because regular
maintenance like replacing a drive or growing an array takes WEEKS,
during which another maintenance procedure could become necessary or,
much worse, another drive might fail.

What I'm saying is: such a slow RAID6 balance renders the redundancy
unusable, because drives might fail faster than a rebuild (read
"balance") can complete.

Regards

Christian
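
As a rough check on the "several more WEEKS" figure, here is a minimal sketch that extrapolates from the status line quoted above. It assumes the balance started on 19 January 2016 (the year is inferred from the date of this mail), that every chunk takes about the same time, and that the rate stays constant. The function name `estimate_remaining_days` is purely illustrative and is not part of btrfs-progs.

```python
from datetime import date


def estimate_remaining_days(done, total, start, now):
    """Linear extrapolation: remaining chunks divided by chunks per day so far."""
    elapsed_days = (now - start).days
    if done <= 0 or elapsed_days <= 0:
        raise ValueError("need at least one finished chunk and one elapsed day")
    chunks_per_day = done / elapsed_days
    return (total - done) / chunks_per_day


# Figures from the status line: 885 of about 3492 chunks balanced.
# Start date is an assumption (19 January 2016); "now" is the date of the mail.
remaining = estimate_remaining_days(
    done=885,
    total=3492,
    start=date(2016, 1, 19),
    now=date(2016, 2, 9),
)
print(f"about {remaining:.0f} more days ({remaining / 7:.1f} weeks)")
```

Under those assumptions the balance has managed roughly 42 chunks per day, which leaves about two more months for the remaining chunks.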