From: NeilBrown
Subject: Re: question about MD raid rebuild performance degradation even with speed_limit_min/speed_limit_max set.
Date: Wed, 29 Oct 2014 13:57:49 +1100
Message-ID: <20141029135749.241f9e50@notabene.brown>
References: <5445332B.9060009@cse.yorku.ca> <5445361E.3010503@cse.yorku.ca> <5445799A.8020205@cse.yorku.ca> <20141029093822.79242658@notabene.brown> <5450521F.8060309@cse.yorku.ca>
In-Reply-To: <5450521F.8060309@cse.yorku.ca>
To: Jason Keltz
Cc: linux-raid@vger.kernel.org
List-Id: linux-raid.ids

On Tue, 28 Oct 2014 22:34:07 -0400 Jason Keltz wrote:

> On 28/10/2014 6:38 PM, NeilBrown wrote:
> > On Mon, 20 Oct 2014 17:07:38 -0400 Jason Keltz wrote:
> >
> >> On 10/20/2014 12:19 PM, Jason Keltz wrote:
> >>> Hi.
> >>>
> >>> I'm creating a 22 x 2 TB SATA disk MD RAID10 on a new RHEL6 system.
> >>> I've experimented with setting the "speed_limit_min" and
> >>> "speed_limit_max" kernel variables so that I get the best balance of
> >>> performance during a RAID rebuild of one of the RAID1 pairs.  If, for
> >>> example, I set speed_limit_min AND speed_limit_max to 80000 and then
> >>> fail a disk when there is no other disk activity, I get a rebuild rate
> >>> of around 80 MB/s.  However, if I then start a write-intensive
> >>> operation on the MD array (e.g. a dd, or a mkfs on an LVM logical
> >>> volume created on that MD), my write operation seems to get "full
> >>> power" and my rebuild drops to around 25 MB/s.  This means that the
> >>> rebuild of my RAID10 disk is going to take a huge amount of time
> >>> (>12 hours!!!).  When I set speed_limit_min and speed_limit_max to the
> >>> same value, am I not guaranteeing the rebuild speed?  Is this a bug
> >>> that I should be reporting to Red Hat, or a "feature"?
> >>>
> >>> Thanks in advance for any help that you can provide...
> >>>
> >>> Jason.
> >>
> >> I would like to add that I downloaded the latest version of Ubuntu and
> >> am running it on the same server with the same MD.
> >> When I set speed_limit_min and speed_limit_max to 80000, I was able to
> >> start two large dds on the md array, and the rebuild stuck at around
> >> 71 MB/s, which is close enough.  This leads me to believe that the
> >> problem above is probably a RHEL6 issue.  However, after I stopped the
> >> two dd operations and raised both speed_limit_min and speed_limit_max
> >> to 120000, the rebuild stayed between 71-73 MB/s for more than 10
> >> minutes... now it seems to be at 100 MB/s, but it doesn't seem to get
> >> any higher (even though I had 120 MB/s and above on the RHEL system
> >> without any load)... Hmm.
> >>
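
[For reference, the "speed_limit_min" and "speed_limit_max" being adjusted
above are the global md tunables in /proc/sys/dev/raid/ (each array also has
its own sync_speed_min and sync_speed_max under /sys/block/mdX/md/).  Setting
both limits to 80000 from a small program, rather than with echo, would look
roughly like the sketch below.  It is purely illustrative - the write_limit()
helper is made up for the example - and the values are in the same KB/sec
units the files themselves use.]

/* Illustrative only: set the global md resync limits to 80000 KB/sec,
 * i.e. the same effect as
 *   echo 80000 > /proc/sys/dev/raid/speed_limit_min   (and _max)
 */
#include <stdio.h>

static int write_limit(const char *path, long kbps)
{
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		return -1;
	}
	fprintf(f, "%ld\n", kbps);
	return fclose(f);
}

int main(void)
{
	write_limit("/proc/sys/dev/raid/speed_limit_min", 80000);
	write_limit("/proc/sys/dev/raid/speed_limit_max", 80000);
	return 0;
}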

> > md certainly cannot "guarantee" any speed - it can only deliver what
> > the underlying devices deliver.
> > I know the kernel logs say something about a "guarantee".  That was
> > added before my time and I haven't had occasion to remove it.
> >
> > md will normally just try to recover as fast as it can unless that
> > exceeds one of the limits - then it will back off.
> > What speed is actually achieved depends on other load and the behaviour
> > of the IO scheduler.
> >
> > "RHEL6" and "Ubuntu" don't mean a lot to me.  A specific kernel version
> > might, though in the case of Redhat I know they backport lots of stuff,
> > so even the kernel version isn't very helpful.  I much prefer having
> > reports against mainline kernels.
> >
> > Rotating drives do get lower transfer speeds at higher addresses.  That
> > might explain the 120 / 100 difference.
>
> Hi Neil,
> Thanks very much for your response.
> I must say that I'm a little puzzled though.  I'm coming from a 3Ware
> hardware RAID controller where I could configure how much of the disk
> bandwidth is to be used for a rebuild versus I/O.  From what I
> understand, you're saying that MD can only use the disk bandwidth
> available to it.  It seems that it doesn't take any priority in the I/O
> chain.  It will only attempt to use no less than the min bandwidth and
> no more than the max bandwidth for the rebuild, but if you're on a busy
> system and other system I/O needs that disk bandwidth, then there's
> nothing it can do about it.  I guess I just don't understand why.  Why
> can't md be given a priority in the kernel, so the admin can decide how
> much bandwidth goes to system I/O versus rebuild I/O?  Even on a busy
> system, I still want to allocate at least some minimum bandwidth to MD.
> In fact, in the event of a disk failure, I want a whole lot of the disk
> bandwidth dedicated to MD.  It's a case of short-term pain for long-term
> gain.  I'd rather not have the users suffer at all, but if they do have
> to suffer, I'd rather they suffer for a few hours, knowing that after
> that the RAID system is in a perfectly good state with no bad disks, as
> opposed to letting a bad-disk resync take days because the system is
> really busy... days during which another failure might occur!
>
> Jason.

It isn't so much "that MD can only use ..." but rather "that MD does only
use ...".  This is how the code has "always" worked, and no-one has ever
bothered to change it, or asked for it to be changed (that I recall).

There are difficulties in guaranteeing a minimum when the array uses
partitions from devices on which other partitions are used for other
things.  In that case I don't think it is practical to make guarantees,
but that needn't stop us making guarantees when we can, I guess.

If the configured bandwidth exceeded the physically available bandwidth we
wouldn't want to exclude non-resync IO completely, so the guarantee would
have to be:

   N MB/sec or M% of available, whichever is less

We could even implement this different approach in a backwards-compatible
way.  Introduce a new setting, "max_sync_percent".  By default it is unset
and the current algorithm applies.  If it is set to something below 100,
non-resync IO is throttled to an appropriate fraction of the actual resync
throughput whenever that is below sync_speed_min.  Or something like that.

Some care would be needed in comparing throughputs, as sync throughput is
measured per-device while non-resync throughput might be measured
per-array.  Maybe the throttling would happen per-device??

All we need now is for someone to firm up the design and then write the
code.
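
[To make that slightly more concrete, the core decision might look something
like the fragment below.  It is a sketch only, not md code: max_sync_percent,
resync_limits and the helpers are hypothetical names, and whether the
accounting should be per-device or per-array is exactly the open question
above.]

/* Sketch only - none of these names exist in md today. */
struct resync_limits {
	long sync_kbps;        /* measured resync throughput, KB/sec */
	long total_kbps;       /* measured total device throughput, KB/sec */
	long sync_speed_min;   /* the existing minimum setting, KB/sec */
	int  max_sync_percent; /* proposed knob; 0 = unset, keep current behaviour */
};

/* The guarantee: N KB/sec or M% of what is actually available,
 * whichever is less. */
static long sync_guarantee(const struct resync_limits *l)
{
	long pct = l->total_kbps * l->max_sync_percent / 100;

	return pct < l->sync_speed_min ? pct : l->sync_speed_min;
}

/* Throttle non-resync IO only while resync is falling short of that
 * guarantee; otherwise leave it alone, as now. */
static int should_throttle_normal_io(const struct resync_limits *l)
{
	if (l->max_sync_percent <= 0)
		return 0;
	return l->sync_kbps < sync_guarantee(l);
}

How hard to throttle once that test fires is the part that still needs the
real design work.
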
NeilBrown