From: Mrten
Date: Fri, 08 Jul 2011 20:37:07 +0200
To: drbd-dev@lists.linbit.com
Subject: Re: [Drbd-dev] [PATCH 2/2] expand section on throughput tuning to highlight prime usecase of external metadata
Message-ID: <4E174E53.5000106@ii.nl>
In-Reply-To: <4E17116D.4080305@linbit.com>
References: <1310053461-15060-1-git-send-email-mrten+drbd@ii.nl> <1310053461-15060-2-git-send-email-mrten+drbd@ii.nl> <4E17116D.4080305@linbit.com>

On 08-07-2011 16:17:17, Florian Haas wrote:

> You're adding a third item to the enumeration; so it would be nice if
> you could also rephrase the next paragraph which talks about "the
> minimum between the two".

Will do.

> You're talking about a battery backup of a cache that is not there.
> Does not compute. :)

So true, will fix ;)

>> DRBD metadata updates necessary to guarantee + data-completeness
>> in case of failure can slow down + write throughput significantly.
>> If a raw device is normally capable of + 250 MB/s write throughput
>> it is not an anomaly to see writes as slow as + 70 MB/s with DRBD
>> enabled (numbers are for rotational disks). This is + purely
>> caused by head seeks; 4MB data updates have to be followed by
>> metadata updates + and the data-writes can only continue after the
>> metadata has been reached the + platters (caching and write
>> reordering does not help).
>
> I'm afraid you're missing some context here. DRBD performs the
> synchronous meta data updates you are referring to only when an AL
> extent goes hot or cold. It doesn't do so randomly or, as your
> paragraph seems to imply to a casual reader, every time it has
> written 4M of data.
>
> And it is definitely _not_ normal to see 250MB/s write bandwidth drop
> to 70 MB/s. 110 MB/s would be entirely normal if you are replicating
> over Gigagit Ethernet, but that is determined by the bandwidth of the
> replication link, it doesn't have much to do with AL updates.

I think I should explain what I was trying to convey, or rather, my
mental image of what happened while I was benchmarking (and saw that
huge performance drop).

My backing device for DRBD is a software raid-0 (two disks), with
'meta-disk internal'. Benchmarking was done by dd'ing a few gigs from
/dev/zero.

All this dd-writing makes a lot of new extents hot (one for every 4MB
written?), and each of those has to be recorded in the metadata with a
synchronous write. Since my backing device is raid-0 and the default
chunk size for that is rather large these days, the (small) metadata
updates aren't spread over the raid-0 disks but are concentrated on one
device, which becomes the bottleneck for the benchmark because it has
to seek all the time.

This is not a cause for concern when you have a hardware battery-backed
cache, as the raid controller can then delay writing the metadata, but
I don't have that.
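
To make the striping argument above concrete, here is a minimal sketch
in shell arithmetic of how a raid-0 maps writes onto member disks. The
512 KiB chunk size, the 4 KiB metadata write size and the offsets are
assumptions chosen for illustration, not values measured on this setup,
and the function names are made up for this example.

```shell
#!/bin/sh
# Assumed layout, for illustration only: 2 member disks, 512 KiB chunks.
NDISKS=2
CHUNK=$((512 * 1024))

# member_for_offset OFFSET
# Prints the index of the member disk that holds the given byte offset.
member_for_offset() {
    echo $(( ($1 / CHUNK) % NDISKS ))
}

# members_for_write OFFSET LENGTH
# Prints how many distinct member disks a single write touches.
members_for_write() {
    first=$(( $1 / CHUNK ))
    last=$(( ($1 + $2 - 1) / CHUNK ))
    span=$(( last - first + 1 ))
    if [ "$span" -gt "$NDISKS" ]; then span=$NDISKS; fi
    echo "$span"
}

# A 4 MiB streaming write covers 8 chunks, so both disks share the work.
members_for_write 0 $((4 * 1024 * 1024))
# A 4 KiB write at a hypothetical metadata offset stays inside one chunk.
members_for_write $((1000 * CHUNK)) 4096
```

With these assumed numbers the first call reports two disks and the
second reports one, which matches the picture of the linear data writes
being striped while every small synchronous update lands on a single
disk.
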

I've blktrace-d, blkparse-d and seekwatcher-ed the hell out of this and
the images show exactly that happening, so I dared to write it up like
this without having read the source ;). Lots of linear writes,
regularly interrupted by a seek to synchronously write the metadata.

The slowdown wasn't caused by the interconnect between primary and
secondary: the 70MB/s was measured both in StandAlone and UpToDate (I
bonded 3 GE interfaces for nice syncing bandwidth). And it was pure
benchmarking, with nothing else happening on the server, so I'd expect
that only the benchmark made extents hot.

I of course do not know the exact criteria that mark extents hot; if
what I described above is not an accurate description of what happens,
please correct me.

But the reason I think this should be in the docs is that I reckon that
lots of people would like to build raid-0 + "network raid-1" with
relatively cheap hardware, do the simplest of benchmarks and get
confused by the slowdown. Googling this, I saw the subject come up on
the mailing list a couple of times.

> And what you mean by "caching and write reordering does not help" I
> don't understand at all, can you elaborate please?

The synchronous (barrier?) writes for the metadata, as far as I
understand it from a mailing list post from Lars, *must* have reached
the platters before the linear dd-writing can continue. So no enabling
of write caches, NCQ or tuning of elevators is going to help.

However, if you think that the paragraph now implies that *every* write
randomly makes extents hot then I should do some polishing ;)

>> +[[s-tune-external-metadata]]
[...]
> This section would be ok, but it's still missing the steps to dump
> the existing metadata and restore it onto the new metadata device.
> Can you add that and repost the patch please?

Ah, I hadn't thought of that scenario (am using a raid-1 for the
metadata).
Is this along the lines of:

  drbdadm down [resource]
  drbdadm dump-md [resource] > savefile
  [change meta-disk]
  drbdmeta /dev/drbdX v08 [metadevice] 0 restore-md savefile

?

Is the index 0 correct usage when using flexible-meta-disk?

Maarten.