From: NeilBrown
Subject: Re: [RFC 1/2]raid1: only write mismatch sectors in sync
Date: Tue, 18 Sep 2012 14:57:10 +1000
Message-ID: <20120918145710.55394bd4@notabene.brown>
References: <20120726080150.GA21457@kernel.org> <20120731155304.11c40f9b@notabene.brown> <20120911105908.51681433@notabene.brown> <20120912052941.GA15827@kernel.org>
In-Reply-To: <20120912052941.GA15827@kernel.org>
Sender: linux-raid-owner@vger.kernel.org
To: Shaohua Li
Cc: linux-raid@vger.kernel.org
List-Id: linux-raid.ids

On Wed, 12 Sep 2012 13:29:41 +0800 Shaohua Li wrote:

> On Tue, Sep 11, 2012 at 10:59:08AM +1000, NeilBrown wrote:
> > On Tue, 31 Jul 2012 16:12:04 +0800 Shaohua Li wrote:
> >
> > > 2012/7/31 NeilBrown:
> > > > On Thu, 26 Jul 2012 16:01:50 +0800 Shaohua Li wrote:
> > > >
> > > >> Write has some impacts to SSD:
> > > >> 1. wear out flash. Frequent write can speed out the progress.
> > > >> 2. increase the burden of garbage collection of SSD firmware. If no space
> > > >> left for write, SSD firmware garbage collection will try to free some space.
> > > >> 3. slow down subsequent write. After write SSD to some extents (for example,
> > > >> write the whole disk), subsequent write will slow down significantly (because
> > > >> almost every write invokes garbage collection in such case).
> > > >>
> > > >> We want to avoid unnecessary write as more as possible. raid sync generally
> > > >> involves a lot of unnecessary write. For example, even two disks don't have
> > > >> any data, we write the second disk for the whole disk size.
> > > >>
> > > >> To reduce write, we always compare raid disk data and only write mismatch part.
> > > >> This means sync will have extra IO read and memory compare. So this scheme is
> > > >> very bad for hard disk raid and sometimes SSD raid too if mismatch part is
> > > >> majority. But sometimes this can be very helpful to reduce write, in that case,
> > > >> since sync is rare operation, the extra IO/CPU usage is worthy paying. People
> > > >> who want to use the feature should understand the risk first. So this ability
> > > >> is off by default, a sysfs entry can be used to enable it.
> > > >>
> > > >> Signed-off-by: Shaohua Li
> > > >> ---
> > > >>  drivers/md/md.c    |   41 +++++++++++++++++++++++++++++++
> > > >>  drivers/md/md.h    |    3 ++
> > > >>  drivers/md/raid1.c |   70 +++++++++++++++++++++++++++++++++++++++++++++++++----
> > > >>  3 files changed, 110 insertions(+), 4 deletions(-)
> > > >>
> > > >> Index: linux/drivers/md/md.h
> > > >> ===================================================================
> > > >> --- linux.orig/drivers/md/md.h  2012-07-25 13:51:00.353775521 +0800
> > > >> +++ linux/drivers/md/md.h       2012-07-26 10:36:38.500740552 +0800
> > > >> @@ -325,6 +325,9 @@ struct mddev {
> > > >>  #define MD_RECOVERY_FROZEN 9
> > > >>
> > > >>       unsigned long                   recovery;
> > > >> +#define MD_RECOVERY_MODE_REPAIR 0
> > > >> +#define MD_RECOVERY_MODE_DISCARD 1
> > > >> +     unsigned long                   recovery_mode;
> > > >
> > > > You have not documented the meaning of these two flags at all, and I don't
> > > > feel up to guessing.
> > > >
> > > > The patch looks more complex than it should be. The behaviour you are
> > > > suggesting is exactly the behaviour you get when MD_RECOVERY_REQUESTED is
> > > > set, so at most I expect to see a few places where that flag is tested
> > > > changed to test something else as well.
> > > >
> > > > How would this be used? It affects resync and resync happens as soon as the
> > > > array is assembled. So when and how would you set the flag which says
> > > > "prefer reads to writes"? It seems like it needs to be set in the metadata.
> > > >
> > > > BTW RAID10 already does this - it reads and compares for a normal sync. So
> > > > maybe just tell people to use raid10 if they want this behaviour?
> > >
> > > It's true, the behavior likes MD_RECOVERY_REQUESTED, ie, read all disks,
> > > only do write if two disk data not match. But I hope it works as soon as the
> > > array is assembled. Surely we can set in the metadata, but I didn't want to
> > > enable it by default, since even with SSD, the read-compare-write isn't optimal
> > > some times, because it involves extra IO read and memory compare.
> > >
> > > It appears MD_RECOVERY_REQUESTED assumes disks are insync.
> > > It doesn't read disk !insync, so it doesn't work for assemble.
> > >
> > > In my mind, user frozen the sync first (with sync_action setting), and then
> > > enable read-compare-write, and finally continue the sync. I can't stop
> > > a recovery and set MD_RECOVERY_REQUESTED, so I added a flag.
> > > Anyway, please suggest what the preferred way for this.
> >
> > I guess I just don't find this functionality at all convincing. It isn't
> > clear when anyone would use it, or how they would use it. It seems best not
> > to provide it.
>
> The background is: For SSD, writing a hard formatted disk and fully filled
> disk, the first case could be 300% faster than the second case depending on SSD
> firmware and how fragmented the filled disk is. So this isn't a trivial
> performance issue.
>
> The usage model is something like this: one disk is broken, add a new disk to
> the raid. Currently we copy the whole disk of the first disk to the second. For
> SSD, if the first disk and second disk have most data identical or most content
> of the first is 0, the copy is very bad. In this scenario, we can avoid some
> copy, which will make latter write to the disk faster. Thinking about 300%
> faster :).
>
> I don't think we can do this in underlayer. We don't want to do it always.
> Detecting 0 is very expensive.
>
> And this isn't intrusive to me. We only do the 'copy avoidness' in sync, and
> sync is rare event.
>
> I hope to convince you this is a useful functionality, then we can discuss the
> implementation.
>

Let's start with "when would someone use it".

You said that you don't want to make it the default. If something isn't a
default it will not be used very often, and will only be used at all if it is
reasonably easy to use.
So how would someone use this? What would make them choose to use it, and
what action would they take to make it happen?

NeilBrown
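
For readers following the thread, the read-compare-write idea under
discussion can be sketched in userspace C roughly as follows. This is not
the code from the patch: sync_sector_range(), SECTOR_SIZE and the in-memory
buffers standing in for the two mirror members are all assumptions made for
illustration. It compares source and target one sector at a time, copies
only the sectors that differ, and returns how many it had to write.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed sector size for this sketch (512-byte sectors). */
#define SECTOR_SIZE 512

/*
 * Hypothetical helper, not part of md: bring 'dst' in sync with 'src',
 * writing only the sectors whose contents differ. The two buffers stand
 * in for the mirror members. Returns the number of sectors written.
 */
static size_t sync_sector_range(const unsigned char *src,
                                unsigned char *dst, size_t len)
{
    size_t written = 0;

    /* The sketch only handles whole sectors. */
    assert(len % SECTOR_SIZE == 0);

    for (size_t off = 0; off < len; off += SECTOR_SIZE) {
        /* Extra read + memory compare: the cost of this scheme. */
        if (memcmp(src + off, dst + off, SECTOR_SIZE) != 0) {
            /* Mismatch: this is the only case that issues a write. */
            memcpy(dst + off, src + off, SECTOR_SIZE);
            written++;
        }
    }
    return written;
}
```

With two identical (for example, both blank) members this writes nothing
at all, which is the case the proposal is aimed at; when most sectors
differ, the extra read and compare are pure overhead, as noted earlier in
the thread.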