From mboxrd@z Thu Jan 1 00:00:00 1970
From: Piergiorgio Sartor
Subject: Re: SSD - TRIM command
Date: Wed, 9 Feb 2011 20:21:02 +0100
Message-ID: <20110209192101.GA20745@lazy.lzy>
References: <20110209161916.GB8632@bounceswoosh.org> <20110209171744.GC8632@bounceswoosh.org> <20110209182426.GA2724@lazy.lzy> <20110209183814.GA7142@lazy.lzy> <20110209191359.GB7169@lazy.lzy>
Mime-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: QUOTED-PRINTABLE
Return-path:
Content-Disposition: inline
In-Reply-To:
Sender: linux-raid-owner@vger.kernel.org
To: Roberto Spadim
Cc: Piergiorgio Sartor, "Eric D. Mudama", "Scott E. Armitage", David Brown, linux-raid@vger.kernel.org
List-Id: linux-raid.ids

> yeah =)
> a question...
> if i send a TRIM to a sector
> if i read from it
> what do i have?
> 0x00000000000000000000000000000000000 ?
> if yes, we could translate TRIM to WRITE on devices without TRIM (hard disks)
> just to have the same READ information

It seems that returning 0x0 is not a standard. Return
values seem to be quite undefined, even if 0x0 *might*
be common.

Second, why do you want to emulate the 0x0 thing?

I do not see the point of writing zeros on a device
which does not support TRIM. Just doing nothing seems
a better choice, even in a mixed environment.

bye,

pg

> 2011/2/9 Piergiorgio Sartor:
> >> it's just a discussion, right? no implementation yet, right?
> >
> > Of course...
> >
> >> what i think....
> >> if device accepts TRIM, we can use TRIM.
> >> if not, we must translate TRIM to something similar (maybe many WRITES
> >> ?), and when we READ from disk we get the same information
> >
> > TRIM is not about writing at all. TRIM tells the
> > device that the addressed block is no longer used,
> > so it (the SSD) can do whatever it wants with it.
> >
> > The only software layer having the same "knowledge"
> > is the filesystem; the other layers do not have
> > any decision power about the block allocation.
> > Except for metadata, of course.
> >
> > So, IMHO, a software TRIM can only be in the FS.
> >
> > bye,
> >
> > pg
> >
> >> the translation could be done by kernel (not md) maybe options on
> >> libata, nbd device....
> >> other option is to do it with md, internal (md) TRIM translate function
> >>
> >> who sends trim?
> >> internal md information: md can generate it (if necessary, maybe it's
> >> not...) for parity disks (not data disks)
> >> filesystem/or another upper layer program (database with direct device
> >> access), we could accept TRIM from filesystem/database, and send it to
> >> disks/mirrors, when necessary translate it (internal or kernel
> >> translate function)
> >>
> >>
> >> 2011/2/9 Piergiorgio Sartor:
> >> > On Wed, Feb 09, 2011 at 04:30:15PM -0200, Roberto Spadim wrote:
> >> >> nice =)
> >> >> but check that parity block is a raid information, not a filesystem information
> >> >> for raid we could implement trim when possible (like swap)
> >> >> and implement a trim that we receive from filesystem, and send to all
> >> >> disks (if it's a raid1 with mirrors, we should send to all mirrors)
> >> >
> >> > To all disks also in case of RAID-5?
> >> >
> >> > What if the TRIM belongs only to a single SSD block
> >> > belonging to a single chunk of a stripe?
> >> > That is a *single* SSD of the RAID-5.
> >> >
> >> > Should md re-read the block and re-write (not TRIM)
> >> > the parity?
> >> >
> >> > I think anything that has to do with checking &
> >> > repairing must be carefully considered...
> >> >
> >> > bye,
> >> >
> >> > pg
> >> >
> >> >> i don't know what trim does very well, but i think it's a very big write
> >> >> with only some bits for example:
> >> >> set sector1='00000000000000000000000000000000000000000000000000'
> >> >> could be replaced by:
> >> >> trim sector1
> >> >> it's faster for sata communication, and it's a good information for
> >> >> hard disk (it can put a single '0' at the start of the sector and know
> >> >> that all sector is 0, if it tries to read any information it can use
> >> >> internal memory (don't read hard disk), if a write is done it should
> >> >> write 0000 to bits, and after the write operation, but it's
> >> >> internal function of hard disk/ssd, not a problem of md raid... md
> >> >> raid should need know how to optimize and use it =] )
> >> >>
> >> >> 2011/2/9 Piergiorgio Sartor:
> >> >> >> ext4 sends trim commands to device (disk/md raid/nbd)
> >> >> >> kernel swap sends these commands (when possible) to device too
> >> >> >> for internal raid5 parity disk this could be done by md, for data
> >> >> >> disks this should be done by ext4
> >> >> >
> >> >> > That's an interesting point.
> >> >> >
> >> >> > On which basis should a parity "block" get a TRIM?
> >> >> >
> >> >> > If you ask me, I think the complete TRIM story is, at
> >> >> > best, a temporary patch.
> >> >> >
> >> >> > IMHO the wear levelling should be handled by the filesystem
> >> >> > and, with awareness of this, by the underlying device drivers.
> >> >> > Reason is that the FS knows better what's going on with the
> >> >> > blocks and what will happen.
> >> >> >
> >> >> > bye,
> >> >> >
> >> >> > pg
> >> >> >
> >> >> >>
> >> >> >> the other question... about resync with only write what is different
> >> >> >> this is very good since write and read speed can be different for ssd
> >> >> >> (hd don't have this 'problem')
> >> >> >> but i'm sure that just write what is diff is better than write all
> >> >> >> (ssd life will be bigger, hd maybe... i think that will be bigger too)
> >> >> >>
> >> >> >>
> >> >> >> 2011/2/9 Eric D. Mudama:
> >> >> >> > On Wed, Feb 9 at 11:28, Scott E. Armitage wrote:
> >> >> >> >>
> >> >> >> >> Who sends this command? If md can assume that determinate mode is
> >> >> >> >> always set, then RAID 1 at least would remain consistent. For RAID 5,
> >> >> >> >> consistency of the parity information depends on the determinate
> >> >> >> >> pattern used and the number of disks. If you used determinate
> >> >> >> >> all-zero, then parity information would always be consistent, but this
> >> >> >> >> is probably not preferable since every TRIM command would incur an
> >> >> >> >> extra write for each bit in each page of the block.
> >> >> >> >
> >> >> >> > True, and there are several solutions. Maybe track space used via
> >> >> >> > some mechanism, such that when you trim you're only trimming the
> >> >> >> > entire stripe width so no parity is required for the trimmed regions.
> >> >> >> > Or, trust the drive's wear leveling and endurance rating, combined
> >> >> >> > with SMART data, to indicate when you need to replace the device
> >> >> >> > preemptive to eventual failure.
> >> >> >> >
> >> >> >> > It's not an unsolvable issue. If the RAID5 used distributed parity,
> >> >> >> > you could expect wear leveling to wear all the devices evenly, since
> >> >> >> > on average, the # of writes to all devices will be the same. Only a
> >> >> >> > RAID4 setup would see a lopsided amount of writes to a single device.
> >> >> >> >
> >> >> >> > --eric
> >> >> >> >
> >> >> >> > --
> >> >> >> > Eric D. Mudama
> >> >> >> > edmudama@bounceswoosh.org
> >> >> >> >
> >> >> >> > --
> >> >> >> > To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> >> >> >> > the body of a message to majordomo@vger.kernel.org
> >> >> >> > More majordomo info at http://vger.kernel.org/majordomo-info.html
> >> >> >> >
> >> >> >>
> >> >> >> --
> >> >> >> Roberto Spadim
> >> >> >> Spadim Technology / SPAEmpresarial
> >> >> >
> >> >> > --
> >> >> >
> >> >> > piergiorgio

--

piergiorgio
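
A small illustration of the parity argument quoted above may help. It is only a sketch under stated assumptions, not md code: the helper names (xor_blocks, stripe_is_consistent, full_stripes_in_range) are made up for this example, the "drive" is just Python bytes, and it assumes a plain XOR parity over three data chunks. It shows why trimming a single chunk leaves the old parity stale even when the drive returns zeros, why an all-zero stripe is trivially consistent, and how a discard range could be narrowed to whole stripes as suggested in the thread.

```python
from functools import reduce
from operator import xor


def xor_blocks(blocks):
    """XOR equal-length byte blocks column by column (RAID-5 style parity)."""
    return bytes(reduce(xor, column) for column in zip(*blocks))


def stripe_is_consistent(data_chunks, parity):
    """A stripe is consistent when the parity equals the XOR of its data chunks."""
    return xor_blocks(data_chunks) == parity


def full_stripes_in_range(start, length, stripe_size):
    """Return (first_stripe, count) for stripes fully inside [start, start + length)."""
    first = -(-start // stripe_size)       # round the start up to a stripe boundary
    end = (start + length) // stripe_size  # round the end down to a stripe boundary
    return first, max(0, end - first)


# Hypothetical stripe: three data chunks of two bytes each.
data = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
parity = xor_blocks(data)

# TRIM a single chunk; assume the drive returns zeros for it afterwards.
trimmed_one = [b"\x00\x00", data[1], data[2]]
print(stripe_is_consistent(trimmed_one, parity))

# TRIM the whole stripe, parity included, on drives assumed to return zeros.
zeros = [b"\x00\x00"] * 3
print(stripe_is_consistent(zeros, b"\x00\x00"))

# A discard of 20 sectors starting at sector 5, with 8-sector stripes.
print(full_stripes_in_range(5, 20, 8))
```

If the drive returns undefined data after a TRIM instead of zeros, even the whole-stripe case cannot be assumed consistent, which is the concern raised at the top of this message.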