From mboxrd@z Thu Jan 1 00:00:00 1970
From: Nathaniel Rutman
Date: Mon, 11 Feb 2008 12:33:03 -0800
Subject: [Lustre-devel] Lustre HSM HLD draft
In-Reply-To: <47B062BA.1070809@cea.fr>
References: <47AAE307.9040305@cea.fr> <47ACC71C.7050808@sun.com> <47B062BA.1070809@cea.fr>
Message-ID: <47B0B0FF.7060805@sun.com>
List-Id:
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
To: lustre-devel@lists.lustre.org

Aurelien Degremont wrote:
> Nathaniel Rutman wrote:
>
>> 5.1 external storage list - is this to be stored on the MGS device or a
>> separate device? If the coordinator lives on the MGS, why not its
>> storage as well? In any case, it should be possible to co-locate the
>> coordinator on the MGS and use the MGS's storage device, in the same
>> way that the MGS can currently co-locate with the MDT.
>> How does the coordinator request activity from an agent? If the
>> coordinator is the RPC server, then it's up to the agents to make
>> requests; agents aren't listening for RPC requests themselves.
>>
>
> Presently, it is never said that the coordinator will live on the MGS.
> The Coordinator constraints are:
> 1 - Must receive various migration requests from OST/MDT.
> 2 - Should be able to communicate with Agents and ask them to perform
>     migrations.
> 3 - Should store configuration and migration logs.
>
> I think #1 and #2 are two different APIs. The coordinator is clearly an
> RPC server for the first one. How #2 should be implemented is not so
> clear. What would be the "Lustre-way" here?
>
With userspace servers, presumably we have some way of passing LNET
messages from kernel to userspace. We should probably still go through
LNET for #2 in order to use the broadest range of network fabrics. So it
could be the same or similar RPC. There is no "Lustre-way" for this
area - we've never done this kind of thing before.
> For #3, the few logs that will be backed up here are not huge, and it
> surely could be colocated with another Target, but I'm not sure this
> should be mandatory. This device should be available to several servers,
> for failover like the other Targets. We could imagine having more than 1
> coordinator in the long term. I'm not sure it is a good idea to tie it to
> another target.
>
Not mandatory, but possible is nice. Minimize the number of required
partitions.
>
>> 6.3 object ref should include version number. Also include checksum?
>>
>
> For data coherency? Should we add an explicit checksum for those values
> (stored in an EA) or use a possible backend feature (can ZFS and
> ldiskfs detect EA value corruption by themselves?)?
>
ZFS can, ldiskfs cannot. Anyhow, it was just a thought. Doesn't hurt to
allow space for it.
>
>> 2.1 Archiving one Lustre file
>> There should not be a cache miss when archiving a Lustre file; perhaps
>> open-by-fid is intended to bypass atime updates
>> so that the file isn't marked as "recently accessed"?
>>
>
> Transparent access - should this avoid modification of atime/mtime?
>
> I would say yes.
>
>
>> 2.2 Restoring a file
>> "External ID" presumably contains all information required to retrieve
>> the file - tape #, path name, etc?
>> Once file is copied back, we should probably restore original ctime,
>> mtime, atime - coordinator is storing this, correct?
>>
>
> External ID is an opaque value managed by the archiving tool. If the HSM
> can store a lot of metadata, only a ref is needed; if not, the tool is
> responsible for storing all the data it needs. Anyway, this is totally
> opaque for Lustre.
> I hope the HSMs will not need so much data in this field. HPSS does not
> need so much data; it uses its internal DB to store it. I suppose SAM
> does also.
>
What about restore of original ctime, mtime, atime? I think we must
store it in the coordinator because we must work with all HSMs, and I
think it is important to restore it.
>
>> IV2 - why not multiple purged windows? Seems like if you're going to
>> purge 1 object out of a file, you might want to purge more.
>> Specifically, it will probably be a common case to purge every object of
>> a file from a particular OST. This is not contiguous in a
>> striped file.
>> I don't see any reason to purge anything smaller than an entire object
>> on an OST - is there good reason for this?
>>
>
> Multiple purged windows are subtle. If you permit this feature, you could
> technically have, in the worst case, one purged window per byte, and
> this could be very large to store. Do you think you will make several
> holes in the same file? In which cases?
>
Like I said, I don't see any reason to purge anything smaller than a
full object; I would in fact disallow purging of an arbitrary byte
range, and only allow purging on full-object boundaries.
> In fact, the more common case is to totally purge a file which has been
> migrated to the HSM, and it is only an optimisation to keep the start and
> the end of the file on disk, to avoid triggering tons of cache misses
> with commands like "file foo/*" or a tool like Nautilus or Windows
> Explorer browsing the directory.
>
Again, since Lustre is optimized to work with 1MB chunks anyhow, I don't
think it helps much to keep less than that in the beginning / end
objects, so I would say just keep the first and last blocks instead.
> The purged window is stored per object, OST object and MDT object.
> So, if several objects are purged, each object will store its own purged
> window. But the MDT object describing this file will store a special
> purged window which starts at the smallest unavailable byte and ends at
> the first available one. The MDT purged window indicates "if you do I/O
> in this range, you're not sure the data are there." or "Outside of this
> area, I guarantee data are present."
> Maintaining multiple purged windows will be a headache, with no real need
> I think.
> Moreover, people have asked for an OST-object based migration, even if I
> think whole file migration will be the most common case.
>
>
>
> If that's the case, then it
>> the OST must keep track of purged objects, not ranges within an existing
>> object.
>>
>
> Objects are not removed, only their data. All metadata are kept.
>
>
>> If the MDT is tracking purged areas also, then there's a good potential
>> synergy here with a missing OST --
>> If the missing OST's objects are marked as purged, then we can
>> potentially recover them automatically from
>> HSM...
>>
>
> What do you call a "missing OST"? A corrupt one? An offline one?
> Unavailable?
>
Yes. All of the above. Obviously we need to distinguish between
"permanently gone" and "temporarily gone".
> Where will you copy back the object data? On another OST object?
>
Yes. Some kind of recovery will take place to generate a new object on a
different OST and we can restore the data there.
> With the purged window on each OST object and MDT and the file striping
> info, we could easily restore the missing parts.
>
Exactly. This is why I say we should think about this now, to allow for
this possibility.
>
>> 4.2 How is a purge request recovered? For example, MDT says purge obj1
>> from ost1, ost1 replies "ok", but then dies before it actually
>> does the purge. Reboots, doesn't know anything about purge request now,
>> but MDT has marked it as purged.
>>
>
> The OST asynchronously acknowledges the purge when it is done. The MDT
> marks it purged only when it is really done. I will clarify this.
>
>
>> V2.1 How long does OST wait for completion? Is there a timeout? We
>> probably need a "no timeout if progress is being
>> made" kind of function - clients currently do this kind of thing with OSTs.
>>
>
> I'm sure Lustre already has similar mechanisms for optimized timeouts in
> this kind of situation that we could reuse here.
> What you describe is a good approach I think.
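To make the "no timeout if progress is being made" idea from V2.1
concrete, here is a minimal sketch in Python. It is only an
illustration, not how Lustre implements its timeouts: the class name
ProgressTimeout, the idle_limit parameter and the bytes_done counter are
all hypothetical. The waiter gives up only when no forward progress has
been reported for idle_limit seconds, however long the whole copy-in
takes.

```python
import time


class ProgressTimeout:
    """Expire only after `idle_limit` seconds without forward progress.

    Hypothetical helper for illustration; not a Lustre API.
    """

    def __init__(self, idle_limit, clock=time.monotonic):
        self._idle_limit = idle_limit
        self._clock = clock              # injectable so it can be tested
        self._last_progress = clock()    # time of the last real progress
        self._bytes_done = 0             # highest progress value seen

    def report(self, bytes_done):
        """Record a progress report; only an increase resets the timer."""
        if bytes_done > self._bytes_done:
            self._bytes_done = bytes_done
            self._last_progress = self._clock()

    def expired(self):
        """True when nothing has advanced for longer than the idle limit."""
        return self._clock() - self._last_progress > self._idle_limit
```

A waiter would poll expired() while the agent calls report() with its
running byte count; a report that repeats the same count does not extend
the wait.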
>
>
>> V2.2 No need to copy-in purged data on full-object-size writes.
>>
>
> True. We could add such an optimization. But this is only useful for small
> files or very widely striped ones, isn't it?
>
No, we very frequently write entire stripes (objects). Lustre clients
can optimize for this.
>
> Thanks for your comments.
>
>
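The V2.2 point can be sketched as a small check, again in Python and
again only as an illustration. ByteRange and needs_copy_in are
hypothetical names, and treating "the write covers the whole purged
window" as the general form of a full-object write is an assumption of
this sketch, not something the HLD states. The idea: purged data has to
be copied in from the HSM only when a write touches the purged window
without overwriting all of it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ByteRange:
    """Half-open byte range [start, end) within one OST object."""
    start: int
    end: int

    def is_empty(self):
        return self.end <= self.start


def needs_copy_in(write, purged):
    """Return True if purged data must be restored before this write.

    Assumption of this sketch: one purged window per object.
    """
    if write.is_empty() or purged.is_empty():
        return False
    overlaps = write.start < purged.end and purged.start < write.end
    if not overlaps:
        # I/O entirely outside the purged window: data is present.
        return False
    covers = write.start <= purged.start and write.end >= purged.end
    # A write that replaces every purged byte needs none of the old data.
    return not covers
```

With a fully purged 1MB object, a write of the whole object returns
False (no copy-in), while a 4KB write into it returns True.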