From: Ben Evans
Date: Fri, 20 Jan 2017 22:00:21 +0000
Subject: [lustre-devel] Proposal for JobID caching
In-Reply-To: <9BEFFD88-0537-43AB-8352-6477F30906DA@intel.com>
References: <6E2CFE03-A158-4D82-82BA-AF0A175AA358@intel.com> <9BEFFD88-0537-43AB-8352-6477F30906DA@intel.com>
To: lustre-devel@lists.lustre.org

On 1/20/17, 4:50 PM, "Dilger, Andreas" wrote:

> On Jan 18, 2017, at 13:39, Oleg Drokin wrote:
>>
>> On Jan 18, 2017, at 3:08 PM, Ben Evans wrote:
>>
>>> Overview
>>> The Lustre filesystem added the ability to track I/O performance of
>>> a job across a cluster. The initial algorithm was relatively
>>> simplistic: for every I/O, look up the job ID of the process and
>>> include it in the RPC being sent to the server. This imposed a
>>> non-trivial performance impact on client I/O performance.
>>> An additional algorithm was introduced to handle the single job per
>>> node case, where instead of looking up the job ID of the process,
>>> Lustre simply accesses the value of a variable set through the proc
>>> interface. This improved performance greatly, but only functions
>>> when a single job is being run.
>>> A new approach is needed for multiple job per node systems.
>>>
>>> Proposed Solution
>>> The proposed solution to this is to create a small PID->JobID table
>>> in kernel memory. When a process performs an IO, a lookup is done
>>> in the table for the PID, if a JobID exists for that PID, it is
>>> used, otherwise it is retrieved via the same methods as the
>>> original Jobstats algorithm. Once located the JobID is stored in a
>>> PID/JobID table in memory. The existing cfs_hash_table structure
>>> and functions will be used to implement the table.
>>>
>>> Rationale
>>> This reduces the number of calls into userspace, minimizing the
>>> time taken on each I/O.
>>> It also easily supports multiple job per node scenarios, and like
>>> other proposed solutions has no issue with multiple jobs performing
>>> I/O on the same file at the same time.
>>>
>>> Requirements
>>> * Performance cannot significantly detract from baseline
>>>   performance without jobstats
>>> * Supports multiple jobs per node
>>> * Coordination with the scheduler is not required, but interfaces
>>>   may be provided
>>> * Supports multiple PIDs per job
>>>
>>> New Data Structures
>>> pid_to_jobid {
>>>         struct hlist_node pj_hash;
>>>         u64 pj_pid;
>>>         char pj_jobid[LUSTRE_JOBID_SIZE];
>>>         spinlock_t jp_lock;
>>>         time_t jp_time;
>>> }
>>>
>>> Proc Variables
>>> Writing to /proc/fs/lustre/jobid_name while not in "nodelocal" mode
>>> will cause all entries in the cache for that jobID to be removed
>>> from the cache
>>>
>>> Populating the Cache
>>> When lustre_get_jobid is called in the cached mode, a check will
>>> first be done in the cache for a valid PID to JobID mapping for the
>>> process. If none exists, it uses the same mechanisms to get the
>>> JobID and populates the appropriate PID to JobID map.
>>> If a lookup is performed and the PID to JobID mapping exists, but
>>> is more than 30 seconds old, the JobID is refreshed.
>>>
>>> Purging the Cache
>>> The cache can be purged of a specific job by writing the JobID to
>>> the jobid_name proc file. Any items in the cache that are more than
>>> 300 seconds out of date will also be purged at this time.
>>
>> I'd much rather prefer you go to the table that's populated outside
>> of the kernel somehow.
>> Let's be realistic, poking around in userspace process environments
>> for random strings is not such a great idea at all even though it
>> did look like a good idea in the past for simplicity reasons.
>> Similar to nodelocal, we probably just switch to a method where you
>> call a particular lctl command that would mark the whole session as
>> belonging to some job. This might take several forms, e.g.
>> nodelocal itself could be extended to only apply to a current
>> namespace/container
>> But if you do really run different jobs in the global namespace, we
>> can probably just make lctl spawn a shell with commands that would
>> all be marked as a particular job? Or we can probably trace the
>> parent of lctl and mark that so that all its children become
>> somehow marked too.
>
> Having lctl spawn a shell or requiring everything to run in a
> container is impractical for users, and will just make it harder to
> use JobID, IMHO. The job scheduler is _already_ storing the JobID in
> the process environment so that it is available to all of the threads
> running as part of the job. The question is how the job prolog script
> can communicate the JobID directly to Lustre without using a global
> /proc file? Doing an upcall to userspace per JobID lookup is going to
> be *worse* for performance than the current searching through the
> process environment.
>
> I'm not against Ben's proposal to implement a cache in the kernel for
> different processes. It is unfortunate that we can't have proper
> thread-local storage for Lustre, so a hash table is probably
> reasonable for this (there may be thousands of threads involved). I
> don't think the cl_env struct would be useful, since it is not tied
> to a specific thread (AFAIK), but rather assigned as different
> threads enter/exit kernel context. Note that we already have similar
> time-limited caches for the identity upcall and FMD
> (lustre/ofd/ofd_fmd.c), so it may be useful to see whether the code
> can be shared.

I'll take a look at those, but implementing the hash table was a pretty
simple solution. I need to work out a few kinks with memory leaks
before doing real performance tests on it to make sure it performs
similarly to nodelocal.
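
For reference, the PID->JobID cache from the proposal quoted above can be sketched in plain userspace C roughly as follows. This is an illustration only, not the patch under discussion: the function names (jobid_cache_lookup, jobid_cache_store, jobid_cache_purge), JOBID_SIZE and CACHE_BUCKETS are hypothetical, a simple chained table stands in for cfs_hash, locking is left out, and the caller passes in the current time so the 30 second and 300 second limits are easy to exercise.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define JOBID_SIZE    32   /* stand-in for LUSTRE_JOBID_SIZE (assumed value) */
#define CACHE_BUCKETS 64   /* illustrative table size */
#define REFRESH_SECS  30   /* older entries must be re-fetched */
#define PURGE_SECS    300  /* older entries are dropped on purge */

struct pid_to_jobid {
	struct pid_to_jobid *pj_next;  /* hash chain (hlist_node in the kernel) */
	uint64_t             pj_pid;
	char                 pj_jobid[JOBID_SIZE];
	time_t               pj_time;  /* when pj_jobid was last stored */
};

static struct pid_to_jobid *cache[CACHE_BUCKETS];

static unsigned int bucket_of(uint64_t pid)
{
	return (unsigned int)(pid % CACHE_BUCKETS);
}

/* Return the cached JobID for pid, or NULL if absent or stale. */
const char *jobid_cache_lookup(uint64_t pid, time_t now)
{
	struct pid_to_jobid *pj;

	for (pj = cache[bucket_of(pid)]; pj != NULL; pj = pj->pj_next) {
		if (pj->pj_pid == pid)
			return (now - pj->pj_time > REFRESH_SECS) ?
				NULL : pj->pj_jobid;
	}
	return NULL;
}

/* Insert or refresh pid -> jobid. Returns 0, or -1 if out of memory. */
int jobid_cache_store(uint64_t pid, const char *jobid, time_t now)
{
	struct pid_to_jobid *pj;
	unsigned int b = bucket_of(pid);

	assert(jobid != NULL);
	for (pj = cache[b]; pj != NULL; pj = pj->pj_next)
		if (pj->pj_pid == pid)
			break;
	if (pj == NULL) {
		pj = calloc(1, sizeof(*pj));
		if (pj == NULL)
			return -1;
		pj->pj_pid = pid;
		pj->pj_next = cache[b];
		cache[b] = pj;
	}
	strncpy(pj->pj_jobid, jobid, JOBID_SIZE - 1);
	pj->pj_jobid[JOBID_SIZE - 1] = '\0';
	pj->pj_time = now;
	return 0;
}

/*
 * Drop every entry for jobid (if non-NULL) plus any entry older than
 * PURGE_SECS. Returns the number of entries freed.
 */
int jobid_cache_purge(const char *jobid, time_t now)
{
	int freed = 0;
	unsigned int b;

	for (b = 0; b < CACHE_BUCKETS; b++) {
		struct pid_to_jobid **pp = &cache[b];

		while (*pp != NULL) {
			struct pid_to_jobid *pj = *pp;

			if ((jobid != NULL &&
			     strcmp(pj->pj_jobid, jobid) == 0) ||
			    now - pj->pj_time > PURGE_SECS) {
				*pp = pj->pj_next;
				free(pj);
				freed++;
			} else {
				pp = &pj->pj_next;
			}
		}
	}
	return freed;
}
```

A caller would try jobid_cache_lookup() first and, on NULL, fetch the JobID by the existing method and hand it to jobid_cache_store(). A write to jobid_name would map to jobid_cache_purge().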
> Another (not very nice) option to avoid looking through the
> environment variables (which IMHO isn't so bad, even though the
> upstream folks don't like it) is to associate the JobID set via /proc
> with a process group internally and look the PGID up in the kernel to
> find the JobID. That can be repeated each time a new JobID is set via
> /proc, since the PGID would stick around for each new
> job/shell/process created under the PGID. It won't be as robust as
> looking up the JobID in the environment, but probably good enough for
> most uses.
>
> I would definitely also be in favor of having some way to fall back
> to procname_uid if the PGID cannot be found, the job environment
> variable is not available, and there is nothing in nodelocal.

That's simple enough.
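
A rough sketch of that fallback chain in userspace C, to make the order concrete. Everything here is an assumption for illustration: the struct and function names are hypothetical, the priority among PGID, environment and nodelocal is just one plausible choice, and the "procname.uid" output format is assumed rather than taken from the Lustre source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define JOBID_SIZE 32  /* stand-in for LUSTRE_JOBID_SIZE (assumed value) */

/* Candidate JobIDs; NULL or "" means that source had nothing. */
struct jobid_sources {
	const char *pgid_jobid;       /* JobID tied to the process group */
	const char *env_jobid;        /* JobID from the job environment */
	const char *nodelocal_jobid;  /* JobID written via /proc */
};

static int usable(const char *s)
{
	return s != NULL && s[0] != '\0';
}

/*
 * Write the first usable JobID into out, in the (assumed) order PGID,
 * environment, nodelocal. If none is usable, fall back to
 * "<procname>.<uid>". Returns out.
 */
const char *jobid_resolve(const struct jobid_sources *src,
			  const char *procname, unsigned int uid,
			  char *out, size_t outlen)
{
	const char *found = NULL;

	assert(src != NULL && procname != NULL);
	assert(out != NULL && outlen > 0);

	if (usable(src->pgid_jobid))
		found = src->pgid_jobid;
	else if (usable(src->env_jobid))
		found = src->env_jobid;
	else if (usable(src->nodelocal_jobid))
		found = src->nodelocal_jobid;

	if (found != NULL)
		snprintf(out, outlen, "%s", found);
	else
		snprintf(out, outlen, "%s.%u", procname, uid);
	return out;
}
```

With all three sources empty, a process named dd running as uid 500 would be reported as dd.500 under these assumptions.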