hyperion-0.1.0.0
Safe Haskell: None
Language: Haskell2010

Hyperion.Job


General comments

In this module we define the Job monad. It is nothing more than Process together with a JobEnv environment.

The JobEnv environment represents the environment of a job running under SLURM. We should think of a computation in Job as running on a node that SLURM has allocated to the job, and as spawning remote computations on the resources allocated to that job. The JobEnv environment contains

  • information about the master program that scheduled the job,
  • information about the database used for recording results of the calculations,
  • the number of CPUs available per node, as well as the number of CPUs to use for remote computations spawned from the Job computation (jobTaskCpus),
  • jobTaskLauncher, which allocates jobTaskCpus CPUs on some node from the resources available to the job and launches a worker on that node. That worker is then allowed to use the allocated number of CPUs. Thanks to jobTaskLauncher, Job is an instance of HasWorkers and we can use functions such as remoteEval.

The common use case is that the Job computation is spawned from a Cluster computation on the login node via, e.g., remoteEvalJob (which acquires job resources from SLURM). The Job computation then manages the job's resources and runs remote computations in the allocation via, e.g., remoteEval.
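For example, here is a minimal sketch of a Job action. The action itself is hypothetical; in a real program it would be wrapped in a Closure and spawned from Cluster via remoteEvalJob. It assumes only that jobTaskCpus is a field of JobEnv (as described above) and that the NumCPUs constructor is in scope.

import Control.Monad.IO.Class (liftIO)
import Control.Monad.Reader (asks)

-- Hypothetical Job action: report the per-task CPU allocation recorded in JobEnv.
reportTaskCpus :: Job ()
reportTaskCpus = do
  NumCPUs n <- asks jobTaskCpus
  liftIO $ putStrLn ("Remote computations spawned from this Job get " ++ show n ++ " CPUs each")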

Documentation

 

data JobEnv Source #

The environment type for the Job monad.

Constructors

JobEnv 

Fields

Instances

HasWorkerLauncher JobEnv Source #

Make JobEnv an instance of HasWorkerLauncher. The WorkerLauncher returned by toWorkerLauncher launches workers with jobTaskCpus CPUs available to them.

This makes Job an instance of HasWorkers and gives us access to functions in Hyperion.Remote.


Defined in Hyperion.Job

HasDB JobEnv Source #

Make JobEnv an instance of HasDB.


Defined in Hyperion.Job

HasProgramInfo JobEnv Source # 

Defined in Hyperion.Job

data NodeLauncherConfig Source #

Configuration for withNodeLauncher.

Constructors

NodeLauncherConfig 

Fields

type Job = ReaderT JobEnv Process Source #

The Job monad is simply Process with a JobEnv environment.

runJobSlurm :: ProgramInfo -> Job a -> Process a Source #

Runs the Job monad assuming we are inside a SLURM job. In practice it just fills in the JobEnv environment and calls runReaderT. The environment is mostly constructed from SLURM environment variables and ProgramInfo. The exceptions are jobTaskCpus, which is set to NumCPUs 1, and jobTaskLauncher, which is created by withPoolLauncher. The log file has the form "/a/b/c/progid/serviceid.log". The log directory for the node is obtained by dropping the .log extension: "/a/b/c/progid/serviceid".
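To illustrate the path convention above (nothing hyperion-specific; the helper name is made up):

import System.FilePath (dropExtension)

-- The node log directory is the job's log file with its ".log" extension dropped.
nodeLogDirFor :: FilePath -> FilePath
nodeLogDirFor = dropExtension

-- >>> nodeLogDirFor "/a/b/c/progid/serviceid.log"
-- "/a/b/c/progid/serviceid"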

runJobLocal :: ProgramInfo -> Job a -> IO a Source #

Runs the Job locally in IO without using any information from a SLURM environment, with some basic default settings. This function is provided primarily for testing.

workerLauncherWithRunCmd :: MonadIO m => FilePath -> ((String, [String]) -> Process ()) -> m (WorkerLauncher JobId) Source #

A WorkerLauncher that uses the supplied command runner to launch workers. Sets connectionTimeout to Nothing. Uses the ServiceId supplied to withLaunchedWorker to construct the JobId (through JobName). The supplied FilePath is used as the log directory for the worker, with the log file name derived from the ServiceId.
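For instance, a launcher that starts its workers on the local machine could be assembled from this function together with runCmdLocalAsync (documented below). This is only a hedged sketch; the log directory is a placeholder.

import Control.Distributed.Process (Process)
import Control.Monad.IO.Class (liftIO)

-- Sketch: a WorkerLauncher whose workers are started by running the worker
-- command locally in a child thread, with logs under the given directory.
localLauncher :: Process (WorkerLauncher JobId)
localLauncher =
  workerLauncherWithRunCmd "/path/to/logdir" (liftIO . runCmdLocalAsync)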

withNodeLauncher :: NodeLauncherConfig -> WorkerAddr -> (Maybe (WorkerAddr, WorkerLauncher JobId) -> Process a) -> Process a Source #

Given a NodeLauncherConfig and a WorkerAddr, runs the continuation, passing it Maybe a pair (WorkerAddr, WorkerLauncher JobId). Passing Nothing represents ssh failure.

While the WorkerAddr is preserved, the passed WorkerLauncher launches workers on the node at WorkerAddr. The launcher is derived from workerLauncherWithRunCmd, where the command runner is either the local shell (if WorkerAddr is LocalHost) or a RemoteFunction that runs the local shell on WorkerAddr via withRemoteRunProcess and related functions (if WorkerAddr is RemoteAddr).

Note that the process of launching a worker on the remote node will actually spawn a "utility" worker there that will launch all new workers in the continuation. This utility worker will have its log in the log dir, identified by some random ServiceId, and will log messages like "Running command ...".

The reason that utility workers are used on each Job node is to minimize the number of calls to ssh or srun. The naive way to launch workers in the Job monad would be to determine which node they should run on and run the hyperion worker command there via ssh. Unfortunately, many clusters have flaky ssh configurations that start throwing errors if ssh is called too many times in quick succession. ssh also has to perform authentication. Experience shows that srun is not a good solution to this problem either, since srun talks to SLURM to manage resources, and this can take a long time, affecting performance. Instead, we ssh exactly once to each node in the Job (besides the head node) and start utility workers there. These workers can then communicate with the head node via the usual hyperion machinery: effectively, we keep a connection open to each node so that we no longer have to use ssh.
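A hedged sketch of calling withNodeLauncher directly (withPoolLauncher below builds on it, so most code does not call it explicitly). say is from Control.Distributed.Process, and a Show instance for WorkerAddr is assumed.

import Control.Distributed.Process (Process, say)

-- Try to set up a launcher on a single node, reporting ssh failure.
launchOnNode :: NodeLauncherConfig -> WorkerAddr -> Process ()
launchOnNode cfg addr =
  withNodeLauncher cfg addr $ \result -> case result of
    Nothing           -> say ("could not reach " ++ show addr)
    Just (addr', _wl) -> say ("utility worker running on " ++ show addr')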

runCmdLocalAsync :: (String, [String]) -> IO () Source #

Run the given command in a child thread. Async.link ensures that exceptions from the child are propagated to the parent.

NB: Previously, this function used createProcess and discarded the resulting ProcessHandle. This could result in "insufficient resource" errors for OS threads. Hopefully the current implementation avoids this problem.
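The pattern described above looks roughly as follows (a sketch, not necessarily hyperion's exact implementation):

import qualified Control.Concurrent.Async as Async
import System.Process (callProcess)

-- Run the command in a child thread; linking the Async re-throws any
-- exception from the child in the calling thread.
runCmdInChildThread :: (String, [String]) -> IO ()
runCmdInChildThread (cmd, args) = do
  child <- Async.async (callProcess cmd args)
  Async.link child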

runCmdLocalLog :: (String, [String]) -> IO () Source #

Run the given command and log the command. This is suitable for running on remote machines so that we can keep track of what is being run where.

withPoolLauncher :: NodeLauncherConfig -> [WorkerAddr] -> ((NumCPUs -> WorkerLauncher JobId) -> Process a) -> Process a Source #

Takes a NodeLauncherConfig and a list of addresses. Tries to start "worker-launcher" workers on these addresses (see withNodeLauncher). Discards the addresses on which this fails. From the remaining addresses it builds a worker CPU pool. The continuation is then passed a function that launches workers in this pool. The WorkerLaunchers that the continuation gets have connectionTimeout set to Nothing.
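A hedged usage sketch (it assumes the NumCPUs constructor is in scope and uses say from Control.Distributed.Process; in real code the launcher would be used to spawn remote computations before the pool is torn down):

import Control.Distributed.Process (Process, say)

-- Build a CPU pool over the given addresses and obtain a launcher whose
-- workers each get 2 CPUs.
poolSketch :: NodeLauncherConfig -> [WorkerAddr] -> Process ()
poolSketch cfg addrs =
  withPoolLauncher cfg addrs $ \mkLauncher -> do
    let _launcher = mkLauncher (NumCPUs 2)
    say "worker CPU pool is ready"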

remoteEvalJobM :: (Static (Binary b), Typeable b) => Cluster (Closure (Job b)) -> Cluster b Source #

remoteEvalJob :: (Static (Binary b), Typeable b) => Closure (Job b) -> Cluster b Source #