| Safe Haskell | None |
|---|---|
| Language | Haskell2010 |
Hyperion.Job
Synopsis
- data JobEnv = JobEnv {}
- data NodeLauncherConfig = NodeLauncherConfig {}
- type Job = ReaderT JobEnv Process
- setTaskCpus :: NumCPUs -> JobEnv -> JobEnv
- runJobSlurm :: ProgramInfo -> Job a -> Process a
- runJobLocal :: ProgramInfo -> Job a -> IO a
- workerLauncherWithRunCmd :: MonadIO m => FilePath -> ((String, [String]) -> Process ()) -> m (WorkerLauncher JobId)
- withNodeLauncher :: NodeLauncherConfig -> WorkerAddr -> (Maybe (WorkerAddr, WorkerLauncher JobId) -> Process a) -> Process a
- runCmdLocalAsync :: (String, [String]) -> IO ()
- runCmdLocalLog :: (String, [String]) -> IO ()
- withPoolLauncher :: NodeLauncherConfig -> [WorkerAddr] -> ((NumCPUs -> WorkerLauncher JobId) -> Process a) -> Process a
- remoteEvalJobM :: (Static (Binary b), Typeable b) => Cluster (Closure (Job b)) -> Cluster b
- remoteEvalJob :: (Static (Binary b), Typeable b) => Closure (Job b) -> Cluster b
General comments
In this module we define the Job monad. It is nothing more than Process
together with a JobEnv environment.
The JobEnv environment represents the environment of a job running under
SLURM. We should think of a computation in Job as running on a
node allocated for the job by SLURM, and spawning remote computations on the
resources allocated to the job. The JobEnv environment
contains
- information about the master program that scheduled the job,
- information about the database used for recording results of the calculations,
- the number of CPUs available per node, as well as the number of CPUs to
use for remote computations spawned from the Job computation (jobTaskCpus),
- jobTaskLauncher, which allocates jobTaskCpus CPUs on some node from the
resources available to the job and launches a worker on that node. That
worker is then allowed to use the allocated number of CPUs.

Thanks to jobTaskLauncher, Job is an instance of HasWorkers and we can use
functions such as remoteEval.
The common use case is that the Job computation is spawned from a Cluster
calculation on the login node via, e.g., remoteEvalJob (which acquires job
resources from SLURM). The Job computation then manages the job resources
and runs remote computations in the allocation via, e.g., remoteEval.
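As a rough sketch of this use case (illustrative only; building the Closure (Job b) is done elsewhere with hyperion's static machinery and is not covered by this module):

```haskell
-- Sketch of the common use case described above.  The Closure (Job b)
-- argument is assumed to be constructed elsewhere.
runJobOnCluster :: (Static (Binary b), Typeable b) => Closure (Job b) -> Cluster b
runJobOnCluster jobClosure =
  -- Acquire a SLURM allocation for the job and run the Job computation in it.
  remoteEvalJob jobClosure
```

Inside the Job computation, functions such as remoteEval (available because Job is an instance of HasWorkers) can then be used to run computations on the nodes of the allocation.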
Documentation
data JobEnv Source #

The environment type for the Job monad.
Constructors
| JobEnv | |
Fields
Instances
| HasWorkerLauncher JobEnv Source # | This makes Job an instance of HasWorkers, so we can use functions such as remoteEval. |
Defined in Hyperion.Job Methods | |
| HasDB JobEnv Source # | |
Defined in Hyperion.Job Methods dbConfigLens :: Lens' JobEnv DatabaseConfig Source # | |
| HasProgramInfo JobEnv Source # | |
Defined in Hyperion.Job Methods toProgramInfo :: JobEnv -> ProgramInfo Source # | |
data NodeLauncherConfig Source #
Configuration for withNodeLauncher.
Constructors
| NodeLauncherConfig | |
Fields
setTaskCpus :: NumCPUs -> JobEnv -> JobEnv Source #
Changes jobTaskCpus in JobEnv.
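Since Job is just ReaderT JobEnv Process, a natural way to use setTaskCpus is with local from Control.Monad.Reader. The helper below is a sketch, not part of this module; it assumes the NumCPUs constructor is in scope.

```haskell
import Control.Monad.Reader (local)

-- Run 'task' with jobTaskCpus set to 4, so that remote computations it
-- spawns are each allocated 4 CPUs.
withFourCpusPerTask :: Job a -> Job a
withFourCpusPerTask task = local (setTaskCpus (NumCPUs 4)) task
```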
runJobSlurm :: ProgramInfo -> Job a -> Process a Source #
Runs the Job monad assuming we are inside a SLURM job. In
practice it just fills in the environment JobEnv and calls
runReaderT. The environment is mostly constructed from SLURM
environment variables and ProgramInfo. The exceptions are
jobTaskCpus, which is set to NumCPUs 1, and
jobTaskLauncher, which is created by withPoolLauncher.
The log file has the form "/a/b/c/progid/serviceid.log"
. The log directory for the node is obtained by dropping
the .log extension: "/a/b/c/progid/serviceid"
runJobLocal :: ProgramInfo -> Job a -> IO a Source #
Runs the Job locally in IO without using any information from a
SLURM environment, with some basic default settings. This function
is provided primarily for testing.
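For example, a small test harness might look like the following sketch (how to obtain a ProgramInfo is not covered here; it is assumed to be constructed elsewhere):

```haskell
-- Run a Job computation locally and print its result.
testJobLocally :: Show a => ProgramInfo -> Job a -> IO ()
testJobLocally progInfo job = do
  result <- runJobLocal progInfo job
  print result
```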
workerLauncherWithRunCmd :: MonadIO m => FilePath -> ((String, [String]) -> Process ()) -> m (WorkerLauncher JobId) Source #
A WorkerLauncher that uses the supplied command runner to launch
workers. Sets connectionTimeout to Nothing. Uses the
ServiceId supplied to withLaunchedWorker to construct a JobId
(through JobName). The supplied FilePath is used as the log
directory for the worker, with the log file name derived from the
ServiceId.
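For instance, a launcher that starts workers on the local machine could be built by combining this with runCmdLocalAsync. This is a sketch only; localWorkerLauncher is a hypothetical helper and the log directory is an arbitrary path.

```haskell
import Control.Monad.IO.Class (liftIO)

-- A WorkerLauncher that runs worker commands in local child threads,
-- writing logs under the given directory.
localWorkerLauncher :: FilePath -> Process (WorkerLauncher JobId)
localWorkerLauncher logDir =
  workerLauncherWithRunCmd logDir (liftIO . runCmdLocalAsync)
```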
withNodeLauncher :: NodeLauncherConfig -> WorkerAddr -> (Maybe (WorkerAddr, WorkerLauncher JobId) -> Process a) -> Process a Source #
Given a NodeLauncherConfig and a WorkerAddr, runs the continuation,
passing it either Just a pair (WorkerAddr, WorkerLauncher JobId) or
Nothing. Nothing represents ssh failure.
While the WorkerAddr is preserved, the passed WorkerLauncher
launches workers on the node at that WorkerAddr. The launcher is
derived from workerLauncherWithRunCmd, where the command runner is
either the local shell (if WorkerAddr is LocalHost) or a
RemoteFunction that runs the local shell on the remote node via
withRemoteRunProcess and related functions (if WorkerAddr is
RemoteAddr).
Note that the process of launching a worker on the remote node will
actually spawn a "utility" worker there that will launch all new
workers in the continuation. This utility worker will have its log
in the log dir, identified by some random ServiceId, and will log
messages like "Running command ...".
The reason that utility workers are used on each Job node is to
minimize the number of calls to ssh or srun. The naive way to
launch workers in the Job monad would be to determine what node
they should be run on, and run the hyperion worker command via
ssh. Unfortunately, many clusters have flaky ssh
configurations that start throwing errors if ssh is called too
many times in quick succession. ssh also has to perform
authentication. Experience shows that srun is also not a good
solution to this problem, since srun talks to SLURM to manage
resources and this can take a long time, affecting
performance. Instead, we ssh exactly once to each node in the Job
(besides the head node), and start utility workers there. These
workers can then communicate with the head node via the usual
machinery of hyperion --- effectively, we keep a connection open
to each node so that we no longer have to use ssh.
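A minimal sketch of calling withNodeLauncher directly (in practice withPoolLauncher is usually used instead); the NodeLauncherConfig and WorkerAddr are assumed to be obtained elsewhere:

```haskell
-- Try to start a node launcher on 'addr' and handle both outcomes.
launchOnNode :: NodeLauncherConfig -> WorkerAddr -> Process ()
launchOnNode cfg addr =
  withNodeLauncher cfg addr $ \result -> case result of
    Just (_addr, _launcher) ->
      -- '_launcher' can now be handed to code that spawns workers on this node.
      pure ()
    Nothing ->
      -- ssh to the node failed; skip it.
      pure ()
```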
runCmdLocalAsync :: (String, [String]) -> IO () Source #
Run the given command in a child thread. Async.link ensures that exceptions from the child are propagated to the parent.
NB: Previously, this function used createProcess
and discarded the resulting ProcessHandle. This could result in
"insufficient resource" errors for OS threads. Hopefully the
current implementation avoids this problem.
runCmdLocalLog :: (String, [String]) -> IO () Source #
Run the given command, logging the command being run. This is suitable for running on remote machines so that we can keep track of what is being run where.
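Both command runners take a pair of an executable name and its argument list. For example (a sketch; the commands are arbitrary):

```haskell
-- Log and run 'hostname' with no arguments on the current machine.
logHostname :: IO ()
logHostname = runCmdLocalLog ("hostname", [])

-- Start a long-running command in a linked child thread.
startSleep :: IO ()
startSleep = runCmdLocalAsync ("sleep", ["10"])
```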
withPoolLauncher :: NodeLauncherConfig -> [WorkerAddr] -> ((NumCPUs -> WorkerLauncher JobId) -> Process a) -> Process a Source #
Takes a NodeLauncherConfig and a list of addresses. Tries to
start "worker-launcher" workers on these addresses (see
withNodeLauncher). Discards addresses on which this
fails. From the remaining addresses, builds a worker CPU pool. The
continuation is then passed a function that launches workers in
this pool. The WorkerLaunchers that the continuation gets have
connectionTimeout set to Nothing.
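A sketch of how the continuation might use the pool (the NodeLauncherConfig and the address list are assumed to come from the SLURM environment, as in runJobSlurm; the NumCPUs constructor is assumed to be in scope):

```haskell
-- Build a CPU pool over 'addrs' and obtain a launcher that allocates
-- 2 CPUs per worker.
usePool :: NodeLauncherConfig -> [WorkerAddr] -> Process ()
usePool cfg addrs =
  withPoolLauncher cfg addrs $ \mkLauncher -> do
    let _launcher = mkLauncher (NumCPUs 2)
    -- '_launcher' can now be used to spawn workers on nodes in the pool.
    pure ()
```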