Safe Haskell | None |
---|---|
Language | Haskell2010 |
Hyperion.Job
Contents
Synopsis
- data JobEnv = JobEnv {}
- data NodeLauncherConfig = NodeLauncherConfig {}
- type Job = ReaderT JobEnv Process
- setTaskCpus :: NumCPUs -> JobEnv -> JobEnv
- runJobSlurm :: ProgramInfo -> Job a -> Process a
- runJobLocal :: ProgramInfo -> Job a -> IO a
- workerLauncherWithRunCmd :: MonadIO m => FilePath -> ((String, [String]) -> Process ()) -> m (WorkerLauncher JobId)
- withNodeLauncher :: NodeLauncherConfig -> WorkerAddr -> (Maybe (WorkerAddr, WorkerLauncher JobId) -> Process a) -> Process a
- runCmdLocalAsync :: (String, [String]) -> IO ()
- runCmdLocalLog :: (String, [String]) -> IO ()
- withPoolLauncher :: NodeLauncherConfig -> [WorkerAddr] -> ((NumCPUs -> WorkerLauncher JobId) -> Process a) -> Process a
- remoteEvalJobM :: (Static (Binary b), Typeable b) => Cluster (Closure (Job b)) -> Cluster b
- remoteEvalJob :: (Static (Binary b), Typeable b) => Closure (Job b) -> Cluster b
General comments
In this module we define the Job monad. It is nothing more than Process together with a JobEnv environment.
The JobEnv environment represents the environment of a job running under SLURM. We should think of a computation in Job as being run on a node allocated for the job by SLURM and running remote computations on the resources allocated to the job. The JobEnv environment contains
- information about the master program that scheduled the job,
- information about the database used for recording results of the calculations,
- the number of CPUs available per node, as well as the number of CPUs to use for remote computations spawned from the Job computation (jobTaskCpus),
- jobTaskLauncher, which allocates jobTaskCpus CPUs on some node from the resources available to the job and launches a worker on that node. That worker is then allowed to use the allocated number of CPUs. Thanks to jobTaskLauncher, Job is an instance of HasWorkers and we can use functions such as remoteEval.
The common use case is that the Job computation is spawned from a Cluster calculation on the login node via, e.g., remoteEvalJob (which acquires job resources from SLURM). The Job computation then manages the job resources and runs remote computations in the allocation via, e.g., remoteEval.
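A rough sketch of this use case follows. It is not taken from the hyperion sources: myJob and myJobClosure are placeholders, the Hyperion.Cluster import path is assumed, and in real code the closure would be built with hyperion's static-closure machinery rather than left undefined; only the Job and Cluster types and remoteEvalJob come from this interface.

```haskell
import Control.Distributed.Process (Closure)
import Hyperion.Cluster (Cluster)   -- module path assumed
import Hyperion.Job (Job, remoteEvalJob)

-- A Job computation that would normally spawn remote work on the
-- job's nodes (e.g. via remoteEval from the HasWorkers machinery).
myJob :: Job Double
myJob = pure 42

-- Placeholder: in real code this closure is built with hyperion's
-- static-closure machinery (StaticPointers), not with 'undefined'.
myJobClosure :: Closure (Job Double)
myJobClosure = undefined

-- On the login node: acquire a SLURM job and run 'myJob' inside it.
clusterComputation :: Cluster Double
clusterComputation = remoteEvalJob myJobClosure
```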
Documentation
data JobEnv Source #
The environment type for the Job monad.
Constructors
JobEnv | |
Fields
Instances
HasWorkerLauncher JobEnv Source # | Builds a WorkerLauncher from jobTaskLauncher and jobTaskCpus. This makes Job an instance of HasWorkers, so functions such as remoteEval can be used. |
Defined in Hyperion.Job Methods | |
HasDB JobEnv Source # | |
Defined in Hyperion.Job Methods dbConfigLens :: Lens' JobEnv DatabaseConfig Source # | |
HasProgramInfo JobEnv Source # | |
Defined in Hyperion.Job Methods toProgramInfo :: JobEnv -> ProgramInfo Source # |
data NodeLauncherConfig Source #
Configuration for withNodeLauncher.
Constructors
NodeLauncherConfig | |
Fields
setTaskCpus :: NumCPUs -> JobEnv -> JobEnv Source #
Changes jobTaskCpus in JobEnv.
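For example, since Job is just ReaderT JobEnv Process, one way to change jobTaskCpus for a sub-computation is local. This is a sketch; the assumption that NumCPUs is applied to a plain integer is taken from the "NumCPUs 1" mentioned under runJobSlurm below, and the import of NumCPUs from a hyperion module is omitted.

```haskell
import Control.Monad.Reader (local)
import Hyperion.Job (Job, setTaskCpus)
-- NumCPUs is assumed to be importable from another hyperion module.

-- Run a sub-computation whose spawned tasks each get 4 CPUs.
withFourCpuTasks :: Job a -> Job a
withFourCpuTasks = local (setTaskCpus (NumCPUs 4))
```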
runJobSlurm :: ProgramInfo -> Job a -> Process a Source #
Runs the Job monad assuming we are inside a SLURM job. In practice it just fills in the environment JobEnv and calls runReaderT. The environment is mostly constructed from SLURM environment variables and ProgramInfo. The exceptions to these are jobTaskCpus, which is set to NumCPUs 1, and jobTaskLauncher, which is created by withPoolLauncher.
The log file has the form "/a/b/c/progid/serviceid.log". The log directory for the node is obtained by dropping the .log extension: "/a/b/c/progid/serviceid"
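A hedged usage sketch: runMyJob and progInfo are placeholder names, and the ProgramInfo is assumed to have been obtained elsewhere; only the runJobSlurm signature above comes from this module.

```haskell
-- Inside a SLURM allocation: run a Job computation from the Process monad.
runMyJob :: ProgramInfo -> Process Double
runMyJob progInfo = runJobSlurm progInfo $ do
  -- remote computations on the job's resources would go here
  pure 1.0
```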
runJobLocal :: ProgramInfo -> Job a -> IO a Source #
Runs the Job
locally in IO without using any information from a
SLURM environment, with some basic default settings. This function
is provided primarily for testing.
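For testing, one might write something like the following sketch; testMain is a placeholder name, and the ProgramInfo argument is assumed to be constructed elsewhere.

```haskell
testMain :: ProgramInfo -> IO ()
testMain myProgramInfo = do
  result <- runJobLocal myProgramInfo $
    -- any Job computation; here just a pure value
    pure (2 + 2 :: Int)
  print result
```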
workerLauncherWithRunCmd :: MonadIO m => FilePath -> ((String, [String]) -> Process ()) -> m (WorkerLauncher JobId) Source #
A WorkerLauncher that uses the supplied command runner to launch workers. Sets connectionTimeout to Nothing. Uses the ServiceId supplied to withLaunchedWorker to construct a JobId (through JobName). The supplied FilePath is used as the log directory for the worker, with the log file name derived from the ServiceId.
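A sketch of building a launcher that runs worker commands in the local shell, logging each command via runCmdLocalLog; localLauncher is a placeholder name and "/tmp/hyperion-logs" is an arbitrary example path.

```haskell
import Control.Monad.IO.Class (liftIO)

localLauncher :: Process (WorkerLauncher JobId)
localLauncher =
  workerLauncherWithRunCmd "/tmp/hyperion-logs" (liftIO . runCmdLocalLog)
```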
withNodeLauncher :: NodeLauncherConfig -> WorkerAddr -> (Maybe (WorkerAddr, WorkerLauncher JobId) -> Process a) -> Process a Source #
Given a NodeLauncherConfig and a WorkerAddr, runs the continuation, passing it Maybe a pair (WorkerAddr, WorkerLauncher JobId). Passing Nothing represents ssh failure.
While the WorkerAddr is preserved, the passed WorkerLauncher launches workers on the node at WorkerAddr. The launcher is derived from workerLauncherWithRunCmd, where the command runner is either the local shell (if WorkerAddr is LocalHost) or a RemoteFunction that runs the local shell on WorkerAddr via withRemoteRunProcess and related functions (if WorkerAddr is RemoteAddr).
Note that the process of launching a worker on the remote node will actually spawn a "utility" worker there that will launch all new workers in the continuation. This utility worker will have its log in the log dir, identified by some random ServiceId, and its log will contain messages like "Running command ...".
The reason that utility workers are used on each Job node is to
minimize the number of calls to ssh
or srun
. The naive way to
launch workers in the Job
monad would be to determine what node
they should be run on, and run the hyperion worker command via
ssh
. Unfortunately, many clusters have flakey ssh
configurations that start throwing errors if ssh
is called too
many times in quick succession. ssh
also has to perform
authentication. Experience shows that srun
is also not a good
solution to this problem, since srun
talks to SLURM
to manage
resources and this can take a long time, affecting
performance. Instead, we ssh
exactly once to each node in the Job
(besides the head node), and start utility workers there. These
workers can then communicate with the head node via the usual
machinery of hyperion
--- effectively, we keep a connection open
to each node so that we no longer have to use ssh
.
runCmdLocalAsync :: (String, [String]) -> IO () Source #
Run the given command in a child thread. Async.link ensures that exceptions from the child are propagated to the parent.
NB: Previously, this function used createProcess
and discarded the resulting ProcessHandle
. This could result in
"insufficient resource" errors for OS threads. Hopefully the
current implementation avoids this problem.
runCmdLocalLog :: (String, [String]) -> IO () Source #
Run the given command and log the command. This is suitable for running on remote machines so we can keep track of what is being run where.
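For instance (assumed invocations; the command and arguments are arbitrary):

```haskell
-- Fire-and-forget in a child thread, with exceptions propagated via Async.link:
launchSleep :: IO ()
launchSleep = runCmdLocalAsync ("sleep", ["10"])

-- Run the command and also log it:
logSleep :: IO ()
logSleep = runCmdLocalLog ("sleep", ["10"])
```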
withPoolLauncher :: NodeLauncherConfig -> [WorkerAddr] -> ((NumCPUs -> WorkerLauncher JobId) -> Process a) -> Process a Source #
Takes a NodeLauncherConfig and a list of addresses. Tries to start "worker-launcher" workers on these addresses (see withNodeLauncher). Discards addresses on which this fails. From the remaining addresses, builds a worker CPU pool. The continuation is then passed a function that launches workers in this pool. The WorkerLaunchers that the continuation gets have connectionTimeout set to Nothing.