Safe Haskell | None |
---|---|
Language | Haskell2010 |
Hyperion.Cluster
Synopsis
- data ProgramInfo = ProgramInfo {}
- data ClusterEnv = ClusterEnv {}
- class HasProgramInfo a where
- toProgramInfo :: a -> ProgramInfo
- type Cluster = ReaderT ClusterEnv Process
- data MPIJob = MPIJob {
    mpiNodes :: Int,
    mpiNTasksPerNode :: Int
  }
- runCluster :: ClusterEnv -> Cluster a -> IO a
- modifyJobOptions :: (SbatchOptions -> SbatchOptions) -> ClusterEnv -> ClusterEnv
- setJobOptions :: SbatchOptions -> ClusterEnv -> ClusterEnv
- setJobTime :: NominalDiffTime -> ClusterEnv -> ClusterEnv
- setJobMemory :: Text -> ClusterEnv -> ClusterEnv
- setJobType :: MPIJob -> ClusterEnv -> ClusterEnv
- setSlurmPartition :: Text -> ClusterEnv -> ClusterEnv
- setSlurmConstraint :: Text -> ClusterEnv -> ClusterEnv
- setSlurmAccount :: Text -> ClusterEnv -> ClusterEnv
- setSlurmQos :: Text -> ClusterEnv -> ClusterEnv
- defaultDBRetries :: Int
- dbConfigFromProgramInfo :: ProgramInfo -> IO DatabaseConfig
- runDBWithProgramInfo :: ProgramInfo -> ReaderT DatabaseConfig IO a -> IO a
- slurmWorkerLauncher :: Maybe Text -> FilePath -> HoldMap -> Int -> TokenPool -> SbatchOptions -> ProgramInfo -> WorkerLauncher JobId
- newWorkDir :: (Binary a, Typeable a, ToJSON a, HasProgramInfo env, HasDB env, MonadReader env m, MonadIO m, MonadCatch m) => a -> m FilePath
General comments
In this module we define the Cluster monad. It is nothing more than a Process with an environment ClusterEnv.

The ClusterEnv environment contains information about:

- the ProgramId of the current run,
- the paths to the database and log/data directories that we should use,
- options to use when using sbatch to spawn cluster jobs,
- data equivalent to DatabaseConfig to handle the database,
- a WorkerLauncher to launch remote workers. More precisely, a function clusterWorkerLauncher that takes SbatchOptions and ProgramInfo to produce a WorkerLauncher.
A ClusterEnv may be initialized with newClusterEnv, which uses slurmWorkerLauncher to initialize clusterWorkerLauncher. In this scenario the Cluster monad will operate in the following way. It will perform the calculations in the master process until some remote function is invoked, typically through remoteEval, at which point it will use sbatch and the current SbatchOptions to allocate a new job, and then it will run a single worker in that allocation.
This has the following consequences.
- Each time Cluster runs a remote function, it will schedule a new job with SLURM. If you run a lot of small remote functions (e.g., using Hyperion.Concurrently) in the Cluster monad, it means that you will schedule a lot of small jobs with SLURM. If your cluster's scheduling prioritizes small jobs, this may be a fine mode of operation (for example, this was the case on the now-defunct Hyperion cluster at IAS). More likely, though, it will lead to your jobs pending and the computation running slowly, especially if the remote functions are not run at the same time, but new ones are run when old ones finish (for example, if you try to perform a lot of parallel binary searches). For such cases the Job monad should be used.
- One should use nodes greater than 1 if either: (1) the job runs an external program that uses MPI or something similar and therefore can access all of the resources allocated by SLURM, or (2) the remote function spawns new hyperion workers using the Job monad. If your remote function does spawn new workers, then it may make sense to use nodes greater than 1, but your remote function needs to take into account the fact that the nodes are already allocated. For example, from the Cluster monad, we can run a remote computation in the Job monad, allocating it more than 1 node. The Job computation will automagically detect the nodes available to it and the number of CPUs on each node, and will create a WorkerCpuPool that will manage these resources independently of SLURM. One can then run remote functions on these resources from the Job computation without having to wait for SLURM scheduling. See Hyperion.Job for details.
The common use case is that a Cluster computation is run on the login node. It then schedules a job with a bunch of resources with SLURM. When the job starts, a Job calculation runs on one of the allocated nodes. It then spawns Process computations on the resources available to the job, which it manages via WorkerCpuPool.
Besides the Cluster monad, this module defines slurmWorkerLauncher and some utility functions for working with ClusterEnv and ProgramInfo, along with a few others.
Documentation
data ProgramInfo Source #
Type containing information about our program
Constructors
ProgramInfo
Fields
Instances
data ClusterEnv Source #
The environment for the Cluster monad.
Constructors
ClusterEnv
Instances
HasWorkerLauncher ClusterEnv Source # | We make |
Defined in Hyperion.Cluster Methods toWorkerLauncher :: ClusterEnv -> WorkerLauncher JobId Source # | |
HasDB ClusterEnv Source # | |
Defined in Hyperion.Cluster Methods dbConfigLens :: Lens' ClusterEnv DatabaseConfig Source # | |
HasProgramInfo ClusterEnv Source # | |
Defined in Hyperion.Cluster Methods toProgramInfo :: ClusterEnv -> ProgramInfo Source # | |
class HasProgramInfo a where Source #
Methods
toProgramInfo :: a -> ProgramInfo Source #
Instances
HasProgramInfo ClusterEnv Source # | |
Defined in Hyperion.Cluster Methods toProgramInfo :: ClusterEnv -> ProgramInfo Source # | |
HasProgramInfo JobEnv Source # | |
Defined in Hyperion.Job Methods toProgramInfo :: JobEnv -> ProgramInfo Source # |
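Both ClusterEnv and JobEnv are instances (see above). A custom environment type can also be made an instance, so that HasProgramInfo-constrained functions such as newWorkDir work with it. A minimal sketch, where MyEnv and its fields are hypothetical names:

```haskell
import Hyperion.Cluster (HasProgramInfo (..), ProgramInfo)

-- Hypothetical environment that carries a ProgramInfo alongside other data.
data MyEnv = MyEnv
  { myProgramInfo :: ProgramInfo
  , myLabel       :: String
  }

instance HasProgramInfo MyEnv where
  toProgramInfo = myProgramInfo
```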
type Cluster = ReaderT ClusterEnv Process Source #
The Cluster monad. It is simply Process with a ClusterEnv environment.
data MPIJob Source #
Type representing resources for an MPI job.
Constructors
MPIJob
Fields
mpiNodes :: Int
mpiNTasksPerNode :: Int
Instances
Eq MPIJob Source # | |
Ord MPIJob Source # | |
Show MPIJob Source # | |
Generic MPIJob Source # | |
ToJSON MPIJob Source # | |
Defined in Hyperion.Cluster | |
FromJSON MPIJob Source # | |
Binary MPIJob Source # | |
type Rep MPIJob Source # | |
Defined in Hyperion.Cluster type Rep MPIJob = D1 ('MetaData "MPIJob" "Hyperion.Cluster" "hyperion-0.1.0.0-BChDBJtiU1m4GBpewNuAxw" 'False) (C1 ('MetaCons "MPIJob" 'PrefixI 'True) (S1 ('MetaSel ('Just "mpiNodes") 'NoSourceUnpackedness 'NoSourceStrictness 'DecidedLazy) (Rec0 Int) :*: S1 ('MetaSel ('Just "mpiNTasksPerNode") 'NoSourceUnpackedness 'NoSourceStrictness 'DecidedLazy) (Rec0 Int))) |
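For illustration, a value of this type might look like the following sketch (the numbers are arbitrary); it can then be passed to setJobType, documented below, to request an MPI-style allocation:

```haskell
import Hyperion.Cluster (MPIJob (..))

-- Illustrative resource request: 2 nodes with 8 MPI tasks per node.
twoNodeJob :: MPIJob
twoNodeJob = MPIJob { mpiNodes = 2, mpiNTasksPerNode = 8 }
```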
runCluster :: ClusterEnv -> Cluster a -> IO a Source #
modifyJobOptions :: (SbatchOptions -> SbatchOptions) -> ClusterEnv -> ClusterEnv Source #
setJobOptions :: SbatchOptions -> ClusterEnv -> ClusterEnv Source #
setJobTime :: NominalDiffTime -> ClusterEnv -> ClusterEnv Source #
setJobMemory :: Text -> ClusterEnv -> ClusterEnv Source #
setJobType :: MPIJob -> ClusterEnv -> ClusterEnv Source #
setSlurmPartition :: Text -> ClusterEnv -> ClusterEnv Source #
setSlurmConstraint :: Text -> ClusterEnv -> ClusterEnv Source #
setSlurmAccount :: Text -> ClusterEnv -> ClusterEnv Source #
setSlurmQos :: Text -> ClusterEnv -> ClusterEnv Source #
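Since each setter has type ... -> ClusterEnv -> ClusterEnv, the setters compose directly with (.) before the environment is handed to runCluster. A minimal sketch, assuming a ClusterEnv named env obtained elsewhere (e.g. via newClusterEnv); the partition name and resource numbers are illustrative:

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Hyperion.Cluster

-- Tweak sbatch options, then run a Cluster computation with the result.
runWithOptions :: ClusterEnv -> Cluster a -> IO a
runWithOptions env =
  runCluster
    ( setJobType MPIJob { mpiNodes = 1, mpiNTasksPerNode = 4 }
    . setJobTime (2 * 3600)        -- NominalDiffTime in seconds: 2 hours
    . setJobMemory "8G"
    . setSlurmPartition "compute"  -- illustrative partition name
    $ env
    )
```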
defaultDBRetries :: Int Source #
The default number of retries to use in withConnectionRetry. Set to 20.
dbConfigFromProgramInfo :: ProgramInfo -> IO DatabaseConfig Source #
runDBWithProgramInfo :: ProgramInfo -> ReaderT DatabaseConfig IO a -> IO a Source #
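A minimal usage sketch: the body passed to runDBWithProgramInfo is only a placeholder here; a real action would be built from Hyperion's database primitives in ReaderT DatabaseConfig IO.

```haskell
import Hyperion.Cluster (ProgramInfo, runDBWithProgramInfo)

-- Run a (placeholder) database action with the configuration derived
-- from the given ProgramInfo.
checkDB :: ProgramInfo -> IO ()
checkDB progInfo = runDBWithProgramInfo progInfo (pure ())
```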
slurmWorkerLauncher Source #
Arguments
:: Maybe Text | Email address to send notifications to if sbatch fails or there is an error in a remote job. |
-> FilePath | Path to this hyperion executable |
-> HoldMap | HoldMap used by the HoldServer |
-> Int | Port used by the HoldServer (needed for error messages) |
-> TokenPool | TokenPool for throttling the number of submitted jobs |
-> SbatchOptions | |
-> ProgramInfo | |
-> WorkerLauncher JobId |
newWorkDir :: (Binary a, Typeable a, ToJSON a, HasProgramInfo env, HasDB env, MonadReader env m, MonadIO m, MonadCatch m) => a -> m FilePath Source #
Construct a working directory for the given object, using its ObjectId. It will be a subdirectory of programDataDir, created automatically and saved in the database.
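A minimal sketch, with a hypothetical Params type; ClusterEnv satisfies HasProgramInfo and HasDB (see above), so the call can be made directly in the Cluster monad:

```haskell
{-# LANGUAGE DeriveAnyClass #-}
{-# LANGUAGE DeriveGeneric  #-}

import Data.Aeson       (ToJSON)
import Data.Binary      (Binary)
import GHC.Generics     (Generic)
import Hyperion.Cluster (Cluster, newWorkDir)

-- Hypothetical parameter type; it needs Binary, Typeable and ToJSON so that
-- newWorkDir can key the working directory on its serialized value.
data Params = Params { paramN :: Int, paramLabel :: String }
  deriving (Generic, Binary, ToJSON)

workDirFor :: Params -> Cluster FilePath
workDirFor = newWorkDir
```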