hyperion-0.1.0.0
Safe Haskell: None
Language: Haskell2010

Hyperion.Cluster

Synopsis

General comments

In this module we define the Cluster monad. It is nothing more than Process with a ClusterEnv environment.

The ClusterEnv environment contains, among other things, the worker launcher used to spawn remote workers (clusterWorkerLauncher), enough information to build a DatabaseConfig, and the ProgramInfo of the running program.

A ClusterEnv may be initialized with newClusterEnv, which uses slurmWorkerLauncher to initialize clusterWorkerLauncher. In this scenario the Cluster monad operates in the following way. It performs the calculations in the master process until some remote function is invoked, typically through remoteEval, at which point it uses sbatch with the current SbatchOptions to allocate a new job, and then runs a single worker in that allocation.

This has the following consequences.

  • Each time Cluster runs a remote function, it schedules a new job with SLURM. If you run a lot of small remote functions (e.g., using Hyperion.Concurrently) in the Cluster monad, you will schedule a lot of small jobs with SLURM. If your cluster's scheduling prioritizes small jobs, this may be a fine mode of operation (for example, this was the case on the now-defunct Hyperion cluster at IAS). More likely, though, it will lead to your jobs pending and the computation running slowly, especially if the remote functions are not all run at the same time, but new ones are started as old ones finish (for example, if you try to perform a lot of parallel binary searches). For such cases, the Job monad should be used.
  • One should use nodes greater than 1 only if either (1) the job runs an external program that uses MPI or something similar and can therefore use all of the resources allocated by SLURM, or (2) the remote function spawns new hyperion workers using the Job monad. In case (2), your remote function needs to take into account that the nodes are already allocated. For example, from the Cluster monad we can run a remote computation in the Job monad, allocating it more than one node. The Job computation will automagically detect the nodes available to it and the number of CPUs on each node, and will create a WorkerCpuPool that manages these resources independently of SLURM. One can then run remote functions on these resources from the Job computation without having to wait for SLURM scheduling. See Hyperion.Job for details.

The common use case is that a Cluster computation is run on the login node. It schedules a job with a bunch of resources via SLURM. When the job starts, a Job computation runs on one of the allocated nodes. It then spawns Process computations on the resources available to the job, which it manages via a WorkerCpuPool, as sketched below.
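The sketch below illustrates this shape. It is a structural sketch only, not compilable hyperion code: runRemoteJob, parallelOverWorkers, computeOne, Input, and Result are hypothetical stand-ins for the remoteEval/Static-closure plumbing, which is elided; only Cluster, Job, and MPIJob are names from this package and Hyperion.Job.

-- Structural sketch of the Cluster -> Job -> workers flow described above.
driver :: [Input] -> Cluster [Result]
driver inputs = do
  -- Runs on the login node. Each remote invocation from Cluster submits
  -- one sbatch job; here we request two whole nodes and hand them to a
  -- single Job computation.
  let resources = MPIJob { mpiNodes = 2, mpiNTasksPerNode = 1 }
  runRemoteJob resources (jobBody inputs)

jobBody :: [Input] -> Job [Result]
jobBody inputs =
  -- Runs on one of the allocated nodes. The Job monad detects the nodes
  -- and CPUs available to it and manages them with a WorkerCpuPool, so
  -- these per-input computations do not go back through SLURM.
  parallelOverWorkers computeOne inputs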

Besides the Cluster monad, this module defines slurmWorkerLauncher and some utility functions for working with ClusterEnv and ProgramInfo, along with a few others.

Documentation

 

data ProgramInfo Source #

Type containing information about our program

Instances

Eq ProgramInfo Source # 
Instance details

Defined in Hyperion.Cluster

Ord ProgramInfo Source # 
Instance details

Defined in Hyperion.Cluster

Show ProgramInfo Source # 
Instance details

Defined in Hyperion.Cluster

Generic ProgramInfo Source # 
Instance details

Defined in Hyperion.Cluster

Associated Types

type Rep ProgramInfo :: Type -> Type #

ToJSON ProgramInfo Source # 
Instance details

Defined in Hyperion.Cluster

FromJSON ProgramInfo Source # 
Instance details

Defined in Hyperion.Cluster

Binary ProgramInfo Source # 
Instance details

Defined in Hyperion.Cluster

Static (Binary ProgramInfo) Source # 
Instance details

Defined in Hyperion.Cluster

type Rep ProgramInfo Source # 
Instance details

Defined in Hyperion.Cluster

type Rep ProgramInfo = D1 ('MetaData "ProgramInfo" "Hyperion.Cluster" "hyperion-0.1.0.0-BChDBJtiU1m4GBpewNuAxw" 'False) (C1 ('MetaCons "ProgramInfo" 'PrefixI 'True) ((S1 ('MetaSel ('Just "programId") 'NoSourceUnpackedness 'NoSourceStrictness 'DecidedLazy) (Rec0 ProgramId) :*: S1 ('MetaSel ('Just "programDatabase") 'NoSourceUnpackedness 'NoSourceStrictness 'DecidedLazy) (Rec0 FilePath)) :*: (S1 ('MetaSel ('Just "programLogDir") 'NoSourceUnpackedness 'NoSourceStrictness 'DecidedLazy) (Rec0 FilePath) :*: (S1 ('MetaSel ('Just "programDataDir") 'NoSourceUnpackedness 'NoSourceStrictness 'DecidedLazy) (Rec0 FilePath) :*: S1 ('MetaSel ('Just "programSSHCommand") 'NoSourceUnpackedness 'NoSourceStrictness 'DecidedLazy) (Rec0 SSHCommand)))))
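For readability, the generic representation above corresponds to the following record shape (reconstructed from the Rep metadata, not quoted from the source):

data ProgramInfo = ProgramInfo
  { programId         :: ProgramId
  , programDatabase   :: FilePath
  , programLogDir     :: FilePath
  , programDataDir    :: FilePath
  , programSSHCommand :: SSHCommand
  }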

data ClusterEnv Source #

The environment for the Cluster monad.

Instances

HasWorkerLauncher ClusterEnv Source #

We make ClusterEnv an instance of HasWorkerLauncher. This makes Cluster an instance of HasWorkers and gives us access to functions in Hyperion.Remote.

Instance details

Defined in Hyperion.Cluster

HasDB ClusterEnv Source #

ClusterEnv is an instance of HasDB since it contains info that is sufficient to build a DatabaseConfig.

Instance details

Defined in Hyperion.Cluster

HasProgramInfo ClusterEnv Source # 
Instance details

Defined in Hyperion.Cluster

class HasProgramInfo a where Source #

Instances

HasProgramInfo ClusterEnv Source # 
Instance details

Defined in Hyperion.Cluster

HasProgramInfo JobEnv Source # 
Instance details

Defined in Hyperion.Job

type Cluster = ReaderT ClusterEnv Process Source #

The Cluster monad. It is simply Process with a ClusterEnv environment.
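Since Cluster is just a ReaderT over Process, the usual mtl/transformers combinators (ask, asks, local, liftIO) apply directly. A minimal sketch, assuming an accessor named clusterProgramInfo on ClusterEnv (the real field name may differ); programDataDir is the ProgramInfo field shown above:

import Control.Monad.Reader (asks)
import Control.Monad.IO.Class (liftIO)

-- Print the program's data directory from within a Cluster computation.
-- 'clusterProgramInfo' is an assumed accessor name.
printDataDir :: Cluster ()
printDataDir = do
  dataDir <- asks (programDataDir . clusterProgramInfo)
  liftIO (putStrLn ("data directory: " ++ dataDir))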

data MPIJob Source #

Type representing resources for an MPI job.

Constructors

MPIJob 

Instances

Eq MPIJob Source # 
Instance details

Defined in Hyperion.Cluster

Methods

(==) :: MPIJob -> MPIJob -> Bool #

(/=) :: MPIJob -> MPIJob -> Bool #

Ord MPIJob Source # 
Instance details

Defined in Hyperion.Cluster

Show MPIJob Source # 
Instance details

Defined in Hyperion.Cluster

Generic MPIJob Source # 
Instance details

Defined in Hyperion.Cluster

Associated Types

type Rep MPIJob :: Type -> Type #

Methods

from :: MPIJob -> Rep MPIJob x #

to :: Rep MPIJob x -> MPIJob #

ToJSON MPIJob Source # 
Instance details

Defined in Hyperion.Cluster

FromJSON MPIJob Source # 
Instance details

Defined in Hyperion.Cluster

Binary MPIJob Source # 
Instance details

Defined in Hyperion.Cluster

Methods

put :: MPIJob -> Put #

get :: Get MPIJob #

putList :: [MPIJob] -> Put #

type Rep MPIJob Source # 
Instance details

Defined in Hyperion.Cluster

type Rep MPIJob = D1 ('MetaData "MPIJob" "Hyperion.Cluster" "hyperion-0.1.0.0-BChDBJtiU1m4GBpewNuAxw" 'False) (C1 ('MetaCons "MPIJob" 'PrefixI 'True) (S1 ('MetaSel ('Just "mpiNodes") 'NoSourceUnpackedness 'NoSourceStrictness 'DecidedLazy) (Rec0 Int) :*: S1 ('MetaSel ('Just "mpiNTasksPerNode") 'NoSourceUnpackedness 'NoSourceStrictness 'DecidedLazy) (Rec0 Int)))
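Similarly, the generic representation corresponds to the following record shape (reconstructed from the Rep metadata above):

data MPIJob = MPIJob
  { mpiNodes         :: Int
  , mpiNTasksPerNode :: Int
  }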

defaultDBRetries :: Int Source #

The default number of retries to use in withConnectionRetry. Set to 20.

slurmWorkerLauncher Source #

Arguments

:: Maybe Text

Email address to send notifications to if sbatch fails or there is an error in a remote job. Nothing means no emails will be sent.

-> FilePath

Path to this hyperion executable

-> HoldMap

HoldMap used by the HoldServer

-> Int

Port used by the HoldServer (needed for error messages)

-> TokenPool

TokenPool for throttling the number of submitted jobs

-> SbatchOptions 
-> ProgramInfo 
-> WorkerLauncher JobId 
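A sketch of supplying these arguments in the order listed above; holdMap, holdPort, and tokenPool are illustrative names assumed to be in scope, not exports of this module, and the string literal assumes OverloadedStrings:

-- Partially apply slurmWorkerLauncher; the result takes the SbatchOptions
-- and ProgramInfo and yields a WorkerLauncher JobId.
mkLauncher :: SbatchOptions -> ProgramInfo -> WorkerLauncher JobId
mkLauncher =
  slurmWorkerLauncher
    (Just "user@example.com")      -- notification email; Nothing disables emails
    "/path/to/hyperion-executable" -- path to this hyperion executable
    holdMap                        -- HoldMap used by the HoldServer
    holdPort                       -- port of the HoldServer, for error messages
    tokenPool                      -- throttles the number of submitted jobs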

newWorkDir :: (Binary a, Typeable a, ToJSON a, HasProgramInfo env, HasDB env, MonadReader env m, MonadIO m, MonadCatch m) => a -> m FilePath Source #

Construct a working directory for the given object, using its ObjectId. The directory will be a subdirectory of programDataDir; it is created automatically and its path is saved in the database.
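A usage sketch: because ClusterEnv has HasProgramInfo and HasDB instances, newWorkDir can be called directly in the Cluster monad. MyParams is a hypothetical parameter type, and the generic derivations assume the DeriveGeneric and DeriveAnyClass extensions:

-- Hypothetical parameter type; Binary and ToJSON instances are derived
-- generically.
data MyParams = MyParams { paramN :: Int }
  deriving (Generic, Binary, ToJSON)

-- Create a working directory keyed to 'params' and return its path; the
-- directory is a subdirectory of programDataDir and is recorded in the
-- database.
setupRun :: MyParams -> Cluster FilePath
setupRun params = do
  dir <- newWorkDir params
  -- ... write input files or launch work inside 'dir' ...
  pure dir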