hyperion-0.1.0.0
Safe Haskell: None
Language: Haskell2010

Hyperion.Cluster

Synopsis

General comments

In this module we define the Cluster monad. It is nothing more than Process with a ClusterEnv environment.

The ClusterEnv environment contains, among other things, the worker launcher used to spawn remote workers (clusterWorkerLauncher), enough information to build a DatabaseConfig, and the ProgramInfo of the running program.

A ClusterEnv may be initialized with newClusterEnv, which uses slurmWorkerLauncher to initialize clusterWorkerLauncher. In this scenario the Cluster monad operates in the following way. It performs the calculations in the master process until some remote function is invoked, typically through remoteEval, at which point it uses sbatch with the current SbatchOptions to allocate a new job, and then runs a single worker in that allocation.

This has the following consequences.

  • Each time Cluster runs a remote function, it schedules a new job with SLURM. If you run a lot of small remote functions (e.g., using Hyperion.Concurrently) in the Cluster monad, you will schedule a lot of small jobs with SLURM. If your cluster's scheduling prioritizes small jobs, this may be a fine mode of operation (for example, this was the case on the now-defunct Hyperion cluster at IAS). More likely, though, it will lead to your jobs pending and the computation running slowly, especially if the remote functions are not all run at the same time, but new ones are started as old ones finish (for example, if you try to perform a lot of parallel binary searches). For such cases, the Job monad should be used.
  • One should use nodes greater than 1 only if either (1) the job runs an external program that uses MPI or something similar and can therefore use all of the resources allocated by SLURM, or (2) the remote function spawns new hyperion workers using the Job monad. In case (2), your remote function needs to take into account that the nodes are already allocated. For example, from the Cluster monad we can run a remote computation in the Job monad, allocating it more than one node. The Job computation will automagically detect the nodes available to it and the number of CPUs on each node, and will create a WorkerCpuPool that manages these resources independently of SLURM. One can then run remote functions on these resources from the Job computation without having to wait for SLURM scheduling. See Hyperion.Job for details.

The common use case is that a Cluster computation is run on the login node. It schedules a job with a bunch of resources via SLURM. When the job starts, a Job computation runs on one of the allocated nodes. It then spawns Process computations on the resources available to the job, which it manages via a WorkerCpuPool, as sketched below.
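The sketch below illustrates this shape. It is a structural sketch only, not compilable hyperion code: runRemoteJob, parallelOverWorkers, computeOne, Input, and Result are hypothetical stand-ins for the remoteEval/Static-closure plumbing, which is elided; only Cluster, Job, and MPIJob are names from this package and Hyperion.Job.

-- Structural sketch of the Cluster -> Job -> workers flow described above.
driver :: [Input] -> Cluster [Result]
driver inputs = do
  -- Runs on the login node. Each remote invocation from Cluster submits
  -- one sbatch job; here we request two whole nodes and hand them to a
  -- single Job computation.
  let resources = MPIJob { mpiNodes = 2, mpiNTasksPerNode = 1 }
  runRemoteJob resources (jobBody inputs)

jobBody :: [Input] -> Job [Result]
jobBody inputs =
  -- Runs on one of the allocated nodes. The Job monad detects the nodes
  -- and CPUs available to it and manages them with a WorkerCpuPool, so
  -- these per-input computations do not go back through SLURM.
  parallelOverWorkers computeOne inputs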

Besides the Cluster monad, this module defines slurmWorkerLauncher and some utility functions for working with ClusterEnv and ProgramInfo, along with a few others.

Documentation

 

data ProgramInfo Source #

Type containing information about our program

Instances

Eq ProgramInfo Source # 
Instance details

Defined in Hyperion.Cluster

Ord ProgramInfo Source # 
Instance details

Defined in Hyperion.Cluster

Show ProgramInfo Source # 
Instance details

Defined in Hyperion.Cluster

Generic ProgramInfo Source # 
Instance details

Defined in Hyperion.Cluster

Associated Types

type Rep ProgramInfo :: Type -> Type #

ToJSON ProgramInfo Source # 
Instance details

Defined in Hyperion.Cluster

FromJSON ProgramInfo Source # 
Instance details

Defined in Hyperion.Cluster

Binary ProgramInfo Source # 
Instance details

Defined in Hyperion.Cluster

Static (Binary ProgramInfo) Source # 
Instance details

Defined in Hyperion.Cluster

type Rep ProgramInfo Source # 
Instance details

Defined in Hyperion.Cluster

type Rep ProgramInfo = D1 ('MetaData "ProgramInfo" "Hyperion.Cluster" "hyperion-0.1.0.0-BChDBJtiU1m4GBpewNuAxw" 'False) (C1 ('MetaCons "ProgramInfo" 'PrefixI 'True) ((S1 ('MetaSel ('Just "programId") 'NoSourceUnpackedness 'NoSourceStrictness 'DecidedLazy) (Rec0 ProgramId) :*: S1 ('MetaSel ('Just "programDatabase") 'NoSourceUnpackedness 'NoSourceStrictness 'DecidedLazy) (Rec0 FilePath)) :*: (S1 ('MetaSel ('Just "programLogDir") 'NoSourceUnpackedness 'NoSourceStrictness 'DecidedLazy) (Rec0 FilePath) :*: (S1 ('MetaSel ('Just "programDataDir") 'NoSourceUnpackedness 'NoSourceStrictness 'DecidedLazy) (Rec0 FilePath) :*: S1 ('MetaSel ('Just "programSSHCommand") 'NoSourceUnpackedness 'NoSourceStrictness 'DecidedLazy) (Rec0 SSHCommand)))))
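For readability, the generic representation above corresponds to the following record shape (reconstructed from the Rep metadata, not quoted from the source):

data ProgramInfo = ProgramInfo
  { programId         :: ProgramId
  , programDatabase   :: FilePath
  , programLogDir     :: FilePath
  , programDataDir    :: FilePath
  , programSSHCommand :: SSHCommand
  }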

data ClusterEnv Source #

The environment for the Cluster monad.

Instances

HasWorkerLauncher ClusterEnv Source #

We make ClusterEnv an instance of HasWorkerLauncher. This makes Cluster an instance of HasWorkers and gives us access to functions in Hyperion.Remote.

Instance details

Defined in Hyperion.Cluster

HasDB ClusterEnv Source #

ClusterEnv is an instance of HasDB since it contains info that is sufficient to build a DatabaseConfig.

Instance details

Defined in Hyperion.Cluster

HasProgramInfo ClusterEnv Source # 
Instance details

Defined in Hyperion.Cluster

class HasProgramInfo a where Source #

Instances

HasProgramInfo ClusterEnv Source # 
Instance details

Defined in Hyperion.Cluster

HasProgramInfo JobEnv Source # 
Instance details

Defined in Hyperion.Job

type Cluster = ReaderT ClusterEnv Process Source #

The Cluster monad. It is simply Process with a ClusterEnv environment.
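Since Cluster is just a ReaderT over Process, the usual mtl/transformers combinators (ask, asks, local, liftIO) apply directly. A minimal sketch, assuming an accessor named clusterProgramInfo on ClusterEnv (the real field name may differ); programDataDir is the ProgramInfo field shown above:

import Control.Monad.Reader (asks)
import Control.Monad.IO.Class (liftIO)

-- Print the program's data directory from within a Cluster computation.
-- 'clusterProgramInfo' is an assumed accessor name.
printDataDir :: Cluster ()
printDataDir = do
  dataDir <- asks (programDataDir . clusterProgramInfo)
  liftIO (putStrLn ("data directory: " ++ dataDir))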

data MPIJob Source #

Type representing resources for an MPI job.

Constructors

MPIJob 

Instances

Eq MPIJob Source # 
Instance details

Defined in Hyperion.Cluster

Methods

(==) :: MPIJob -> MPIJob -> Bool #

(/=) :: MPIJob -> MPIJob -> Bool #

Ord MPIJob Source # 
Instance details

Defined in Hyperion.Cluster

Show MPIJob Source # 
Instance details

Defined in Hyperion.Cluster

Generic MPIJob Source # 
Instance details

Defined in Hyperion.Cluster

Associated Types

type Rep MPIJob :: Type -> Type #

Methods

from :: MPIJob -> Rep MPIJob x #

to :: Rep MPIJob x -> MPIJob #

ToJSON MPIJob Source # 
Instance details

Defined in Hyperion.Cluster

FromJSON MPIJob Source # 
Instance details

Defined in Hyperion.Cluster

Binary MPIJob Source # 
Instance details

Defined in Hyperion.Cluster

Methods

put :: MPIJob -> Put #

get :: Get MPIJob #

putList :: [MPIJob] -> Put #

type Rep MPIJob Source # 
Instance details

Defined in Hyperion.Cluster

type Rep MPIJob = D1 ('MetaData "MPIJob" "Hyperion.Cluster" "hyperion-0.1.0.0-BChDBJtiU1m4GBpewNuAxw" 'False) (C1 ('MetaCons "MPIJob" 'PrefixI 'True) (S1 ('MetaSel ('Just "mpiNodes") 'NoSourceUnpackedness 'NoSourceStrictness 'DecidedLazy) (Rec0 Int) :*: S1 ('MetaSel ('Just "mpiNTasksPerNode") 'NoSourceUnpackedness 'NoSourceStrictness 'DecidedLazy) (Rec0 Int)))
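Similarly, the generic representation corresponds to the following record shape (reconstructed from the Rep metadata above):

data MPIJob = MPIJob
  { mpiNodes         :: Int
  , mpiNTasksPerNode :: Int
  }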

defaultDBRetries :: Int Source #

The default number of retries to use in withConnectionRetry. Set to 20.

slurmWorkerLauncher Source #

Arguments

:: Maybe Text

Email address to send notifications to if sbatch fails or there is an error in a remote job. Nothing means no emails will be sent.

-> FilePath

Path to this hyperion executable

-> HoldMap

HoldMap used by the HoldServer

-> Int

Port used by the HoldServer (needed for error messages)

-> TokenPool

TokenPool for throttling the number of submitted jobs

-> SbatchOptions 
-> ProgramInfo 
-> WorkerLauncher JobId 
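A sketch of supplying these arguments in the order listed above; holdMap, holdPort, and tokenPool are illustrative names assumed to be in scope, not exports of this module, and the string literal assumes OverloadedStrings:

-- Partially apply slurmWorkerLauncher; the result takes the SbatchOptions
-- and ProgramInfo and yields a WorkerLauncher JobId.
mkLauncher :: SbatchOptions -> ProgramInfo -> WorkerLauncher JobId
mkLauncher =
  slurmWorkerLauncher
    (Just "user@example.com")      -- notification email; Nothing disables emails
    "/path/to/hyperion-executable" -- path to this hyperion executable
    holdMap                        -- HoldMap used by the HoldServer
    holdPort                       -- port of the HoldServer, for error messages
    tokenPool                      -- throttles the number of submitted jobs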

newWorkDir :: (Binary a, Typeable a, ToJSON a, HasProgramInfo env, HasDB env, MonadReader env m, MonadIO m, MonadCatch m) => a -> m FilePath Source #

Construct a working directory for the given object, using its ObjectId. The directory will be a subdirectory of programDataDir; it is created automatically and its path is saved in the database.
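A usage sketch: because ClusterEnv has HasProgramInfo and HasDB instances, newWorkDir can be called directly in the Cluster monad. MyParams is a hypothetical parameter type, and the generic derivations assume the DeriveGeneric and DeriveAnyClass extensions:

-- Hypothetical parameter type; Binary and ToJSON instances are derived
-- generically.
data MyParams = MyParams { paramN :: Int }
  deriving (Generic, Binary, ToJSON)

-- Create a working directory keyed to 'params' and return its path; the
-- directory is a subdirectory of programDataDir and is recorded in the
-- database.
setupRun :: MyParams -> Cluster FilePath
setupRun params = do
  dir <- newWorkDir params
  -- ... write input files or launch work inside 'dir' ...
  pure dir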