hyperion-0.1.0.0
Safe Haskell: None
Language: Haskell2010

Hyperion.WorkerCpuPool

General comments

This module defines WorkerCpuPool, a datatype that lets hyperion manage the resources allocated to it by SLURM. The only resources managed are the CPUs on the allocated nodes. The module assumes that the same number of CPUs has been allocated on every node assigned to the job.

A WorkerCpuPool is essentially a TVar containing a Map from node addresses to the number of CPUs available on each node. An address can refer to a remote node or to the local node on which the WorkerCpuPool is hosted.

The most important function defined in this module is withWorkerAddr, which allocates the requested number of CPUs from the pool on a single node and runs a user function with the address of that node. The allocation mechanism is very simple: it allocates CPUs on the worker that has the most idle CPUs.
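The pool and its allocation rule can be sketched as follows. This is a minimal illustration under stated assumptions, not hyperion's implementation: Addr, Pool, newPool, acquire, release and withAddr are hypothetical names, addresses are plain strings, and the CPU count per node is passed in as an argument rather than read from the SLURM environment.

```haskell
import Control.Concurrent.STM
  (TVar, atomically, modifyTVar', newTVarIO, readTVar, retry)
import Control.Exception (bracket)
import Data.List (maximumBy)
import qualified Data.Map.Strict as Map
import Data.Ord (comparing)

-- Hypothetical stand-ins for the module's WorkerAddr and WorkerCpuPool.
type Addr = String
newtype Pool = Pool (TVar (Map.Map Addr Int))

-- Every node starts with the same number of idle CPUs.
newPool :: Int -> [Addr] -> IO Pool
newPool cpusPerNode addrs =
  Pool <$> newTVarIO (Map.fromList [(a, cpusPerNode) | a <- addrs])

-- Pick the node with the most idle CPUs. If even that node has fewer
-- than n idle CPUs, block (STM retry) until the pool changes.
-- A request larger than cpusPerNode therefore never completes.
acquire :: Pool -> Int -> IO Addr
acquire (Pool var) n = atomically $ do
  free <- readTVar var
  case Map.toList free of
    [] -> retry
    entries -> do
      let (addr, idle) = maximumBy (comparing snd) entries
      if idle < n
        then retry
        else do
          modifyTVar' var (Map.adjust (subtract n) addr)
          pure addr

-- Return n CPUs to the given node.
release :: Pool -> Int -> Addr -> IO ()
release (Pool var) n addr =
  atomically $ modifyTVar' var (Map.adjust (+ n) addr)

-- Run an action with an allocated node; bracket returns the CPUs to
-- the pool even if the action throws.
withAddr :: Pool -> Int -> (Addr -> IO a) -> IO a
withAddr pool n = bracket (acquire pool n) (release pool n)
```

In this sketch, a call such as withAddr pool 2 (\addr -> ...) holds two CPUs on the chosen node for the duration of the callback and returns them afterwards.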

We also provide sshRunCmd for running commands on the nodes via ssh.

WorkerCpuPool documentation

newtype NumCPUs

A newtype for the number of available CPUs.

Constructors

NumCPUs Int 

Instances

Eq NumCPUs

Defined in Hyperion.WorkerCpuPool

Methods

(==) :: NumCPUs -> NumCPUs -> Bool

(/=) :: NumCPUs -> NumCPUs -> Bool

Num NumCPUs

Defined in Hyperion.WorkerCpuPool

Ord NumCPUs

Defined in Hyperion.WorkerCpuPool

data WorkerCpuPool

The WorkerCpuPool type, containing a map of available CPU resources.

Constructors

WorkerCpuPool 

getAddrs :: WorkerCpuPool -> IO [WorkerAddr]

Gets a list of all WorkerAddr values registered in the WorkerCpuPool.

data WorkerAddr

A WorkerAddr represents a node address. It can be a remote node or the local node.

getSlurmAddrs :: IO [WorkerAddr]

Reads the system environment to obtain the list of nodes allocated to the job. If the local node is in the list, it is recorded as well, as LocalHost.

newJobPool :: [WorkerAddr] -> IO WorkerCpuPool

Reads the system environment to determine the number of CPUs available on each node (the same number on every node), and creates a new WorkerCpuPool for the given [WorkerAddr], assuming that all CPUs are available.

withWorkerAddr :: (MonadIO m, MonadMask m) => WorkerCpuPool -> NumCPUs -> (WorkerAddr -> m a) -> m a

Finds the worker with the most available CPUs and runs the given routine with the address of that worker. Blocks if the number of available CPUs is less than the number requested.

sshRunCmd documentation

type SSHCommand = Maybe (String, [String])

The type for the command used to run ssh. If it is a Just value, the first String gives the name of the ssh executable, e.g. "ssh", and the list of Strings gives the options to pass to it. For example, with SSHCommand given by Just ("XX", ["-a", "-b"]), ssh is run as

XX -a -b <addr> <command>

where <addr> is the remote address and <command> is the command we need to run there.

The value of Nothing is equivalent to using

ssh -f -o "UserKnownHostsFile /dev/null" <addr> <command>

We need the -o "..." option to deal with host key verification failures. We use -f to force ssh to go to the background before executing the sh call. This allows for a faster return from readCreateProcessWithExitCode.

Note that "UserKnownHostsFile /dev/null" doesn't seem to work on Helios. Using "StrictHostKeyChecking=no" instead seems to work there.
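The way an SSHCommand expands into a command line can be sketched as a pure function. This is an illustration only: sshInvocation is a hypothetical helper that is not part of the module, and it returns the executable and argument list without running anything.

```haskell
import Data.Maybe (fromMaybe)

-- Mirrors the SSHCommand synonym documented above.
type SSHCommand = Maybe (String, [String])

-- Hypothetical helper: the executable and arguments used to run
-- `cmd` on the host `addr`. Nothing falls back to the documented
-- default options.
sshInvocation :: SSHCommand -> String -> String -> (String, [String])
sshInvocation sshCmd addr cmd = (exe, opts ++ [addr, cmd])
  where
    (exe, opts) = fromMaybe defaultSsh sshCmd
    defaultSsh = ("ssh", ["-f", "-o", "UserKnownHostsFile /dev/null"])
```

On a cluster where the default option does not work, the alternative mentioned above would be passed as Just ("ssh", ["-f", "-o", "StrictHostKeyChecking=no"]).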

sshRunCmd :: String -> SSHCommand -> (String, [String]) -> IO ()

Runs the given command with the given arguments on the remote host (whose address is given by the first String) via ssh, using the SSHCommand. Makes at most 10 attempts via retryRepeated. If all attempts fail, propagates SSHError to the caller.

ssh needs to be able to authenticate on the remote node without a password. In practice you will probably need to set up public key authentication.

ssh is invoked to run sh, which calls nohup to run the supplied command in the background.
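The bounded-retry behaviour described for sshRunCmd can be illustrated with a small helper. This is a sketch under assumptions, not hyperion's retryRepeated: retryUpTo is a hypothetical name, it retries immediately with no delay between attempts, and it treats any exception as a failure rather than only SSHError.

```haskell
import Control.Exception (SomeException, throwIO, try)

-- Hypothetical helper: run an action up to n times, returning the
-- first successful result and rethrowing the last exception if every
-- attempt fails. Catching SomeException also catches asynchronous
-- exceptions, which real code would usually want to let through.
retryUpTo :: Int -> IO a -> IO a
retryUpTo n action = do
  result <- try action
  case result of
    Right x -> pure x
    Left e
      | n > 1     -> retryUpTo (n - 1) action
      | otherwise -> throwIO (e :: SomeException)
```

With this helper, retryUpTo 10 applied to an ssh invocation would give the "at most 10 attempts" behaviour described above, with the final failure reaching the caller.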