::MPICH JOB LAUNCHER

DCOM servers are installed on each machine that will provide CPUs for parallel applications. 

MPIRun.exe must be used to launch the processes. 

MPIRun first looks in the registry to see whether the current user has saved an account and password. If the account information exists, MPIRun uses it to launch the processes; otherwise it prompts the user for an account and password. MPIRegister.exe encrypts the account information and stores it in the registry on a per-user basis, in the CURRENT_USER section. A batch server would need to either run MPIRegister once for the user it is logged in as, or pass the account and password to MPIRun each time it is executed.
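The lookup order described above (saved credentials first, prompt as the fallback) can be sketched as follows. This is only an illustration: the `store` dictionary stands in for the per-user registry section, the function name `resolve_credentials` is hypothetical, and no encryption is shown.

```python
import getpass


def resolve_credentials(store, user, prompt=None):
    """Return (account, password), preferring a saved entry over prompting.

    `store` is a plain dict standing in for the per-user registry section;
    the real tool keeps the entry encrypted, which this sketch does not do.
    """
    saved = store.get(user)
    if saved is not None:
        return saved  # saved account information exists: use it directly

    if prompt is None:
        # Interactive fallback: ask the user for both values.
        def prompt():
            return input("account: "), getpass.getpass("password: ")

    return prompt()
```

A batch server corresponds to the case where the store already holds an entry for the server's user, so the prompt is never reached.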

MPIRun creates a job object on every machine that will launch processes. The user either provides a configuration file that specifies which machines to launch on, or MPIRun generates a list from installation information stored in the registry. MPIRun connects to the MPIJob COM object on the first host in the list and asks it to create a job, specifying the rest of the machines. The MPIJob object then connects to all the other machines in the list.

MPIRun can push an executable out to the remote hosts if it has not been copied to their local file systems. The COM object that actually creates the process on each machine runs as a service. On NT4 this means the executable must live on the local host, because the SYSTEM account does not have access to network resources and cannot resolve shared network file names. That is why I built the push capability into the launcher: the executable only needs to exist on the host where MPIRun is executed. The MPIJob object creates a temporary file on each host, copies the executable into it, executes it, and deletes the temporary file after the executable terminates. As a result, argv[0] will not be the same on each host, and if the current path does not exist on a machine, the process there will not run in the current directory.
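The copy, execute, delete cycle can be illustrated with a small local sketch. It is not the MPIJob implementation: it runs on a single machine, uses a Python script as a stand-in for the pushed executable, and the function name `run_pushed_copy` is hypothetical. It does show why argv[0] ends up being a temporary path rather than the original name.

```python
import os
import shutil
import subprocess
import sys
import tempfile


def run_pushed_copy(source_path, args=()):
    """Copy a program to a temporary file, run the copy, then delete it.

    Returns (temp_path, stdout) so the caller can see which path was run.
    """
    suffix = os.path.splitext(source_path)[1]
    fd, temp_path = tempfile.mkstemp(suffix=suffix)
    os.close(fd)  # only the reserved name is needed; copyfile reopens it
    try:
        shutil.copyfile(source_path, temp_path)
        # A Python script stands in for the executable, so it is started
        # through the current interpreter. Its sys.argv[0] is temp_path.
        result = subprocess.run(
            [sys.executable, temp_path, *args],
            capture_output=True,
            text=True,
            check=True,
        )
        return temp_path, result.stdout.strip()
    finally:
        os.remove(temp_path)  # the temporary copy goes away after the run
```

If the pushed script prints `sys.argv[0]`, the value it reports is the temporary path, not the path of the original file.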

When the MPIJob object launches the processes, it passes the account and password to each host. Each executable is launched by calling LogonUser() and CreateProcessAsUser(). Security is determined by MPIRun. DCOM provides various levels of security, from none at all to validating and encrypting every packet. MPIRun sets this security level each time it is run. Currently I have picked a middle value as the default, but MPIRun will have a flag that lets the user choose the level they want.

The launcher is tied to this implementation of MPICH because it communicates with the first launched process in a particular way. The first process launched creates a listening socket, acquires a port, and communicates this port back to the launcher. The launcher then starts the rest of the processes, informing them of the first process's listening port through an environment variable. This allows MPICH jobs to be launched without any communication service or reserved port. The drawback is that it prohibits the use of launchers that cannot communicate with the root process in this manner.
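The handshake can be sketched with ordinary sockets and subprocesses. Everything named here is an assumption made for illustration: the environment variable `SKETCH_ROOT_PORT`, the use of standard output to report the port back, and the one-line rank message are not taken from MPICH itself. All processes run on the local machine.

```python
import os
import subprocess
import sys

ROOT_PORT_ENV = "SKETCH_ROOT_PORT"  # hypothetical name, not the real variable

# Root process: bind to port 0 so the OS picks a free port, report it,
# then accept one connection per remaining process.
ROOT_CODE = r"""
import socket, sys
n_others = int(sys.argv[1])
srv = socket.socket()
srv.settimeout(10)                       # do not wait forever in a sketch
srv.bind(("127.0.0.1", 0))
srv.listen(max(n_others, 1))
print(srv.getsockname()[1], flush=True)  # tell the launcher which port we got
ranks = []
for _ in range(n_others):
    conn, _addr = srv.accept()
    with conn, conn.makefile() as f:
        ranks.append(int(f.readline()))
print(sorted(ranks), flush=True)
"""

# Other processes: read the port from the environment and connect to the root.
WORKER_CODE = r"""
import os, socket, sys
port = int(os.environ["SKETCH_ROOT_PORT"])
with socket.create_connection(("127.0.0.1", port)) as s:
    s.sendall((sys.argv[1] + "\n").encode())
"""


def launch(n_procs):
    """Start a root process, learn its port, then start the other processes."""
    root = subprocess.Popen(
        [sys.executable, "-c", ROOT_CODE, str(n_procs - 1)],
        stdout=subprocess.PIPE,
        text=True,
    )
    port = int(root.stdout.readline())  # first line from the root is its port
    env = dict(os.environ, **{ROOT_PORT_ENV: str(port)})
    workers = [
        subprocess.Popen([sys.executable, "-c", WORKER_CODE, str(rank)], env=env)
        for rank in range(1, n_procs)
    ]
    for w in workers:
        w.wait()
    report = root.stdout.readline().strip()  # ranks the root heard from
    root.wait()
    return port, report
```

Because the root binds to port 0, no port has to be reserved in advance; the launcher only learns the number after the root has started, which is why the root must be launched first.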

Output from each process is merged into MPIRun's standard output. Both stderr and stdout are merged into a single stream. The user can press Ctrl-C to break MPIRun, and an Abort call will be sent to the MPIJob object, killing all the remote processes. A batch server could send a break event to MPIRun if the process has exceeded its allotted time.
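A local sketch of the merging and abort behaviour, under stated assumptions: the processes are ordinary local subprocesses rather than remote ones, `kill()` stands in for the Abort call, and the function name `run_merged` is hypothetical.

```python
import subprocess
import sys
import threading


def run_merged(commands, sink):
    """Run several commands, feeding every output line to one sink.

    Each child's stderr is redirected into its stdout, and all children
    share the same sink, so the caller sees a single merged stream.
    Returns the list of exit codes.
    """
    procs = [
        subprocess.Popen(
            cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
        )
        for cmd in commands
    ]
    lock = threading.Lock()

    def pump(proc):
        for line in proc.stdout:
            with lock:  # keep lines from different children intact
                sink(line)

    threads = [threading.Thread(target=pump, args=(p,)) for p in procs]
    for t in threads:
        t.start()
    try:
        for t in threads:
            t.join()
    except KeyboardInterrupt:
        # Ctrl-C: abort every child (stand-in for the Abort call).
        for p in procs:
            p.kill()
        for t in threads:
            t.join()
        raise
    return [p.wait() for p in procs]
```

Passing `sys.stdout.write` as the sink reproduces the behaviour of writing everything to the launcher's own standard output. The relative order of lines from different processes is not guaranteed.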
