            Open Fabrics Enterprise Distribution (OFED)
                     MPI in OFED 1.4.0 README

                          December 2008

===============================================================================
Table of Contents
===============================================================================
1. Overview
2. MVAPICH
3. Open MPI
4. MVAPICH2

===============================================================================
1. Overview
===============================================================================
Three MPI stacks are included in this release of OFED:
- MVAPICH 1.1.0-3143
- Open MPI 1.2.8
- MVAPICH2 1.2p1

Setup, compilation and run information for MVAPICH, Open MPI and MVAPICH2 is
provided below in sections 2, 3 and 4 respectively.

1.1 Installation Note
---------------------
In Step 2 of the main menu of install.pl, options 2, 3 and 4 can install one
or more MPI stacks. Please refer to docs/OFED_Installation_Guide.txt to learn
about the different options.

The installation script allows each MPI to be compiled using one or more
compilers. For each MPI stack installed, users need to set PATH and/or
LD_LIBRARY_PATH to point at the desired compiled instance of that stack.

1.2 MPI Tests
-------------
OFED includes four basic tests that can be run against each MPI stack:
bandwidth (bw), latency (lt), Intel MPI Benchmark, and Presta. The tests are
located under: <INSTALL_DIR>/mpi/<compiler>/<mpi stack>/tests/, where
<INSTALL_DIR> is /usr by default.

1.3 Selecting Which MPI to Use: mpi-selector
--------------------------------------------
Depending on how the OFED installer was run, multiple different MPI
implementations may be installed on your system. The OFED installer will run
an MPI selector tool during the installation process, presenting a menu-based
interface to select which MPI implementation is set as the default for all
users.

This MPI selector tool can be re-run at any time by the administrator after
the OFED installer completes to modify the site-wide default MPI
implementation selection by invoking the "mpi-selector-menu" command (root
access is typically required to change the site-wide default).

The mpi-selector-menu command can also be used by non-administrative users to
override the site-wide default MPI implementation selection by setting a
per-user default. Specifically: unless a user runs the MPI selector tool to
set a per-user default, their environment will be set up for the site-wide
default MPI implementation.

Note that the default MPI selection does *not* affect the shell from which
the command was invoked (or any other shells that were already running when
the MPI selector tool was invoked). The default selection is only changed for
*new* shells started after the selector tool was invoked. It is recommended
that once the default MPI implementation is changed via the selector tool,
users log out and log in again to ensure that they have a consistent view of
the default MPI implementation. Other tools can be used to change the MPI
environment in the current shell, such as the environment modules software
package (which is not included in the OFED software package; see
http://modules.sourceforge.net/ for details).

Note that the site-wide default is set in a file that is typically not on a
networked filesystem, and is therefore specific to the host on which it was
run. As such, it is recommended to run the mpi-selector-menu command on all
hosts in a cluster, picking the same default MPI implementation on each.

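For example, a user could set a per-user default interactively and then
verify it from a new login shell. This is only a sketch; the mpicc path shown
in the output is illustrative and depends on which MPI instance is selected:

    $ mpi-selector-menu    # interactive; choose a per-user default
    $ bash -l              # or log out and log in again
    $ which mpicc
    /usr/mpi/gcc/openmpi-1.2.8/bin/mpicc
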
It may be more convenient, however, to use the mpi-selector command in
script-based scenarios (such as running on every host in a cluster);
mpi-selector provides all the same functionality as mpi-selector-menu, but is
intended for automated environments. See the mpi-selector(1) manual page for
more details.

Additionally, per-user defaults are set in a file in the user's $HOME
directory. If this directory is not on a network-shared filesystem between
all hosts that will be used for MPI applications, then it also needs to be
propagated to all relevant hosts.

Note: The MPI selector tool typically sets the PATH and/or LD_LIBRARY_PATH
for a given MPI implementation. This step can, of course, also be performed
manually by a user or on a site-wide basis. The MPI selector tool simply
bundles up this functionality in a convenient set of command line tools and
menus.

1.4 Updating MPI Installations
------------------------------
Note that all of the MPI implementations included in the OFED software
package are the versions that were available when OFED v1.4 was released.
They have been QA tested with this version of OFED and are fully supported.

However, note that administrators can go to the web sites of each MPI
implementation and download / install newer versions after OFED has been
successfully installed. There is nothing specific about the OFED-included MPI
software packages that prohibits installing newer/other MPI implementations.

It should also be noted that versions of MPI released after OFED v1.4 are not
supported by OFED. But since each MPI has its own release schedule and QA
process (each of which involves testing with the OFED stack), it may
sometimes be desirable -- or even advisable, depending on how old the MPI
implementations are that are included in OFED -- to download and install a
newer version of MPI.

The web sites of each MPI implementation are listed below:

- Open MPI: http://www.open-mpi.org/
- MVAPICH: http://mvapich.cse.ohio-state.edu/
- MVAPICH2: http://mvapich.cse.ohio-state.edu/overview/mvapich2/

===============================================================================
2. MVAPICH MPI
===============================================================================
This package is a 1.1.0 version of the MVAPICH software package, and is the
officially supported MPI stack for this release of OFED. See
http://mvapich.cse.ohio-state.edu for more details.

2.1 Setting up for MVAPICH
--------------------------
To launch MPI jobs, the MVAPICH installation directory needs to be included
in PATH and LD_LIBRARY_PATH. To set them, execute one of the following
commands:

  source <INSTALL_DIR>/mpi/<compiler>/<mvapich directory>/bin/mpivars.sh
    -- when using sh for launching MPI jobs

or

  source <INSTALL_DIR>/mpi/<compiler>/<mvapich directory>/bin/mpivars.csh
    -- when using csh for launching MPI jobs

2.2 Compiling MVAPICH Applications:
-----------------------------------
***Important note***: A valid Fortran compiler must be present in order to
build the MVAPICH MPI stack and tests.

The default gcc-g77 Fortran compiler is provided with all RedHat Linux
releases. SuSE distributions earlier than SuSE Linux 9.0 do not provide this
compiler as part of the default installation.

The following compilers are supported by OFED's MVAPICH package: Gcc, Intel,
Pathscale and PGI. The install script prompts the user to choose the compiler
with which to build the MVAPICH RPM. Note that more than one compiler can be
selected simultaneously, if desired.

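Once the environment has been set up as in section 2.1, applications are
built with the MVAPICH wrapper compilers. A minimal sketch, assuming the
gcc-built instance under the default prefix (the application file names are
illustrative):

  $ source /usr/mpi/gcc/mvapich-1.1.0/bin/mpivars.sh
  $ mpicc  -o my_mpi_app my_mpi_app.c      # C application
  $ mpif77 -o my_f77_app my_f77_app.f      # Fortran 77 application
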
For details see: http://mvapich.cse.ohio-state.edu/support

To review the default configuration of the installation, check the default
configuration file:
<INSTALL_DIR>/mpi/<compiler>/<mvapich directory>/etc/mvapich.conf

2.3 Running MVAPICH Applications:
---------------------------------
Requirements:
o At least two nodes. Example: mtlm01, mtlm02
o Machine file: Includes the list of machines. Example: /root/cluster
o Bidirectional rsh or ssh without a password

Note: ssh will be used unless -rsh is specified. In order to use rsh, add the
parameter -rsh to the mpirun_rsh command.

*** Running OSU tests ***

/usr/mpi/gcc/mvapich-1.1.0/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich-1.1.0/tests/osu_benchmarks-3.0/osu_bw
/usr/mpi/gcc/mvapich-1.1.0/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich-1.1.0/tests/osu_benchmarks-3.0/osu_latency
/usr/mpi/gcc/mvapich-1.1.0/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich-1.1.0/tests/osu_benchmarks-3.0/osu_bibw
/usr/mpi/gcc/mvapich-1.1.0/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich-1.1.0/tests/osu_benchmarks-3.0/osu_bcast

*** Running Intel MPI Benchmark test (Full test) ***

/usr/mpi/gcc/mvapich-1.1.0/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich-1.1.0/tests/IMB-3.1/IMB-MPI1

*** Running Presta test ***

/usr/mpi/gcc/mvapich-1.1.0/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich-1.1.0/tests/presta-1.4.0/com -o 100
/usr/mpi/gcc/mvapich-1.1.0/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich-1.1.0/tests/presta-1.4.0/glob -o 100
/usr/mpi/gcc/mvapich-1.1.0/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich-1.1.0/tests/presta-1.4.0/globalop

===============================================================================
3. Open MPI
===============================================================================
Open MPI is a next-generation MPI implementation from the Open MPI Project
(http://www.open-mpi.org/). Version 1.2.8 of Open MPI is included in this
release, which is also available directly from the main Open MPI web site.

A working Fortran compiler is not required to build Open MPI, but some of the
included MPI tests are written in Fortran. These tests will not compile/run
if Open MPI is built without Fortran support.

The following compilers are supported by OFED's Open MPI package: GNU,
Pathscale, Intel, and Portland. The install script prompts the user for the
compiler with which to build the Open MPI RPM. Note that more than one
compiler can be selected simultaneously, if desired.

Users should check the main Open MPI web site for additional documentation
and support. (Note: the FAQ on that site covers InfiniBand tuning among other
topics.)

3.1 Setting up for Open MPI
---------------------------
Selecting to use Open MPI via the mpi-selector-menu and mpi-selector tools
will perform all the necessary setup for users to build and run Open MPI
applications. If you use the MPI selector tools, you can skip the rest of
this section.

If you do not wish to use the MPI selector tools, the Open MPI team strongly
advises users to put the Open MPI installation directory in their PATH and
LD_LIBRARY_PATH. This can be done at the system level if all users are going
to use Open MPI. Specifically:

- add <OPENMPI_INSTALL_DIR>/bin to PATH
- add <OPENMPI_INSTALL_DIR>/lib to LD_LIBRARY_PATH

<OPENMPI_INSTALL_DIR> is the directory where the desired Open MPI instance
was installed ("instance" refers to the compiler used for Open MPI
compilation at install time).

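For example, for a gcc-built Open MPI 1.2.8 instance installed under the
default /usr prefix (adjust the path for your installation), the following
lines could be added to a shell startup file such as ~/.bashrc:

    # Illustrative paths; use the Open MPI instance you actually installed
    export PATH=/usr/mpi/gcc/openmpi-1.2.8/bin:$PATH
    export LD_LIBRARY_PATH=/usr/mpi/gcc/openmpi-1.2.8/lib:$LD_LIBRARY_PATH
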
If you are using a job scheduler to launch MPI jobs (e.g., SLURM, Torque),
setting the PATH and LD_LIBRARY_PATH is still required, but it does not need
to be set in your shell startup files. Procedures for adding these values to
PATH and LD_LIBRARY_PATH are described in detail at:

    http://www.open-mpi.org/faq/?category=running

3.2 Open MPI Installation Support / Updates
-------------------------------------------
The OFED package will install Open MPI with support for TCP, shared memory,
and the OpenFabrics network stacks. No other networks are supported by the
OFED Open MPI installation.

Open MPI supports a wide variety of run-time environments. The OFED installer
will not include support for all of them, however (e.g., Torque and PBS-based
environments are not supported by the OFED-installed Open MPI). The ompi_info
command can be used to see what support was installed; look for plugins for
your specific environment / network / etc. If you do not see them, the OFED
installer did not include support for them.

As described above, administrators or users can go to the Open MPI web site
and download / install either a newer version of Open MPI (if available), or
the same version with different configuration options (e.g., support for
Torque / PBS-based environments).

3.3 Compiling Open MPI Applications
-----------------------------------
(copied from http://www.open-mpi.org/faq/?category=mpi-apps -- see this web
page for more details)

The Open MPI team strongly recommends that you simply use Open MPI's
"wrapper" compilers to compile your MPI applications. That is, instead of
using (for example) gcc to compile your program, use mpicc. Open MPI provides
a wrapper compiler for four languages:

    Language       Wrapper compiler name
    -------------  --------------------------------
    C              mpicc
    C++            mpiCC, mpicxx, or mpic++
                   (note that mpiCC will not exist
                   on case-insensitive file-systems)
    Fortran 77     mpif77
    Fortran 90     mpif90
    -------------  --------------------------------

Note that if no Fortran 77 or Fortran 90 compilers were found when Open MPI
was built, Fortran 77 and 90 support will automatically be disabled
(respectively).

If you expect to compile your program as:

> gcc my_mpi_application.c -lmpi -o my_mpi_application

Simply use the following instead:

> mpicc my_mpi_application.c -o my_mpi_application

Specifically: simply adding "-lmpi" to your normal compile/link command line
*will not work*. See http://www.open-mpi.org/faq/?category=mpi-apps if you
cannot use the Open MPI wrapper compilers.

Note that Open MPI's wrapper compilers do not do any actual compiling or
linking; all they do is manipulate the command line and add in all the
relevant compiler / linker flags and then invoke the underlying compiler /
linker (hence, the name "wrapper" compiler). More specifically, if you run
into a compiler or linker error, check your source code and/or back-end
compiler -- it is usually not the fault of the Open MPI wrapper compiler.

3.4 Running Open MPI Applications:
----------------------------------
Open MPI uses either the "mpirun" or "mpiexec" commands to launch
applications. If your cluster uses a resource manager (such as SLURM),
providing a hostfile is not necessary:

> mpirun -np 4 my_mpi_application

If you use rsh/ssh to launch applications, they must be set up to NOT prompt
for a password (see http://www.open-mpi.org/faq/?category=rsh for more
details on this topic). Moreover, you need to provide a hostfile containing a
list of hosts to run on.

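One common way to satisfy the password-less requirement is an ssh keypair
with an empty passphrase, assuming $HOME is shared across the hosts (a sketch
only; otherwise copy the public key to each host, and consult your site's
security policies and the FAQ above):

$ ssh-keygen -t rsa                                # accept defaults, empty passphrase
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 600 ~/.ssh/authorized_keys
$ ssh host1.example.com hostname                   # should not prompt for a password
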
Hostfile example:

> cat hostfile
host1.example.com
host2.example.com
host3.example.com
host4.example.com

> mpirun -np 4 -hostfile hostfile my_mpi_application
(application runs on all 4 hosts)

In the following examples, replace <number of hosts> with the number of hosts
to run on, and <hostfile> with the filename of a valid hostfile listing the
hosts to run on (unless you are running under a supported resource manager,
in which case a hostfile is unnecessary).

Also note that Open MPI is highly run-time tunable. There are many options
that can be tuned to obtain optimal performance of your MPI applications (see
the Open MPI web site / FAQ for more information:
http://www.open-mpi.org/faq/).

It is worth noting that the "mpi_leave_pinned" run-time tunable parameter is
usually *very* good for running benchmarks, but can actually be detrimental
to real-world MPI applications -- and is therefore disabled by default. When
running the benchmarks listed below, it is advisable to enable the
"mpi_leave_pinned" option in order to see maximum performance (*).

Example 1: Running the OSU bandwidth:

> cd /usr/mpi/gcc/openmpi-1.2.8/tests/osu_benchmarks-3.0
> mpirun -np <number of hosts> --mca mpi_leave_pinned 1 -hostfile <hostfile> osu_bw

Example 2: Running the Intel MPI Benchmark benchmarks:

> cd /usr/mpi/gcc/openmpi-1.2.8/tests/IMB-3.1
> mpirun -np <number of hosts> --mca mpi_leave_pinned 1 -hostfile <hostfile> IMB-MPI1

Example 3: Running the Presta benchmarks:

> cd /usr/mpi/gcc/openmpi-1.2.8/tests/presta-1.4.0
> mpirun -np <number of hosts> --mca mpi_leave_pinned 1 -hostfile <hostfile> com -o 100

(*) The "mpi_leave_pinned" option can increase bandwidth and decrease latency
for applications that repeatedly send and/or receive from the same buffers.
If your application does not repeatedly send/receive from the same buffers,
mpi_leave_pinned will likely have little effect on your performance.

3.5 More Open MPI Information
-----------------------------
Much, much more information is available about using and tuning Open MPI
(including OpenFabrics-specific tunable parameters) on the Open MPI web site
FAQ:

    http://www.open-mpi.org/faq/

Users who cannot find the answers that they are looking for, or who are
experiencing specific problems, should consult the "how to get help" web page
for more information:

    http://www.open-mpi.org/community/help/

===============================================================================
4. MVAPICH2 MPI
===============================================================================
MVAPICH2 is an MPI-2 implementation which includes all MPI-1 features. It is
based on MPICH2 and MVICH. MVAPICH2 provides many features including
fault-tolerance with checkpoint-restart, RDMA_CM support, iWARP support,
optimized collectives, on-demand connection management, multi-core optimized
and scalable shared memory support, and memory hook with ptmalloc2 library
support.

The ADI-3-level design of MVAPICH2 supports many features including: MPI-2
functionalities (one-sided, collectives and data-type), multi-threading and
all MPI-1 functionalities. It also supports a wide range of platforms
(architecture, OS, compilers, InfiniBand adapters and iWARP adapters). More
information can be found on the MVAPICH2 project site:

    http://mvapich.cse.ohio-state.edu/overview/mvapich2/

A valid Fortran compiler must be present in order to build the MVAPICH2 MPI
stack and tests.

The following compilers are supported by OFED's MVAPICH2 MPI package: gcc,
intel, pgi, and pathscale. The install script prompts the user to choose the
compiler with which to build the MVAPICH2 MPI RPM.

Note that more than one compiler can be selected simultaneously, if desired.

The install script prompts for various MVAPICH2 build options as detailed
below:

- Implementation (OFA or uDAPL) [default "OFA"]
  - OFA (IB and iWARP) Options:
    - ROMIO Support [default Y]
    - Shared Library Support [default Y]
    - Checkpoint-Restart Support [default N]
      * requires an installation of BLCR and prompts for the BLCR
        installation directory location
  - uDAPL Options:
    - ROMIO Support [default Y]
    - Shared Library Support [default Y]
    - Cluster Size [default "Small"]
    - I/O Bus [default "PCI-Express"]
    - Link Speed [default "SDR"]
    - Default DAPL Provider [default ""]
      * the default provider is determined based on the detected OS

For non-interactive builds where no MVAPICH2 build options are stored in the
OFED configuration file, the default settings are:

    Implementation:             OFA
    ROMIO Support:              Y
    Shared Library Support:     Y
    Checkpoint-Restart Support: N

4.1 Setting up for MVAPICH2
---------------------------
Selecting to use MVAPICH2 via the MPI selector tools will perform most of the
setup necessary to build and run MPI applications with MVAPICH2. If one does
not wish to use the MPI selector tools, using the following settings should
be enough:

- add <MVAPICH2_INSTALL_DIR>/bin to PATH

<MVAPICH2_INSTALL_DIR> above is the directory where the desired MVAPICH2
instance was installed ("instance" refers to the path based on the RPM
package name, including the compiler chosen during the install).

It is also possible to source the following files in order to set up the
proper environment:

    source <MVAPICH2_INSTALL_DIR>/bin/mpivars.sh   [for Bourne based shells]
    source <MVAPICH2_INSTALL_DIR>/bin/mpivars.csh  [for C based shells]

In addition to the user environment settings handled by the MPI selector
tools, some other system settings might need to be modified. MVAPICH2
requires the memlock resource limit to be modified from the default in
/etc/security/limits.conf:

    * soft memlock unlimited

MVAPICH2 requires bidirectional rsh or ssh without a password to work. The
default is ssh, and in this case it will be required to add the following
line to the /etc/init.d/sshd script before sshd is started:

    ulimit -l unlimited

It is also possible to specify a specific size in kilobytes instead of
unlimited, if desired.

The MVAPICH2 OFA build requires an /etc/mv2.conf file specifying the IP
address of an InfiniBand HCA (IPoIB) for RDMA-CM functionality, or the IP
address of an iWARP adapter for iWARP functionality, if either of those is
desired. This is not required by default, unless either of the following
runtime environment variables is set when using the OFA MVAPICH2 build:

    RDMA-CM
    -------
    MV2_USE_RDMA_CM=1

    iWARP
    -----
    MV2_USE_IWARP_MODE=1

Otherwise, the OFA build will work without an /etc/mv2.conf file using only
the InfiniBand HCA directly.

The MVAPICH2 uDAPL build requires an /etc/dat.conf file specifying the DAPL
provider information. The default DAPL provider is chosen at build time, with
a default value of "ib0"; however, it can also be specified at runtime by
setting the following environment variable:

    MV2_DEFAULT_DAPL_PROVIDER=<DAPL provider name>

More information about MVAPICH2 can be found in the MVAPICH2 User Guide:

    http://mvapich.cse.ohio-state.edu/support/

4.2 Compiling MVAPICH2 Applications
-----------------------------------
The MVAPICH2 compiler commands for each language are:

    Language      Compiler Command
    ----------    ----------------
    C             mpicc
    C++           mpicxx
    Fortran 77    mpif77
    Fortran 90    mpif90

The system compiler commands should not be used directly. The Fortran 90
compiler command only exists if a Fortran 90 compiler was used during the
build process.

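For example, after sourcing mpivars.sh for a gcc-built MVAPICH2 instance
under the default prefix (application file names are illustrative),
applications are compiled with the wrapper commands:

    $ source /usr/mpi/gcc/mvapich2-1.2p1/bin/mpivars.sh
    $ mpicc  -o my_mpi_app  my_mpi_app.c      # C application
    $ mpicxx -o my_cpp_app  my_cpp_app.cpp    # C++ application
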
4.3 Running MVAPICH2 Applications
---------------------------------

4.3.1 Running MVAPICH2 Applications with mpirun_rsh
---------------------------------------------------
From release 1.2, MVAPICH2 comes with a faster and more scalable startup
based on mpirun_rsh. To launch an MPI job with mpirun_rsh, password-less ssh
needs to be enabled across all nodes.

Note: ssh will be used by default. In order to use rsh, use the -rsh option
on the mpirun_rsh command line. For more options, see mpirun_rsh -help or the
MVAPICH2 user guide.

*** Running 4 processes on 4 nodes ***

$ cat > hostfile
node1
node2
node3
node4
$ mpirun_rsh -np 4 -hostfile hostfile /path/to/my_mpi_app

*** Running OSU tests ***

/usr/mpi/gcc/mvapich2-1.2p1/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich2-1.2p1/tests/osu_benchmarks-3.0/osu_bw
/usr/mpi/gcc/mvapich2-1.2p1/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich2-1.2p1/tests/osu_benchmarks-3.0/osu_latency
/usr/mpi/gcc/mvapich2-1.2p1/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich2-1.2p1/tests/osu_benchmarks-3.0/osu_bibw
/usr/mpi/gcc/mvapich2-1.2p1/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich2-1.2p1/tests/osu_benchmarks-3.0/osu_bcast

*** Running Intel MPI Benchmark test (Full test) ***

/usr/mpi/gcc/mvapich2-1.2p1/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich2-1.2p1/tests/IMB-3.1/IMB-MPI1

*** Running Presta test ***

/usr/mpi/gcc/mvapich2-1.2p1/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich2-1.2p1/tests/presta-1.4.0/com -o 100
/usr/mpi/gcc/mvapich2-1.2p1/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich2-1.2p1/tests/presta-1.4.0/glob -o 100
/usr/mpi/gcc/mvapich2-1.2p1/bin/mpirun_rsh -np 2 -hostfile /root/cluster /usr/mpi/gcc/mvapich2-1.2p1/tests/presta-1.4.0/globalop

4.3.2 Running MVAPICH2 Applications with mpd and mpiexec
--------------------------------------------------------
Launching processes in MVAPICH2 is a two-step process. First, mpdboot must be
used to launch MPD daemons on the desired hosts. Second, the mpiexec command
is used to launch the processes.

MVAPICH2 requires bidirectional ssh or rsh without a password. The choice is
specified when the MPD daemons are launched with the mpdboot command through
the --rsh command line option. The default is ssh.

Once the processes are finished, the MPD daemons should be stopped with the
mpdallexit command. The following example shows the basic procedure:

4 Processes on 4 Hosts Example:

$ cat >hostsfile
node1.example.com
node2.example.com
node3.example.com
node4.example.com
$ mpdboot -n 4 -f ./hostsfile
$ mpiexec -n 4 ./my_mpi_application
$ mpdallexit

It is also possible to use the mpirun command in place of mpiexec. They are
actually the same command in MVAPICH2; however, using mpiexec is preferred.
It is possible to run more processes than hosts. In this case, multiple
processes will run on some or all of the hosts used.

The following examples demonstrate how to run the MPI tests. The default
installation prefix and gcc version of MVAPICH2 are shown. In each case, it
is assumed that a hosts file listing two hosts has been created in the
relevant test directory, for example as sketched below.

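A two-host hosts file can be created in a test directory as follows (host
names are illustrative; use nodes from your cluster and end the input with
Ctrl-D):

$ cd /usr/mpi/gcc/mvapich2-1.2p1/tests/osu_benchmarks-3.0
$ cat > hosts
node1.example.com
node2.example.com
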
OSU Tests Example:

$ cd /usr/mpi/gcc/mvapich2-1.2p1/tests/osu_benchmarks-3.0
$ mpdboot -n 2 -f ./hosts
$ mpiexec -n 2 ./osu_bcast
$ mpiexec -n 2 ./osu_bibw
$ mpiexec -n 2 ./osu_bw
$ mpiexec -n 2 ./osu_latency
$ mpdallexit

Intel MPI Benchmark Example:

$ cd /usr/mpi/gcc/mvapich2-1.2p1/tests/IMB-3.1
$ mpdboot -n 2 -f ./hosts
$ mpiexec -n 2 ./IMB-MPI1
$ mpdallexit

Presta Benchmarks Example:

$ cd /usr/mpi/gcc/mvapich2-1.2p1/tests/presta-1.4.0
$ mpdboot -n 2 -f ./hosts
$ mpiexec -n 2 ./com -o 100
$ mpiexec -n 2 ./glob -o 100
$ mpiexec -n 2 ./globalop
$ mpdallexit
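
If a run is interrupted and the MPD ring is left up, the mpdtrace command
(part of the same MPD tool set as mpdboot and mpdallexit) can be used to
check which hosts still have daemons before shutting the ring down. A sketch;
the output shown is illustrative:

$ mpdtrace          # lists hosts currently in the MPD ring
node1
node2
$ mpdallexit        # shut the ring down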