Open Fabrics Enterprise Distribution (OFED) CHELSIO T3 RNIC RELEASE NOTES December 2008 The iw_cxgb3 and cxgb3 modules provide RDMA and NIC support for the Chelsio S310/320 and R310/320 series adapters. Make sure you choose the 'cxgb3' and 'libcxgb3' options when generating your ofed-1.4 rpms. ============================================ New for ofed-1.4 ============================================ - 7.0 Firmware support. See below for more information on updating your RNIC to the latest firmware. - Memory Managment Extensions including: - Fast register memory regions - Invalidate local memory region work request - Zero stag support via the local DMA lkey field - Read with invalidate local stag work request - RDS bcopy mode enabled for iWARP devices ============================================ Recent Enhancements ============================================ - Various MPI libraries are enabled via a new iw_cxgb3 module option called peer2peer. When loading iw_cxgb3, set peer2peer=1 to enable Intel MPI version 3.1.038, HP MPI version 2.02.05.01, OpenMPI (will be released with OpenMPI-1.3), and Scali MPI (will be available in version 3.13.7). This option must be set on all systems in your cluster. See more info below on running these MPIs. NOTE: None of these MPIs are included in the ofed-1.4 release. Contact the specific vendors for obtaining the MPI code. Open MPI can be pulled from www.open-mpi.org. - Large memory registration. User applications can now register > 30MB memory regions. ============================================ Enabling Various MPIs ============================================ For OpenMPI, Intel MPI, HP MPI, and Scali MPI: you must set the iw_cxgb3 module option peer2peer=1 on all systems. This can be done by writing to the /sys/module file system during boot. EG: # echo 1 > /sys/module/iw_cxgb3/parameters/peer2peer Or you can add the following line to /etc/modprobe.conf to set the option at module load time: options iw_cxgb3 peer2peer=1 For Intel MPI, HP MPI, and Scali MPI: Enable the chelsio device by adding an entry to /etc/dat.conf for the chelsio interface. For instance, if your chelsio interface name is eth2, then the following line adds a DAT device named "chelsio" for that interface: chelsio u1.2 nonthreadsafe default libdaplcma.so.1 dapl.1.2 "eth2 0" "" ============= Intel MPI: ============= The following env vars enable Intel MPI version 3.1.038. Place these in your user env after installing and setting up Intel MPI: export RSH=ssh export DAPL_MAX_INLINE=64 export I_MPI_DEVICE=rdssm:chelsio export MPIEXEC_TIMEOUT=180 export MPI_BIT_MODE=64 Note: I_MPI_DEVICE=rdssm:chelsio assumes you have an entry in /etc/dat.conf named "chelsio". Contact Intel for obtaining their MPI with DAPL support. ============= HP MPI: ============= To run HP MPI applications, use these mpirun options: -prot -e DAPL_MAX_INLINE=64 -UDAPL EG: $ mpirun -prot -e DAPL_MAX_INLINE=64 -UDAPL -hostlist r1-iw,r2-iw ~/tests/presta-1.4.0/glob Where r1-iw and r2-iw are hostnames mapping to the chelsio interfaces. Also this assumes your first entry in /etc/dat.conf is for the chelsio device. Contact HP for obtaining their MPI with DAPL support. ============= Scali MPI: ============= The following env vars enable Scali MPI. Place these in your user env after installing and setting up Scali MPI for running over Infiniband: export DAPL_MAX_INLINE=64 export SCAMPI_NETWORKS=chelsio export SCAMPI_CHANNEL_ENTRY_COUNT="chelsio:128" Note: SCAMPI_NETWORKS=chelsio assumes you have an entry in /etc/dat.conf named "chelsio". Contact Scali for obtaining their MPI with DAPL support. ============= OpenMPI: ============= OpenMPI iWARP support is only available in OpenMPI version 1.3 or greater. Open MPI will work without any specific configuration via the openib btl. Users wishing to performance tune the configurable options may wish to inspect the receive queue values. Those can be found in the "Chelsio T3" section of mca-btl-openib-hca-params.ini. ============================================ Loadable Module options: ============================================ The following options can be used when loading the iw_cxgb3 module to tune the iWARP driver: cong_flavor - set the congestion control algorithm. Default is 1. 0 == Reno 1 == Tahoe 2 == NewReno 3 == HighSpeed snd_win - set the TCP send window in bytes. Default is 32kB. rcv_win - set the TCP receive window in bytes. Default is 256kB. crc_enabled - set whether MPA CRC should be negotiated. Default is 1. markers_enabled - set whether to request receiving MPA markers. Default is 0; do not request to receive markers. NOTE: The Chelsio RNIC fully supports markers, but the current OFA RDMA-CM doesn't provide an API for requesting either markers or crc to be negotiated. Thus this functionality is provided via module parameters. mpa_rev - set the MPA revision to be used. Default is 1, which is spec compliant. Set to 0 to connect with the Ammasso 1100 rnic. ep_timeout_secs - set the number of seconds for timing out MPA start up negotiation and normal close. Default is 60. peer2peer - Enables connection setup changes to allow peer2peer applications to work over chelsio rnics. This enables the following applications: Intel MPI HP MPI Open MPI Scali MPI Set peer2peer=1 on all systems to enable these applications. The following options can be used when loading the cxgb3 module to tune the NIC driver: msi - whether to use MSI or MSI-X. Default is 2. 0 = only pin 1 = only MSI or pin 2 = use MSI/X, MSI, or pin, based on system ============================================ Updating Firmware: ============================================ This release requires firmware version 7.x, and Protocol SRAM version 1.1.x. This firmware can be downloaded from http://service.chelsio.com. If your distro/kernel supports firmware loading, you can place the chelsio firmware and psram images in /lib/firmware, then unload and reload the cxgb3 module to get the new images loaded. If this does not work, then you can load the firmware images manually: Obtain the cxgbtool tool and the update_eeprom.sh script from Chelsio. To build cxgbtool: # cd # make && make install Then load the cxgb3 driver: # modprobe cxgb3 Now note the ethernet interface name for the T3 device. This can be done by typing 'ifconfig -a' and noting the interface name for the interface with a HW address that begins with "00:07:43". Then load the new firmware and eeprom file: # cxgbtool ethxx loadfw # update_eeprom.sh ethxx # reboot ============================================ Testing connectivity with ping and rping: ============================================ Configure the ethernet interfaces for your cxgb3 device. After you modprobe iw_cxgb3 you will see one or two ethernet interfaces for the T3 device. Configure them with an appropriate ip address, netmask, etc. You can use the Linux ping command to test basic connectivity via the T3 interface. To test RDMA, use the rping command that is included in the librdmacm-utils rpm: On the server machine: # rping -s -a 0.0.0.0 -p 9999 On the client machine: # rping -c -VvC10 -a server_ip_addr -p 9999 You should see ping data like this on the client: ping data: rdma-ping-0: ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqr ping data: rdma-ping-1: BCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrs ping data: rdma-ping-2: CDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrst ping data: rdma-ping-3: DEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstu ping data: rdma-ping-4: EFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuv ping data: rdma-ping-5: FGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvw ping data: rdma-ping-6: GHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwx ping data: rdma-ping-7: HIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxy ping data: rdma-ping-8: IJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz ping data: rdma-ping-9: JKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzA client DISCONNECT EVENT... # ============================================ Addition Notes and Issues ============================================ 1) To run uDAPL over the chelsio device, you must export this environment variable: export DAPL_MAX_INLINE=64 2) If you have a multi-homed host and the physical ethernet networks are bridged, or if you have multiple chelsio rnics in the system, then you need to configure arp to only send replies on the interface with the target ip address: sysctl -w net.ipv4.conf.all.arp_ignore=2 3) If you are building OFED against a kernel.org kernel later than 2.6.20, then make sure your kernel is configured with the cxgb3 and iw_cxgb3 modules enabled. This forces the kernel to pull in the genalloc allocator, which is required for the OFED iw_cxgb3 module. Make sure these config options are included in your .config file: CONFIG_CHELSIO_T3=m CONFIG_INFINIBAND_CXGB=m 4) If you run the RDMA latency test using the ib_rdma_lat program, make sure you use the following command lines to limit the amount of inline data to 64: server: ib_rdma_lat -c -I 64 client: ib_rdma_lat -c -I 64 server_ip_addr