
J.13 NetWare 4.10 SFT-III (System Fault Tolerance III)

For sites that need very high uptime, Novell developed SFT-III. SFT-III is an extension of the fault tolerance that NetWare provides out of the box. By default each server is equipped with SFT-I (the hot fix). If needed, this can be extended to SFT-II by setting up disk mirroring or duplexing, which protects against hard disk failures. SFT-III is server mirroring, which protects against server hardware failure.

Because SFT-III is an extension, you have to order it separately. Licences are available in two forms: for servers with up to 100 users and for servers with more than 100 users. SFT-III can be installed during the initial installation, or an existing NetWare 4.10 server can be upgraded to the SFT-III level.

SFT-III does not support the ability to share the load across the two servers (yet). Plans have been made by Novell to support this as well, but they have indicated this will not be available in the upcoming Green River release. It is planned for the next release after Green River. That version will allow disk clustering like Digital's VAX.

J.13.1 Considerations

Before ordering or installing SFT-III there are some general considerations to work through. First, try to figure out whether your planned configuration will work with SFT-III. Pay special attention to the backup solution, the UPS, the MSL (Mirrored Server Link) card and, if needed, the network management software. Also make sure that the hardware in both machines is identical. This may sound obvious, but I mean the revision numbers as well. Sites that upgrade an existing NW 4.10 server to the SFT-III level and have to order the additional hardware could end up with the same network card (e.g. NE3200) but at a newer revision level. It is a good idea to have the revision levels (and BIOS dates etc.) identical on both machines.

J.13.1.1 Backup Considerations

Making a backup of a SFT-III server is possible, but not straightforward. Keep in mind that a backup device is part of the hardware. Therefore, if the backup device is attached to the primary server (the one of the two machines, also referred to as IO_Engines, that is currently acting as the server) and that machine fails, you will not have a backup device until it is back on line.

The second thing to keep in mind is that the backup unit is connected to an IO_Engine, and many backup products address the hardware directly. In that case the backup software has to be loaded in the IO_Engine, and not every product supports this. Loading the software in the MS_Engine (the "server" part that is protected) can result in an error because no backup device is found.

At this time I am aware of only two products that can run on a SFT-III server: Novell's Sbackup (very slow) and ARCserve 5.01g (from Cheyenne). Other products (even ARCserve 6.0) do not run on SFT-III at this moment.

J.13.1.2 UPS Considerations

It is a good idea to give each IO_Engine its own UPS (connected to different fuses). Otherwise the whole SFT-III server would go down if a fuse blows. UPS management software like Powerchute communicates with the UPS through a comm port, and this is another example of software that cannot run in the IO_Engine. At this moment it is recommended to use the Novell UPS monitoring board with the Novell UPS.NLM. This NLM can run in the IO_Engine and needs the board to communicate with the UPS.

J.13.1.3 MSL Considerations

Novell published a list of certified MSLs (as of 11/94) in TID21974 on 09JAN95 (I am unable to locate this anymore). It is a good idea to use a certified MSL card. Novell also recommends assigning the MSL card the highest-priority interrupt available, 10 being ideal. Try to avoid interrupts 2/9 and 15 if possible: interrupt 9 cascades to interrupt 2, and NetWare reserves interrupt 15 for lost hardware interrupts.
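
To see why interrupt 10 comes out as the preferred choice, the following Python sketch ranks interrupt lines by the standard priority order of the two cascaded 8259 interrupt controllers in an AT-class PC (lines 8-15 of the second controller are served through IRQ 2 of the first, so they rank between IRQ 1 and IRQ 3). The function name and the list of free interrupts are assumptions made for illustration only; check what is actually free in your own machines.

```python
# Priority order of hardware interrupts on an AT-class PC with two
# cascaded 8259 controllers: the slave (IRQ 8-15) hangs off IRQ 2 of
# the master, so its lines rank directly after IRQ 1.
AT_PRIORITY_ORDER = [0, 1, 8, 9, 10, 11, 12, 13, 14, 15, 3, 4, 5, 6, 7]

# Interrupts the FAQ advises against for the MSL card.
AVOID = {2, 9, 15}

def best_msl_irq(free_irqs):
    """Return the highest-priority free IRQ that is not on the avoid list."""
    candidates = [irq for irq in free_irqs if irq not in AVOID]
    if not candidates:
        raise ValueError("no suitable interrupt is free")
    return min(candidates, key=AT_PRIORITY_ORDER.index)

# Hypothetical set of interrupts still free in a server.
free = [3, 5, 9, 10, 11, 15]
print(best_msl_irq(free))
```

With 2/9 and 15 excluded, interrupt 10 is the highest-priority line normally available to an adapter card, which matches Novell's advice.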

Also consult the NSEpro (if you have access to it) for known problems with the selected MSL card. That is how I found out that the NMSL card we had selected for our Compaq Proliant was not a good choice. In general you should use a high-speed link for the MSL (either fiber or 100Base-TX). The advantage of a fiber MSL link is that you can place the secondary server far away, provided it is also connected to the network.

J.13.1.4 Network Management Considerations

Many network management utilities (e.g. Frye) try to communicate with the server hardware and also try to examine server statistics. Because SFT-III splits these two between different engines, it is not always possible to use these utilities. A network management program should be capable of communicating with the hardware (LAN adapter, processor speed etc.) in the IO_Engine and getting the statistics (# of connected users, buffers, utilisation etc.) from the MS_Engine. The only utility that I am aware of at this moment that can do this is Novell's ManageWise (release 2.0 and higher).

J.13.2 SFT-III and RAID 5

RAID 5 is a disk fault tolerance technology, like mirroring and duplexing. What you need to keep in mind is the speed difference between these solutions. I have no practical experience, but looking at the technology I would say duplexing works fastest (less CPU overhead). Second would be the RAID 5 solution and slowest would be mirroring (assuming you use the same controller for the mirrored drives; otherwise it is duplexing). These options provide no fault tolerance for hardware failures other than hard disk failures.

These solutions can be combined with SFT-III, but personally I would say it is overkill. If a disk fails and you have not implemented any of these techniques, the primary IO_Engine will fail and the secondary will take over. You can then swap the hard drive and bring the IO_Engine up again. Recreating the NetWare segments and the mirrored pairs of drives will do the job.

J.13.3 Will SFT-III work on NetWare 3.12 ?

As far as I know SFT-III was only available for NetWare 3.11. Anyone that wants a 3.12 SFT-III server should investigate this. It could be that you can only get a 4.x version (all registered 3.1x SFT-III users were upgraded to the 4.x level by Novell).

J.13.4 Will NetWare Connect work on SFT-III ?

No, NetWare Connect won't run on SFT-III. NetWare Connect uses modems, which are connected to a comm port. The comm port is part of the I/O of a fileserver. That means the modems are connected to the IO_Engine. If the IO_Engine fails, the secondary will take over. Because your modems will be connected to the other IO_Engine these sessions can't be taken over at the time of the switch over. In other words all users logged in by NetWare Connect would lose their connection. This is part of the reason why NetWare Connect won't run on a SFT-III machine.

J.13.5 ARCserve 5.01g and SFT-III configuration

ARCserve requires a special configuration in order to run on a SFT-III machine. See section E of Cheyenne's ARCserve release notes, "Configuring ARCserve to run on NetWare 4.1 SFT III". There you will find that you have to load 2 additional files in the IO_Engine that do not come with ARCserve or NW 4.1. The file IODAI40.NLM is a Novell file (it can be found on many BBS's and the Internet). The file ARC_SFT3.NLM can be obtained from Cheyenne.

Some time ago there was a message that a NW 4.10 server with all the latest patches (Libup8, 410pt3 & landr5) could crash during a remote server backup with ARCserve. Cheyenne has a patch for it. Keep in mind that ARCserve on SFT-III actually performs a remote server backup.

One other issue. We installed the manager on our server and when we ran the Windows manager it looked fine until we tried to work with the databases. It turned out that this was caused by the way ARCserve defines the location of the program. By default it installs the manager in:

\\SERVERNAME\VOL_NAME\ARCSERVE\MANAGER\arcserve.exe

After changing this to:

F:\ARCSERVE\MANAGER\arcserve.exe

and changing the working directory correspondingly, it all worked fine.

J.13.6 TCP/IP and SFT-III configuration

In order to run TCP/IP on a SFT-III server you will have to set up a separate sub-net for your MS_Engine. The IO_Engines must be configured to act as a router. The MS_Engine will act as an end node. Note that both IO_Engines communicate with THE SAME IP address.

		TCP/IP Configuration Example

	      +-------------------------------+
	      |  Mirrored Server - MS Engine  |
	      |     193.67.129.200            |
	      +-------------------------------+
				|
				|
				|            Virtual LAN
	----------------------------------------------
				|
	   (IO Engine-MS Engine | Internal Interface)

	      |------------193.67.129.201-----------|
     +-----------------+                 +-----------------+
     |  IO Engine 1    |                 | IO Engine 2     |
     |  193.67.129.131 |                 | 193.67.129.132  |
     +-----------------+                 +-----------------+
	      |                                   |
	      |             Real Network          |
	------------------------------------------------
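
The addresses in the diagram can be checked with a short Python sketch. The FAQ does not state a subnet mask, so the /26 prefix (255.255.255.192) used here is purely an assumption, picked because it separates the example addresses into two subnets; the variable and function names are illustrative as well.

```python
import ipaddress

# ASSUMPTION: the FAQ does not give a subnet mask; /26 is chosen here
# only because it separates the example addresses into two subnets.
PREFIX = 26

def subnet_of(address, prefix=PREFIX):
    """Return the network that an address belongs to for the given prefix."""
    return ipaddress.ip_interface(f"{address}/{prefix}").network

ms_engine = "193.67.129.200"     # end node on the virtual LAN
io_internal = "193.67.129.201"   # shared IO_Engine address on the virtual LAN
io_engine_1 = "193.67.129.131"   # IO_Engine 1 on the real network
io_engine_2 = "193.67.129.132"   # IO_Engine 2 on the real network

virtual_lan = subnet_of(ms_engine)
real_net = subnet_of(io_engine_1)

# The MS_Engine and the internal interface must share one subnet ...
assert subnet_of(io_internal) == virtual_lan
# ... both IO_Engines must sit on the real network ...
assert subnet_of(io_engine_2) == real_net
# ... and the two subnets must differ so the IO_Engines have something to route.
assert virtual_lan != real_net

print("virtual LAN: ", virtual_lan)
print("real network:", real_net)
```

Under that assumed mask the MS_Engine and the internal interface fall in one subnet and the two IO_Engines in another, which is the layout the text describes.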

J.13.7 SFT-III Engines swapping

It is not normal for a SFT-III server to regularly switch primary and secondary engines. If this happens, try looking at the file io$log.err in the system directory. It records the problems that occur when the primary engine and the secondary engine swap. Perhaps there is a hint in there. Look carefully at what happened just before the switchover. If that doesn't help you could try running conlog.nlm. Conlog can be loaded in all three engines, and with the option FILE=SYS:\SYSTEM\MSLOG.TXT (etc.) you can specify a different output file for each engine. If the switchover happens again you can have a look at the output file(s) to see if something strange happened.
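
When such an output file has been copied to a workstation, a small script can pull out the lines leading up to each switchover. This is only a sketch: the marker text and the sample lines are hypothetical, because the real wording of the console messages is not given here, so adjust the marker to whatever your own logs contain.

```python
from collections import deque

def lines_before(lines, marker, context=5):
    """Yield (preceding_lines, matching_line) for each line containing marker."""
    recent = deque(maxlen=context)   # rolling window of the last few lines
    for line in lines:
        line = line.rstrip("\n")
        if marker.lower() in line.lower():
            yield list(recent), line
        recent.append(line)

# Hypothetical sample; real console logs will look different.
sample = [
    "Loading module EXAMPLE.NLM",
    "MSL link error detected",
    "Secondary server is now primary",
    "Volume SYS mounted",
]
for before, hit in lines_before(sample, "now primary", context=2):
    print("switchover:", hit)
    for earlier in before:
        print("   before:", earlier)
```

To use it on a real file, pass an open file object instead of the sample list, for example `lines_before(open("MSLOG.TXT"), "your marker")`.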

J.13.8 MS_Engines produced a different output

The "MS_Engine produced different outputs" abend is a very difficult problem to troubleshoot. It is most important to understand the way SFT-III is designed. SFT-III is designed to be hardware fault tolerant and has no added ability to protect itself from a software bug. However, if you have a system with this problem, only the secondary machine should be affected unless SET parameters are set to halt both machines.

SFT-III Architecture

SFT-III was designed so that the "event queue" in each machine, primary and secondary, receives and then processes the same events independently. Every instruction that the processor executes comes from this event queue. For example, if a request is generated for a file from the disk, the request is put on the event queue of the MS_Engine. An identical request is then sent across the MSL to the other machine and placed on the event queue of that machine. When the event has been processed, one last consistency check is made, comparing the results of the MS_Engine from each machine. Note that only one MS_Engine is presented to the user, even though each machine has processed the request independently of the other. If the results of both machines are identical, the data is sent to the IO_Engine, assembled into packets, and sent out the LAN channel. If the results differ, you get the Abend: MS_Engine Produced Different Outputs.
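
The idea of feeding identical events to two engines and comparing their results can be illustrated with a toy model in Python. This is a sketch of the principle only, not Novell's implementation; all class and function names are made up for the example.

```python
from collections import deque

class Engine:
    """One copy of the mirrored engine with its own event queue and state."""
    def __init__(self, handler):
        self.queue = deque()
        self.state = {}
        self.handler = handler

    def process_next(self):
        event = self.queue.popleft()
        return self.handler(self.state, event)

class DifferentOutputs(Exception):
    """Stands in for the 'MS_Engine Produced Different Outputs' abend."""

def run_mirrored(primary, secondary, events):
    """Feed identical events to both engines and compare every result."""
    outputs = []
    for event in events:
        primary.queue.append(event)      # event queued on the primary
        secondary.queue.append(event)    # same event sent across the "MSL"
        a = primary.process_next()
        b = secondary.process_next()
        if a != b:
            raise DifferentOutputs(f"{event!r}: {a!r} != {b!r}")
        outputs.append(a)                # only now handed to the IO side
    return outputs

def counting_handler(state, event):
    """Deterministic handler: the result depends only on state and event."""
    state[event] = state.get(event, 0) + 1
    return (event, state[event])

events = ["open", "read", "read"]
print(run_mirrored(Engine(counting_handler), Engine(counting_handler), events))
```

As long as the handler is deterministic, both copies stay in step. A handler whose result depends on anything that differs between the machines (for instance a memory address or an uninitialised value) would trip the comparison and raise `DifferentOutputs`.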

Troubleshooting (according to Novell):

The most likely cause of the "Different Outputs" Abend is either that one machine has traversed a slightly different code path than the other, or that the NLM that is running has encountered a variable in the code that has not been initialised. The value of an uninitialised variable is effectively random, which increases the likelihood that the MS_Engines will produce different results.

The "Different Outputs" problem is NOT a hardware problem, and it is not an MSL, LAN, or DISK problem. Use the following questions and objectives to aid in identifying the NLM that is causing the Abend.

Questions:

  - What modules are being loaded in the MS_Engine?

  - Is there a sequence of events which will cause the server to Abend?

  - Can the Abend be reproduced using this sequence of events?

  - Can a specific NLM be singled out as the cause of the Abend?

Objectives:

  - Stabilise the customer's environment.

  - Modify one system item at a time.

  - Reproduce the problem in a non-production environment.

  - Trace the problem.

  - Correct the Module.

Troubleshooting (based on experience):

This problem can also be caused by external devices. In our configuration it was caused by an FDDI hub: one IO_Engine was attached by FDDI to a SAS port and the other IO_Engine to a DAS port. Another thing to check is whether the MSL link is functioning correctly. Is the speed of your MSL link equal to or higher than that of the LAN link? (Using a slow MSL link with a fast LAN link is not a good idea.) Also check your interrupt settings. Does your MSL card use a higher-priority interrupt than the other adapters? Last: Novell recommends NOT using PCI adapters from different manufacturers. Try to use only one PCI card, or none at all if possible.

J.13.9 Additional information

http://netware.novell.com/discover/ssnwsft.htm

http://netware.novell.com/database/docs/wpdb20.htm

http://netware.novell.com/discover/can4reli.htm

Another good source is the Novell manuals. All SFT-related instructions are in the normal manuals/DynaText, particularly Chapter 5 of the installation manual and Appendix C of the Supervising the Network manual. If you have access to it, you could have a look at the NSEpro. It contains several documents relating to SFT-III.

J.13.10 Other products of interest: Vinca StandbyServer

How StandbyServer (2.0) for NetWare Works

Vinca uses a second server as an automatic standby in case the primary machine fails. Data from the primary machine is mirrored to the standby machine using standard IPX protocols. StandbyServer 2.0 uses real-time NetWare disk mirroring to keep exact copies of all system data on both the primary and standby machines. Since StandbyServer uses standard IPX connections to transfer data, a dedicated link is not required, but it is recommended. Any standard IPX board and driver combination can be used as the Vinca link. The data can then be routed, bridged, or carried over a shared, high-speed backbone. The status of both the Vinca link and the network link between the two machines is constantly monitored to ensure that the primary server is operating. These multiple checks avoid any inadvertent switchover to the standby machine. If the primary server has failed, the standby machine automatically takes over the role of the primary server using the same server name, login scripts, bindery or NDS and IPX address as the failed server.
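
The value of monitoring two links can be shown with a minimal Python sketch. It assumes that a switchover is triggered only when the primary is unreachable over both the Vinca link and the network link; that rule is an inference from the description above, not documented Vinca behaviour, and the function name is hypothetical.

```python
def should_switch_over(vinca_link_ok, network_link_ok):
    """Switch only when the primary is unreachable over BOTH paths (assumed rule)."""
    return not vinca_link_ok and not network_link_ok

# One failed link alone is treated as a link problem, not a dead server.
for vinca_ok, lan_ok in [(True, True), (False, True), (True, False), (False, False)]:
    action = "switch over" if should_switch_over(vinca_ok, lan_ok) else "stay on standby"
    print(f"vinca link ok={vinca_ok!s:5} network link ok={lan_ok!s:5} -> {action}")
```

Under this rule a cable fault on a single link does not cause the standby machine to take over while the primary is still serving users.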

Vinca StandbyServer for NetWare has an autoswitch feature: it automatically switches from the halted main server to the standby machine. With the new 32-bit clients from Novell or Microsoft, the client connection is maintained, so users do not have to log in again after NetWare reinitializes the disks. They experience only a momentary pause while the switchover takes place. With older client software, users simply log back into the server with the same name and password they used on the failed server. For more information on Vinca, contact:

http://www.vinca.com

J.14 Mirroring

Let's take a moment to categorize. I would like to define the difference between "on-line" and "near line" redundant systems. An "on-line" system requires no user or administrative intervention to recover from a fault condition. Conversely, a "near line" system requires user and/or administrative intervention (workstation rebooting etc.).

There are a number of products that will "mirror" servers for you. At the top of the list is:

Novell's SFT III (http://www.novell.com) This I categorize as an "on-line redundant system."

Advantages: Total protection from hardware related abends; Seamless server "switch over" (your users will not know that the system has had a failure).

Disadvantages: Identical hardware required; Cannot protect from software abends; NLMs must be specially certified for SFT III.

Next are the products I categorize as "Near Line redundant system."

Lan Integrity (http://www.netint.com)

Advantages: Totally recoverable from hardware and/or software errors. One to many protection. One server can protect many targets. "15 second" recovery time. Backup Solution.

Disadvantages: Users must reboot to reestablish services; Does not protect Printing Services (an additional administrative procedure is required: duplicate queues); Does not support advanced clients (Microsoft's NDS client, Novell 32-bit Client for 95, Novell 32-bit client for DOS/Windows; support expected at some point); Extra tape drives required (DLT etc.).

Vinca/StandBy 32 (http://www.vinca.com)

Advantages: Identical hardware not required; Full NDS, Bindery and advanced client support; "Autoswitch" feature takes care of swapping in the standby server without intervention from the network administrator.

Disadvantage: They claim it will protect against both hardware and software errors. Since there is no seamless "switch over" in either case, this really means that it can recover reliably from either failure. However, if there are corrupted files you will still have to wait for the automated vrepair to run before the users can use the system.

LanShadow for Horizon (Http://www.horizon.com)

Advantage: "Network mirroring tool that provides fast recovery from server failures and assures constant availability of critical network data. Runs as NLM and configured to mirror entire servers, volumes directories or individual files, including open ones, to a designated backup server or open space on another production server. LANshadow doesn't require a dedicated backup platform physically connected to and configured exactly like a production server, nor does it require any dedicated hardware or tape drive. LANshadow supports all NetWare environments including 4.x; also includes support for Macintosh name file space."

Disadvantage: Server "switch over" not automated. NDS support ???

[Thanks to Colin St Rose for this info]

FSMIRROR is a single NLM which runs on each NetWare fileserver. It supports both NetWare 3.12 and NetWare 4.1. Essentially, FSMIRROR polls specified locations on another server and checks whether any files have changed there. If there are changed files, FSMIRROR copies them to the specified location on the server running the copy of FSMIRROR. It will handle up to 16 different replication paths and can overwrite read-only files. Replication times are configurable.

The FSMIRROR.CFG file provides for the saving of configuration information should the NLM be unloaded or the server downed. Logging of the files replicated is also included. An auto-reconnect feature is also available should the server being polled go down at any stage.
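
The polling approach can be sketched in Python as a single replication pass over one path. This is not FSMIRROR's code: the function name and the change test (size or modification time differs) are assumptions made for illustration, and the sketch leaves out the read-only overwrite, logging and auto-reconnect features mentioned above.

```python
import os
import shutil
import tempfile

def replicate_changed(source_dir, target_dir):
    """Copy files that are new or changed in source_dir into target_dir.

    A file counts as changed when its size or modification time differs
    from the copy already present in target_dir. Returns the copied paths.
    """
    copied = []
    for root, _dirs, files in os.walk(source_dir):
        rel = os.path.relpath(root, source_dir)
        dest_root = os.path.normpath(os.path.join(target_dir, rel))
        os.makedirs(dest_root, exist_ok=True)
        for name in files:
            src = os.path.join(root, name)
            dst = os.path.join(dest_root, name)
            s = os.stat(src)
            if os.path.exists(dst):
                d = os.stat(dst)
                if d.st_size == s.st_size and int(d.st_mtime) == int(s.st_mtime):
                    continue              # unchanged since the last pass
            shutil.copy2(src, dst)        # copy2 preserves the timestamp
            copied.append(os.path.relpath(dst, target_dir))
    return copied

# Small demonstration in temporary directories.
with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
    with open(os.path.join(src, "report.txt"), "w") as f:
        f.write("first version")
    print(replicate_changed(src, dst))   # first pass copies the file
    print(replicate_changed(src, dst))   # second pass finds nothing to do
```

Running such a pass on a timer for each configured path gives the general behaviour described: only files that changed since the previous pass are copied.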

As of April 1998, a current copy of FSMIRROR can be found at :

    http://developer.novell.com/devres/nlec/download/download.htm
