High Availability Whitepaper

WHITE PAPER
Product Overview of HATS High Availability
INTRODUCTION
DEFINITION OF SYSTEM AVAILABILITY
HATS HA's Initialization
H.A. TECHNICAL SOLUTIONS HA CUSTOMIZATION
SUPPORTED CONFIGURATIONS
SYSTEM REQUIREMENTS

INTRODUCTION:

Information is an important asset for an enterprise in the 1990s. Today's computer systems must provide reliable and timely information and services to assist personnel in making informed decisions crucial to the daily operations of the modern enterprise.
Despite the rapid evolution in all aspects of computer technology, both the computer hardware and software are prone to numerous failure conditions. Employing methods to minimize exposure to as many failure conditions as possible will significantly increase the use of a company's resources and become a direct indicator of successful business operations.
Enterprises today are demanding their new computer systems be "right-sized" for quicker deployment, cost reductions in ownership, decrease in support and maintenance expenses, and increase in support for both homogeneous and heterogeneous distributed client/server environments. The concepts of "data warehouses" and "replication services" provide the mechanisms to ensure the correct information is always available to the appropriate personnel so they can make the important business decisions in time-critical situations.
This new form of information service must be constantly monitored and tuned to provide reliable and accurate information delivery. Hardware failure of these "central repositories" could prove harmful in today's competitive environment.

DEFINITION OF SYSTEM AVAILABILITY:

Availability of critical information services is affected by both scheduled and unscheduled system downtime. Although scheduled downtime for system maintenance and upgrades is inevitable, it is fatal to information services that are considered non-interruptible. In addition, unscheduled downtime is unpredictable and should be avoided. Human errors, operating system failures, computer hardware failures, and network failures cause most unscheduled downtime.
In order to describe the methods for preventing these failures, we must first understand the definitions of the various levels of availability. They are normal availability, high availability, and fault tolerance.

Normal Availability Systems (NAS):
Normal availability systems are defined as general-purpose computer hardware and software systems that have no hardware redundancy or software enhancement to provide fault-processing recovery. They require manual, human intervention to identify and correct/repair the failed component(s) and restart the system before resuming normal operations.
High Availability System:
High availability systems are defined as loosely coupled NAS with redundant hardware components managed by software that provides fault detection and correction procedures to maximize the availability of the critical services and applications provided by that system. These systems require no manual, human intervention to identify a failed component, execute a procedure to avert a system failure, and report the averted failure. This configuration minimizes the possibility of immediate data loss and service interruption.
High Availability Models:
There are two distinct High Availability Models for client-server architectures. They are the Replicated Services Model and the Failover Model.
Replicated Services Model:
This model utilizes distributed applications and distributed databases on multiple servers in the LAN/WAN environment where the data is replicated to some or all of the servers. When a server failure occurs, the data and applications are accessible from an alternate server.
Failover Model:
This model utilizes duplicate server hardware configurations in which one server has the role of an active server for data and application services, and the other is a backup server that monitors the state of the active server. When the backup server detects a hardware or software failure that has occurred on the active server, it takes over the role and identity of the active server.
Fault Tolerance:
Fault-tolerant systems are proprietary, expensive, tightly coupled duplicated systems. Fault-handling capabilities are integrated into the operating system and become one of its functions. These systems respond to system failures immediately and fully automatically, and provide uninterrupted services.

THE UNIQUE FEATURES IN H.A. TECHNICAL SOLUTIONS' HA

It is H.A. Technical Solutions' goal to deliver a product that maintains open systems compatibility with the platform architectures we support and their operating systems. Our product provides reliable access to applications and data through automatic failover processing, GUI administration, and alerting facilities. It is flexible enough to be adapted to individual implementation requirements and to scale in the future.

Open Systems Compatibility:
H.A. Technical Solutions HA (HATS HA) retains the performance, cost effectiveness, and technology advantages of Open Systems Architecture. It is compatible with existing services native to Solaris (e.g., NFS, Telnet, ftp, etc.) and applications such as Sybase and other common hardware and software products that are available through Sun Microsystems and other third party vendors.
Reliability and Accessibility:
When a failure event occurs, it typically takes 5 seconds for fault detection and 10 to 120 seconds for failover to initiate in an RDBMS environment. The failover process for Internet applications can take much less time (often under 1 second) by using additional software that journals the entries on the storage system. The failover processing occurs automatically without rebooting the backup server. This provides the highest data and service availability in a distributed SPARC-architecture, client/server environment. There is no single point of failure to prevent accessibility to the data and application services provided by the system.
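As a rough illustration of the timings quoted above, the following Python sketch adds the typical detection time to the quoted failover range to bound the interruption for one failure event. The function name and the assumption that the two phases run strictly one after the other are illustrative and are not taken from HATS HA itself.

```python
DETECTION_SECONDS = 5          # typical fault-detection time quoted above
FAILOVER_SECONDS = (10, 120)   # quoted failover range in an RDBMS environment


def interruption_window(detection, failover_range):
    """Return (shortest, longest) interruption in seconds.

    Assumes detection and failover happen sequentially (an assumption of
    this sketch, not a documented property of the product).
    """
    low, high = failover_range
    return detection + low, detection + high


shortest, longest = interruption_window(DETECTION_SECONDS, FAILOVER_SECONDS)
print(f"Interruption per failure event: {shortest} to {longest} seconds")
```

Under these assumptions, a single failure event in an RDBMS environment interrupts service for roughly 15 to 125 seconds.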
HATS HA can manage redundant servers, network communication links, network adapters, shared disk subsystems, and SCSI disk adapters to achieve high availability. A standard configuration consists of two SPARC servers or PCs, each with two SCSI interfaces, two Ethernet interfaces, and an internal disk. The servers are connected to an external disk subsystem that could be a single disk or a RAID disk array. One of the networks is a "private" connection shared between the two servers, and the other is the "public" network providing connection to the client workstations for services, data, and applications. (See Figure 1.)
An Active Server is defined as the computer system that provides critical services, data, and/or applications to the client workstations.
A Backup Server is defined as a computer system that is configured for resuming the functionality of the Active Server. A Backup Server can be dedicated or non-dedicated. It can also be an Active Server at the same time.
As a dedicated Backup Server, its function is simply to wait for a failover event and take over the role of the Active Server. When configured as a non-dedicated Backup Server, it can provide services, data, and/or applications to clients while waiting for a failover event to occur. Multiple non-dedicated Backup Servers can be identified and configured to divide up the workload of an Active Server whenever a failover event occurs. These multiple non-dedicated Backup Servers can then take over the workload of a failed Active Server in a pre-defined scheme that is configured by the System Administrator. Additional redundancy can thus be configured into the routine. For example, if one of the Backup Servers fails to react properly to a failover event, another Backup Server can detect this failure and take over the workload for the "failed" Backup Server. This definition allows the Backup Server to also have the role of an Active Server itself.
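The pre-defined scheme described above can be pictured as an ordered list of Backup Servers per service. The Python sketch below is only an illustration under that assumption; the function name, server names, and data layout are hypothetical and do not reflect the HATS HA configuration format.

```python
def plan_failover(scheme, healthy_servers):
    """Assign each service of a failed Active Server to a Backup Server.

    scheme          -- maps a service name to an ordered list of Backup Servers
    healthy_servers -- set of Backup Servers currently able to take on work
    Returns a dict of service -> chosen server (None if no candidate is healthy).
    """
    plan = {}
    for service, candidates in scheme.items():
        # Take the first preferred Backup Server that is still healthy.
        plan[service] = next((s for s in candidates if s in healthy_servers), None)
    return plan


scheme = {
    "nfs": ["backup1", "backup2"],
    "rdbms": ["backup2", "backup1"],
}

# Both Backup Servers respond: the workload is divided between them.
print(plan_failover(scheme, {"backup1", "backup2"}))

# backup1 fails to react: backup2 also takes over its share.
print(plan_failover(scheme, {"backup2"}))
```

Listing more than one candidate per service is what provides the additional redundancy: if the first Backup Server does not react, the next one in the list picks up the work.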

HATS HA's Initialization:

After the system bootstrap process is completed:

HATS HA Manager is the first initiated daemon process on each server.

HA Manager is the HATS HA Kernel.

The HA Manager initializes the necessary processes and configures the server for failover processing as defined in the HATS HA configuration.

Failover Detection Process:
HATS HA Agents are daemon processes that monitor and manage the defined critical services that are provided by the Active Server. These agents provide status signals for these critical services to the HA Manager in the form of "electronic" heartbeats. While the HA Manager on the active server is receiving an "alive" or "healthy" heartbeat signal from all of its agent processes, it sends a heartbeat to the HA Manager on the backup server. This HA Manager to HA Manager heartbeat function informs the backup server that the active server is currently in good "health" and operating properly. When this active server to backup server heartbeat is absent, the backup server assumes that the active server has failed and initiates the defined failover processes.
If a critical service on the active server fails, the agent will send a "fail" heartbeat to the HA Manager on the active server.
If an agent on the active server itself fails, the HA Manager on that server detects the absence of the agent's heartbeats and, after a configurable time-out, performs designated tasks to restart the applications.
Any critical service on the active server can be monitored by more than one agent process. Each agent is designed to monitor a specific or unique aspect of that service. The service is considered to be available and "healthy" as long as the HA Manager is receiving at least one heartbeat function from the individual agents that are monitoring that service. Thus, if three agents are monitoring one service and one or two of them detect a failure, the HA Manager will not initiate failover processing for that service until the third heartbeat function also signals a failure.
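The "at least one healthy heartbeat" rule described above can be sketched in Python as follows. This is a minimal illustration, not the HA Manager implementation: the class name, the "alive"/"fail" status strings, and the use of a time-out to treat stale heartbeats as missing are assumptions made for the example.

```python
import time


class ServiceMonitor:
    """Tracks the latest heartbeat from each agent watching one service."""

    def __init__(self, agent_names, timeout_seconds):
        self.timeout = timeout_seconds
        # None means no healthy heartbeat is on record for that agent.
        self.last_alive = {name: None for name in agent_names}

    def record(self, agent, status, now=None):
        """Store an "alive" heartbeat; a "fail" heartbeat clears the agent."""
        now = time.monotonic() if now is None else now
        self.last_alive[agent] = now if status == "alive" else None

    def is_available(self, now=None):
        """Healthy while at least one agent has a fresh "alive" heartbeat."""
        now = time.monotonic() if now is None else now
        return any(
            seen is not None and now - seen <= self.timeout
            for seen in self.last_alive.values()
        )


monitor = ServiceMonitor(["disk", "network", "process"], timeout_seconds=5)
for name in ("disk", "network", "process"):
    monitor.record(name, "alive", now=0.0)

monitor.record("disk", "fail", now=1.0)
monitor.record("network", "fail", now=2.0)
print(monitor.is_available(now=3.0))   # third agent still reports "alive"

monitor.record("process", "fail", now=4.0)
print(monitor.is_available(now=4.5))   # every agent has signalled a failure
```

With three agents on one service, failover processing would begin only after the last remaining agent also reports a failure or its heartbeat goes stale.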
The agents monitor services that include communication, file, disk, network, NFS, NIS, DNS, and RDBMS. End users may also define and develop agents that are customized for special application services that may be deployed.
HATS HA uses standard RPC calls for the HA Manager-to-agent and HA Manager-to-HA Manager heartbeat functions. This implementation benefits the end user because these standard mechanisms allow upgrades to new communication media without affecting the integrity of the HATS HA design.
Failure Processing:
HATS HA issues immediate and automatic actions against specific faults. Following are some possible user-configured responses to selected failed services:

• The failure of the service is ignored

• A hardware device has failed and an alternate is configured and available
    • Failover to an alternate device is initiated

• A software service such as NFS or DNS has failed
    • Failover is initiated immediately to resume the failed service
    • The system providing the service is shut down

• A software application has failed
    • A number of attempts to restart the service on the active server are triggered
    • If the service fails to restart, the user may choose to:
        • IGNORE the failure of the service
        • HALT processing of the service
        • FAILOVER the service to the designated backup server
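The response to a failed software application (restart attempts first, then a user-selected action) can be sketched in Python as below. The enum, the function name, and the callable used to stand in for a restart script are hypothetical; they only illustrate the decision flow listed above.

```python
from enum import Enum


class Action(Enum):
    IGNORE = "ignore"
    HALT = "halt"
    FAILOVER = "failover"


def handle_application_failure(restart, max_attempts, on_exhausted):
    """Try to restart the service locally, then fall back to the chosen action.

    restart      -- callable returning True when the service came back up
    max_attempts -- restart attempts allowed on the active server
    on_exhausted -- Action selected by the user if every attempt fails
    """
    for attempt in range(1, max_attempts + 1):
        if restart():
            return f"restarted on attempt {attempt}"
    return on_exhausted.value


# Simulated restart script that succeeds on the third try.
outcomes = iter([False, False, True])
print(handle_application_failure(lambda: next(outcomes), 3, Action.FAILOVER))

# Restart never succeeds, so the configured action is taken.
print(handle_application_failure(lambda: False, 2, Action.FAILOVER))
```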

Failover Processing:
When a failover event occurs, there is a brief interruption in services while the failure is recognized and the failover process initializes the services on the backup server. This process occurs automatically, without human interaction. Once failover processing is complete, the services provided by the active server are operating on the backup server.
Either the active server or the backup server can initiate a failover process, depending on the kind of heartbeat loss and the user-defined procedure to follow once that heartbeat function has failed.
A failover process transfers to the backup server the network identity of the active server (its TCP/IP address and MAC address), the shared disk subsystem(s) (which can be a mirrored disk set or disk array), and the designated services provided by the active server.
The failover of stateless applications such as NFS, NIS, and DNS is transparent to the end user. Stateful applications such as FTP, Remote Login, and Telnet must re-establish the connection to the server application after the failover event. Other processes, such as client/server database applications, can be programmed to track the status of the application/service provided by the active server so that they can resume or reconnect to the service once it is provided by the backup server. This programming technique allows these stateful applications to appear stateless to the end user.
Unfortunately, terminals that are directly connected to serial ports on the active server will be rendered unusable, because serial port interfaces cannot be "failed over" to another server. Sessions through network terminal servers and client terminal processes such as X-Windows, Telnet, and Remote Login will also be terminated, but those users can reestablish connections to the backup server and continue to access the applications/services that have been resumed on that server.
When the failed active server recovers or has been repaired, HA Manager may be reconfigured to allow it to now play the role of the backup server, if desired. Otherwise, it can be configured to reclaim its original role as the active server and restore its network identity, resources, applications and services from the backup server and return to active server status.

H.A. TECHNICAL SOLUTIONS HA CUSTOMIZATION

Server Configuration:
User configurable parameters are included in the HATS HA product. They are configured to nominal default settings, but can be customized to the enterprise requirements as needed. The following is a partial list of some popular configurable parameters:

• The server(s) and the defined role(s) in the HA configuration
• Network configuration information for the private network and public network
• The backup server(s) designated for specific active server(s)
• Script names to execute to initiate specific services
• Critical services and their corresponding HA Agents
• HA Agent process names and their heartbeat failure time-out limits and failure processing/actions
• HA Manager failure time-out limits and failure processing/actions
• The maximum number of restart attempts for a failed service before failover processing begins
• Prerequisite services that must be operating before failover processing is initiated
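To make the parameter list above more concrete, the Python sketch below models the same kinds of settings as plain data classes. It is not the HATS HA configuration file format: every class, field name, host name, address, and script path is a hypothetical placeholder chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class AgentConfig:
    process_name: str
    heartbeat_timeout_seconds: int
    failure_action: str            # e.g. "ignore", "halt", or "failover"


@dataclass
class ServiceConfig:
    name: str
    start_script: str              # script executed to initiate the service
    agents: list                   # AgentConfig entries monitoring the service
    max_restart_attempts: int = 3  # restarts tried before failover begins
    prerequisites: list = field(default_factory=list)


@dataclass
class ServerConfig:
    hostname: str
    role: str                      # "active" or "backup"
    private_address: str
    public_address: str
    backup_servers: list = field(default_factory=list)
    services: list = field(default_factory=list)


def servers_without_backup(servers):
    """Return hostnames of active servers with no designated backup server."""
    return [s.hostname for s in servers
            if s.role == "active" and not s.backup_servers]


nfs = ServiceConfig(
    name="nfs",
    start_script="/opt/example/start_nfs.sh",   # hypothetical path
    agents=[AgentConfig("nfs_agent", 10, "failover")],
)
servers = [
    ServerConfig("alpha", "active", "10.0.0.1", "192.0.2.1",
                 backup_servers=["beta"], services=[nfs]),
    ServerConfig("beta", "backup", "10.0.0.2", "192.0.2.2"),
]
print(servers_without_backup(servers))
```

A check such as `servers_without_backup` shows the kind of consistency rule a configuration verifier might apply before the HA environment is started.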

Shell Scripts:
User defined shell scripts can add functionality, reliable logging of events and notification processing in response to processing of a failed service. User defined shell scripts can perform many tasks, some of which are listed below.

• Start and stop various services
• Define follow-up procedures for the failed service(s)
• Send messages to the system console
• Write log file information for troubleshooting purposes
• Write a message to the system logger
• Notify support personnel via pager
• Notify help desk personnel via e-mail or other system management software
• Broadcast messages to all users

User-Defined Agents:
HATS HA provides both the API and HA agent templates for user-defined agents specific for the user's desired requirements. In order to program these templates, the user needs to have a working knowledge of the C programming language and the application or service that the new agent will monitor. Only the component that will interact with the service needs to be programmed. The API and HA agent templates will provide the other functions.
System Administration:
System administration utilities include support for checking configurations, installing HATS HA on a new server, and configuring and managing the HATS HA environment. All these functions can be performed from a single node on the network or from a system console.
HATS HA provides a graphical user interface (GUI) for easy system administration and HA Agent/Manager monitoring from any character based or X-Windows terminal. The system administrator can issue commands such as:

• Start and stop the HA Manager
• Start and stop specific HA Agents
• Start and stop specific services
• Force a failover process to occur from either the active or backup server
• Monitor and query the status of servers, networks, services and agents
• Verify HA server configuration(s)
• Install H.A. Technical Solutions HA software on another server
• Configure and manage the enterprise H.A. Technical Solutions HA environment

HA Initiation:
The first process HATS HA initiates is a daemon process called the HA manager. Each server that starts HATS HA is configured by the HA manager according to the settings defined in the HA configuration file. The HA configuration file is the same on every server. All services and their corresponding agents (the processes that monitor and manage the services) are specified in the HA configuration file. A server initiates only the services and corresponding agents that it is designated to run.
Failure Detection:
Agents are processes that monitor and manage critical hardware and software services. Agents inform the HA manager of the current status of the services with agent heartbeats. If a service should fail, the agent will stop sending heartbeats to the HA manager and a predefined failover procedure will be carried out. Each server handles only those services and corresponding agents that are initiated by its HA manager.
A service can be monitored by more than one agent. Each agent monitors different aspects of the service. The service is considered available as long as all of its agents send a heartbeat to the HA manager.
A service can also be agentless. There will be no agent to watch over the availability of the service. The service is considered available as long as the service is properly started; a failover will only occur when the active server itself has failed.
In addition to agents for communication services, file services, disk services, NFS, RDBMS, etc., users may create agents for their own developed applications.
When all the services managed by the HA manager are considered available, a server heartbeat will be broadcast to the related backup server(s). The loss of a server heartbeat from an active server will cause a failover of all of its services to the corresponding backup server(s).
HATS HA utilizes the UDP protocol for exchanging the heartbeat messages between HA managers and HA agents.
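A heartbeat exchange over UDP can be sketched with Python's standard `socket` module as shown below. The message layout (`sender:status` as ASCII text), the function names, and the use of the loopback address are assumptions made for this example; the actual HATS HA wire format is not described in this paper.

```python
import socket


def open_listener(host="127.0.0.1", timeout_seconds=2.0):
    """Create a UDP socket that waits for heartbeat datagrams."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, 0))               # port 0 lets the OS pick a free port
    sock.settimeout(timeout_seconds)   # a missed heartbeat shows up as a timeout
    return sock


def send_heartbeat(address, sender, status):
    """Send one heartbeat datagram such as b"nfs_agent:alive"."""
    message = f"{sender}:{status}".encode("ascii")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(message, address)


def receive_heartbeat(listener):
    """Return (sender, status), or None if no heartbeat arrives in time."""
    try:
        data, _ = listener.recvfrom(1024)
    except socket.timeout:
        return None
    sender, status = data.decode("ascii").split(":", 1)
    return sender, status


listener = open_listener()
send_heartbeat(listener.getsockname(), "nfs_agent", "alive")
print(receive_heartbeat(listener))
listener.close()
```

Because UDP is connectionless, the receiver cannot tell a dead sender from a lost datagram; in this sketch both appear as a time-out, which is why a time-out limit is needed before a missing heartbeat is treated as a failure.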
Client:
Client workstations are network nodes and/or terminals that access applications and/or services provided by the active servers. They can be Intel-based PCs using NFS and RDBMS application services, X-Terminals, other UNIX servers utilizing network time-sync services, DNS, NIS or NFS mounted file systems, network print devices, network modem pools, etc.
Server Interconnection:
The mode of communication between client and server processes is TCP/IP. The user may define the physical delivery medium that best fits the enterprise requirements. These links include Ethernet (IEEE 802.3), Token Ring (IEEE 802.5), FDDI or Asynchronous SLIP and X.25 connections. HATS HA requires two network links between the active and backup servers: a private network and a public network.
The private network is used for a dedicated communication link between the active and backup servers for exchanging heartbeat messages that inform the machines of each other's HA Agent and Manager status. This private network can be an asynchronous communication link (serial port) if the heartbeat link is point-to-point, or a network link as described above.
The public network is used for clients to access the applications and/or services provided by the active server. This is the "normal" path used for the general access point to these applications and/or services.
Storage Device Configuration:
The storage devices used most often in the computer industry today are SCSI and SCSI-2 hard disks. They provide good price/performance and price/storage capacity ratios. There are several configurations that can be implemented with HATS HA. They are internal disk drives, external disk drives, mirrored disk drives, and Redundant Array of Inexpensive Disks (RAID) subsystems.

Internal disk drives are used for storing the operating system, temporary spool areas, applications and data that is not required to be accessible when a service failover occurs.

External (unshared) disk drives fall into the same category/condition as stated for internal disk drives.

Mirrored disk drives allow for special redundancy and special processing on the active server. These devices provide the "first line of defense" in the event of a single failed disk device. When the HA Agent detects a failure of the primary disk device, it can respond by accessing the respective mirror disk device before initiating a full system failover event.

RAID devices are usually external devices that can be shared between two or more servers. When connected as multi-hosted devices, they can be simultaneously connected to both the active and backup servers. This provides the backup server a direct physical data path to access the disk partition(s) or device(s) from which to launch the critical applications and/or access the critical data after a failover event has occurred. A RAID subsystem can have several different configurations, the most popular being RAID-1, RAID-3, and RAID-5.

RAID-1 provides hardware disk mirroring. The definition of this architecture allows for the RAID hardware to detect a failed disk device and process the failure automatically without notification to the Operating System. However, the HA Agent can be configured to query the RAID controller, detect the failure and initiate a user defined process accordingly.

RAID-3 or RAID-5 configurations allow for the failure of any one disk device without causing the failure of the entire disk subsystem. At these RAID levels, the data is divided and written to multiple disk units, with parity (checksum) information stored for each write. When any one disk fails, the missing data can be reconstructed from the parity information. As in RAID-1, the RAID controller manages this automatically. Again, the HA Agent can be configured to query the RAID controller, detect the failure and initiate a user defined process accordingly.
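The reconstruction step can be illustrated with XOR parity, the scheme commonly used by RAID-3 and RAID-5. The Python sketch below is a simplified model in which each "disk" holds one small block; the helper name and sample data are illustrative and do not describe any particular RAID controller.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data "disks" holding one 4-byte block each, plus one parity block.
data_blocks = [b"HATS", b"HA!!", b"DATA"]
parity = xor_blocks(data_blocks)

# Simulate losing the second disk: XOR the survivors with the parity block.
survivors = [data_blocks[0], data_blocks[2], parity]
rebuilt = xor_blocks(survivors)

print(rebuilt == data_blocks[1])
```

Because XOR is its own inverse, combining the surviving blocks with the parity block yields exactly the missing block, which is why any single disk failure can be tolerated.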


SUPPORTED CONFIGURATIONS

Hot Standby - One Active Server
The Hot Standby configuration defines one server as a mission critical system and the backup server as the active server's "immediate replacement." In other words, the backup server's only function is to monitor the heartbeat functions of the active server and wait for a failure event to process. Both servers are connected to the private network, the public network, and a shared external disk subsystem. This configuration can achieve consistent response time after failover processing, but the resources of the backup server are underutilized. This configuration must meet the following minimal requirements:

One designated active server.

One designated backup server (MUST have the identical internal configuration of the active server for memory, etc.).

One private network interface in each server.

One public network interface in each server.

Two (minimum) SCSI / SCSI-2 interfaces in each system.

One (minimum) external SCSI / SCSI-2 dual-hosted disk subsystem.

Hot Standby - Two Active Servers
This configuration designates two servers as mission critical systems and the third server as a hot standby server for the two mission critical servers. Having a hot standby server ready to take over if one of the mission critical servers fails prevents the costly downtime associated with failures on a mission critical server. Additionally, the hot standby server can be used for other non-mission critical applications.
To deploy this configuration, all servers must be connected to the same public network and a multi-hosted external disk subsystem. The backup server can be configured to achieve specific levels of performance. For example, in planning for the possibility of both active servers failing simultaneously, the backup server can be sized to resume all services from one of the active servers at a time. In the event the second server fails, the backup server can be configured to run in a degraded state, or to fail over only the most important, mission critical services. Alternatively, the backup server can be fitted with the physical capacity to resume all services from both active servers at any time and operate within expected parameters.
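The degraded-state option described above amounts to choosing which services the backup server resumes when it cannot carry everything. The Python sketch below shows one simple way to express that choice; the priority numbers, the abstract "load units", and the greedy selection rule are assumptions for illustration rather than HATS HA behavior.

```python
def select_services(services, capacity):
    """Pick services to resume, most critical first, within spare capacity.

    services -- list of (name, priority, load); a lower priority number
                means a more critical service
    capacity -- spare load units available on the backup server
    """
    resumed = []
    for name, _priority, load in sorted(services, key=lambda s: s[1]):
        if load <= capacity:
            resumed.append(name)
            capacity -= load
    return resumed


services = [
    ("print", 3, 20),
    ("rdbms", 1, 60),
    ("nfs", 2, 30),
]

# Backup server sized for part of the combined workload.
print(select_services(services, capacity=100))

# Backup server sized to carry every service from both active servers.
print(select_services(services, capacity=200))
```

With 100 load units available, only the two most critical services are resumed; with 200, all three are.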
This configuration must meet the following minimal requirements:

Two designated active servers

One designated backup server

One private network interface in each server

Two (minimum) SCSI / SCSI-2 interfaces in each system

One (minimum) external SCSI / SCSI-2 multi-hosted disk subsystem

One public network interface in each server.

Warm Standby - Two Active Servers:
In this configuration, two mission critical systems are configured as mutual backups for one another. This strategy achieves 100% utilization of the hardware investment, since both servers are defined as mission critical. In the event of a failed service, the other server will resume the failed service(s). If one of the servers were to fail entirely, the remaining server may operate in a degraded state, depending upon how it is physically configured for memory, CPU, etc.
To deploy this configuration, both servers must be connected to the same public network and an external shared disk subsystem. This configuration must meet the following minimal requirements:

Two designated active servers.

Two public network interfaces in each server.

Two (minimum) SCSI / SCSI-2 interfaces in each system.

One (minimum) external SCSI / SCSI-2 dual-hosted disk subsystem.


SYSTEM REQUIREMENTS

Hardware

SPARC-compliant computer system or PCs running Solaris x86.

SGI

HP

Memory

1 MB of RAM for H.A. Technical Solutions HA software.

16 MB recommended to run applications.

Disk Capacity

3.3 MB of internal disk space for H.A. Technical Solutions HA software.

Supported Applications

UNIX services such as NFS, NIS, DNS, etc.

Database (RDBMS) services such as Sybase, Ingres, etc.

User defined client/server applications in fields such as banking, crucial network services, government services, etc.
