Category
Fault-tolerant computer systems
computer cluster
set of computers configured in a distributed computing system
OpenVMS
OpenVMS, often referred to as just VMS, is a multi-user, multiprocessing and virtual memory-based operating system. It is designed to support time-sharing, batch processing, transaction processing and workstation applications. Customers using OpenVMS include banks and financial services, hospitals and healthcare, telecommunications operators, network information services, and industrial manufacturers. During the 1990s and 2000s, there were approximately half a million VMS systems in operation worldwide.
Spanning Tree Protocol
network protocol that builds a loop-free logical topology for Ethernet networks
server farm
collection of computer servers

uptime
Uptime is a measure of system reliability, expressed as the period of time a machine, typically a computer, has been continuously working and available. Uptime is the opposite of downtime.
hot swapping
replacing computer system components without shutting down the system
single point of failure
part whose failure will disrupt the entire system
transaction processing
information processing that is divided into individual, indivisible operations
snapshot
recorded state of a computer storage system at a particular point in time
ECC memory
computer memory which detects and corrects errors
Byzantine fault
fault in a computer system that presents different symptoms to different observers
data replication
making multiple copies of information to ensure consistency in computing
consensus
problem in distributed computing of getting multiple processes to agree on a single value despite failures
redundancy
use of a number of critical components for securing one or more functions of a system with the intention of increasing its reliability, usually in the form of a backup or fail-safe design

failover
switching to a redundant or standby system when the active system fails

fail-safe
In engineering, a fail-safe is a design feature or practice that, in the event of a failure of the design feature, inherently responds in a way that will cause minimal or no harm to other equipment, to the environment or to people. Unlike inherent safety to a particular hazard, a system being "fail-safe" does not mean that failure is naturally inconsequential, but rather that the system's design prevents or mitigates unsafe consequences of the system's failure. If and when a "fail-safe" system fails, it remains at least as safe as it was before the failure. Since many types of failure are poss
quantum error correction
techniques that enable reliable delivery of quantum data over unreliable quantum communication channels
Tandem Computers
American computer hardware manufacturer (1974–1997)
Paxos
family of protocols for solving consensus in a network of unreliable processors
high-availability cluster
cluster of separate computers designed for high availability at the application level (even if individual nodes fail)

data redundancy
presence of data additional to the actual data that may permit correction of errors in stored or transmitted data
Round-robin DNS
load balancing technique in the Internet's Domain Name System (DNS)
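
As a rough illustration of the rotation idea behind round-robin DNS, the sketch below returns the same address list on every query but shifts it by one position each time, so successive clients see a different first address. The class name `RoundRobinRecords` and the documentation-range addresses are hypothetical; this is not how any particular DNS server is implemented.

```python
from collections import deque


class RoundRobinRecords:
    """Hypothetical record set that rotates its answer order on every query."""

    def __init__(self, addresses):
        self._addresses = deque(addresses)

    def answer(self):
        # Return the current ordering, then rotate left by one so the
        # next query starts with the following address.
        result = list(self._addresses)
        self._addresses.rotate(-1)
        return result


records = RoundRobinRecords(["192.0.2.1", "192.0.2.2", "192.0.2.3"])
first = records.answer()
second = records.answer()
```

Clients that simply use the first address in each answer are spread across the servers. Note that this alone does not detect failed servers.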
log-structured file system
structure of file system that writes all information to a circular buffer
conflict-free replicated data type
data structure replicated across a network such that any replica is updatable independently, concurrently and without coordination, and any inconsistencies are algorithmically resolved with replicas’ states guaranteed to eventually converge
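
One of the simplest examples of such a data type is a grow-only counter. The sketch below is a minimal illustration, assuming each replica has a unique identifier; the class name `GCounter` and its methods are chosen here for illustration and do not refer to any specific library.

```python
class GCounter:
    """Grow-only counter: each replica increments only its own slot."""

    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}  # replica id -> increments observed from that replica

    def increment(self, amount=1):
        if amount < 0:
            raise ValueError("a grow-only counter cannot be decremented")
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        # Element-wise maximum is commutative, associative and idempotent,
        # so replicas converge regardless of merge order or repetition.
        for rid, n in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), n)


a, b = GCounter("a"), GCounter("b")
a.increment(2)      # update on replica a, no coordination
b.increment(3)      # concurrent update on replica b
a.merge(b)
b.merge(a)
```

After the two merges both replicas report the same total, and merging again changes nothing.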
NonStop
family of fault-tolerant servers
application checkpointing
technique for adding fault tolerance to computing systems by saving application state so that execution can resume after a failure
data synchronization
process of bidirectionally maintaining consistency of data stored in multiple data stores
SpaceWire
SpaceWire is a spacecraft communication network based in part on the IEEE 1355 standard of communications. It is coordinated by the European Space Agency (ESA) in collaboration with international space agencies including NASA, JAXA, and RKA.
disk array
disk storage system which contains multiple disk drives
Triple modular redundancy
redundancy using three systems and voting to determine the result
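
The voting step can be sketched as a majority function over three replica outputs. This is a minimal software illustration, assuming the outputs are hashable values that can be compared for equality; the function name `tmr_vote` is hypothetical, and real systems typically implement the voter in hardware.

```python
from collections import Counter


def tmr_vote(a, b, c):
    """Return the value produced by at least two of the three replicas."""
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count < 2:
        # All three replicas disagree, so no fault can be masked.
        raise ValueError("no majority: all three replicas disagree")
    return value


result = tmr_vote(42, 42, 17)  # the single faulty output 17 is outvoted
```

A single faulty replica is masked, while two simultaneous faults that happen to agree would still win the vote.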
disk mirroring
replication of logical disk volumes onto separate physical hard disks in real time to ensure continuous availability
Stratus Technologies
American producer of computer servers and software
Raft
consensus algorithm
hot spare
spare component that is an active and connected part of a working system, ready to take over functionality with little or no interruption
multipath I/O
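technique that provides more than one physical path between a computer and its storage devices for fault tolerance and performance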
reliability, availability and serviceability
measures of the robustness of computer hardware
disk array controller
computer device that manages a hardware RAID array