Like any engineering system, a network (or a part of it) can fail for a number of reasons: faulty hardware, software bugs, breakage of the physical medium (e.g., fiber cables), or even power outages. These failures typically affect specific and, generally, separated network elements. The key point is that networks are designed to be resilient (robust) to failures, i.e., able to carry (possibly decreased) traffic demands even when part of the network resources has temporarily failed.
This report analyzes the availability of the network under different failures.
Let G(N,E) be a network, where N is the set of nodes and E is the set of links. Nodes and links are built on certain resources:
In order to study the reliability of a network, these resources or elements are grouped into failure groups, or Shared Risk Groups (SRGs).
A failure group is defined as a set of network elements affected by the same failure risk; therefore, all of its elements fail simultaneously. Let F be the set of SRGs within the network and f ∈ F a failure group. Then:
The different failure situations are specified by the availability status of the links and nodes. Each failure state (situation) s is characterized by a vector of binary node availability coefficients αns and a vector of binary link availability coefficients αes, where s = 0, 1, ..., S indexes the predefined list of failure situations. Although it is common practice in the literature, no single-failure assumption is made here. Instead, we use the concept of shared risk group, or failure group.
A shared risk group (SRG) is a set of network elements that are collectively impacted by a specific fault or fault type. For example, in a fiber-span network, a shared risk link group (SRLG) is the union of all the links on those fibers that are routed in the same physical conduit. Besides shared conduit, this concept covers other types of shared risk, such as a shared fiber cable or a shared power supply.
Let F be the set of failure groups defined within the network, and let f ∈ F denote a failure group. Nodes and links can be attached to none, one, or several failure groups. Given a node n, we denote by F(n) the set of failure groups associated with that node. Likewise, F(e) denotes the set of failure groups associated with link e. When a node or link is not attached to any failure group (e.g., F(n) = ∅), it is assumed that it never fails. Otherwise, a node or link is down if any of its associated failure groups is faulty.
Example: In the network shown in the following figure, the links traverse several conduits from their origin node to their respective destination node. If a conduit is cut, all the links through it break. To model this, we treat every conduit as a failure group, so each link is associated with the failure groups of the conduits it traverses: F(e12) = {f1, f3}, F(e13) = {f1, f2}, F(e23) = ∅.
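This association and the down-state rule above can be sketched as follows (the names are illustrative, not Net2Plan API; the link-to-SRG map reproduces the example):

```python
# Each link maps to the set of failure groups (SRGs) of the conduits it traverses.
F = {
    "e12": {"f1", "f3"},
    "e13": {"f1", "f2"},
    "e23": set(),          # not attached to any SRG: assumed never to fail
}

def link_is_down(link, faulty_groups):
    """A link is down iff any of its associated failure groups is faulty."""
    return bool(F[link] & faulty_groups)

# Conduit f1 is cut: links e12 and e13 go down, e23 survives.
state = {"f1"}
print([e for e in F if link_is_down(e, state)])   # ['e12', 'e13']
```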
Given a failure group f, we denote by MTTFf and MTTRf the Mean Time To Failure and Mean Time To Repair of the group, respectively. The availability Af of group f is defined as the fraction of time in which the group is up:
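From these definitions, the availability takes the standard form (stated here since the formula follows directly from the MTTF/MTTR definitions above):

```latex
A_f = \frac{\mathrm{MTTF}_f}{\mathrm{MTTF}_f + \mathrm{MTTR}_f}
```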
We denote by S the set of possible states in which the network can be found. A network state s is defined by its set of affected failure groups. We denote by s0 the normal operation state, in which no resource is faulty (s0 = ∅).
Given a path p, we denote by F(p) the set of failure groups associated with that path. Typically, this is the union of the failure group sets of the nodes and links along the path.
Likewise, given a demand d, we denote by F(d) the set of failure groups associated with any path carrying traffic of that demand. Finally, given a demand set D, we denote by F(D) the set of failure groups traversed by any demand in D.
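The union rule for F(p) can be sketched as follows (a minimal illustration with hypothetical node/link SRG maps, not Net2Plan API):

```python
# Failure groups attached to nodes and links (illustrative values).
F_node = {"n1": {"f4"}, "n2": set(), "n3": set()}
F_link = {"e12": {"f1", "f3"}, "e13": {"f1", "f2"}}

def srgs_of_path(nodes, links):
    """F(p): union of the failure group sets of the nodes and links in the path."""
    groups = set()
    for n in nodes:
        groups |= F_node[n]
    for e in links:
        groups |= F_link[e]
    return groups

print(sorted(srgs_of_path(["n1", "n2"], ["e12"])))   # ['f1', 'f3', 'f4']
```

F(d) and F(D) follow by taking further unions over the paths of a demand and over the demands of a set, respectively.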
As an assumption, when the network is in the failure-free state (s0), every demand carries 100% of its traffic. Conversely, when the network is in a given state s ≠ s0, a fraction of the traffic is likely to be lost, depending on the fault and on the protection/restoration mechanism defined within the network.
Given a route pd, a demand d, or a set of demands D, we define the concept of availability in a twofold manner:
This report does not discuss the sequence of events by which network failures are monitored, detected, and mitigated; for more detail on this aspect, see [1]. The repair process following a failure is also out of scope here.
An important feature of the mechanisms assuring network resilience is the way in which resources are re-established in case of failure, i.e., protection versus restoration. The term protection describes mechanisms whose actions to secure connections are taken before the failure happens, while restoration refers to mechanisms whose actions are taken after the failure. In fact, with protection the network can be resilient even if no action at all is taken at failure time, e.g., in networks designed with bifurcated routing (also known as path diversity).
We classify re-establishment techniques into three categories [2]:
In path re-establishment, when an element of a path fails, a route between the ingress node and the egress node of each active connection traversing the affected element must be found. Conversely, in link re-establishment, a backup route between the nodes adjacent to the affected element must be found. Finally, in subpath re-establishment, a backup route between a node preceding the failure and a node after the failure must be found.
Recovery granularity. This option allows only a fraction of the traffic of a working path to be protected on the backup path.
In Net2Plan this option is configured on a per-route basis by the boolean parameter partialRecoveryAllowed.
Post-recovery operation. While traffic is flowing on the backup path, a decision must be made whether to let the traffic remain there, treating the backup as the new primary path, or to switch back to the old (or to a new) working path. Primary routes are pre-configured with the behavior to take when they are restored to service; the choices are revertive and non-revertive mode. In revertive mode, the primary path is always the preferred path, and it is used whenever it is available. In non-revertive mode, there is no preferred path, or it may be desirable to minimize the further service disruption brought on by a revertive switching operation, so no switch-back to the original primary path is performed.
In Net2Plan this option is configured on a per-route basis by the boolean parameter revertiveMode.
Backup segment resource usage. In the case of pre-reserved backup segments, the question arises of what use these resources may be put to while the backup path is not in use. There are three options:
In Net2Plan only the Dedicated-resource and Shared-resource options are considered.
Loop prevention. A loop in a path is defined as the path traversing a node more than once. In subpath and link re-establishment techniques, since the backup segment becomes part of the primary path, it should be ensured that these two parts do not overlap at any node.
In Net2Plan the kernel does not check for routing loops; users are free to implement loop-prevention mechanisms in their provisioning algorithms.
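A loop check of the kind a provisioning algorithm could run before committing a route is straightforward (a sketch, not part of Net2Plan):

```python
def has_loop(node_sequence):
    """A route has a loop iff it visits some node more than once."""
    return len(node_sequence) != len(set(node_sequence))

print(has_loop(["n1", "n2", "n3"]))        # False
print(has_loop(["n1", "n2", "n1", "n3"]))  # True: n1 is visited twice
```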
Split path protection. In split path protection, multiple backup paths are allowed to carry the traffic of a working path based on a certain load splitting ratio. This is especially useful when no single backup path can be found to carry traffic from the primary path in case of a fault.
Provisioning algorithms have the option to define new routes, so users are free to decide the routing under failure states.
Given a network G(N, E), a traffic demand set D, a routing, a protection/restoration policy, and a network state s with its associated set of affected failure groups F(s), we can compute the blocked traffic L(pd)s, L(d)s, and L(D)s as the fraction of the planned traffic for path pd, demand d, or demand set D, respectively, that was carried under state s0 but is not carried under state s.
From these values, we can compute the availability of paths, demands, and the network as follows:
Classic availability. In this case, the availability in a given state s is 1 if all planned traffic remains carried; otherwise, it is 0:
Weighted availability. In this case, the availability is the fraction of traffic that survives:
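The two definitions can be contrasted on a toy example (the state probabilities and blocked fractions below are illustrative values, not from any real network; π_s is computed as in the next section):

```python
# pi[s]: fraction of time the network spends in state s.
# blocked[s]: fraction of planned traffic lost in state s (L(D)s).
pi = {"s0": 0.990, "s1": 0.006, "s2": 0.004}
blocked = {"s0": 0.0, "s1": 0.0, "s2": 0.3}

# Classic: a state contributes its full time share only if no traffic is lost.
classic = sum(p for s, p in pi.items() if blocked[s] == 0.0)

# Weighted: each state contributes its time share scaled by surviving traffic.
weighted = sum(p * (1.0 - blocked[s]) for s, p in pi.items())

print(round(classic, 6), round(weighted, 6))   # 0.996 0.9988
```

Note that weighted availability is never lower than classic availability, since partially surviving traffic still counts.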
Suppose the network can be found in a set of states S. S always contains the state s0 and may contain other failure states to consider, e.g., every single failure (exactly one faulty failure group), or every single and double failure. By doing so, we assume that the influence of the remaining network states (e.g., triple failures) is negligible.
Given a state s ≠ s0, the set of faulty failure groups is denoted by F(s). Assuming that faults in failure groups happen in a statistically independent way, the fraction of time πs in which the network is found in state s is given by:
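Under the independence assumption, and taking state s to mean that exactly the groups in F(s) are faulty, the standard expression is:

```latex
\pi_s = \prod_{f \in F(s)} \left(1 - A_f\right) \cdot \prod_{f \in F \setminus F(s)} A_f
```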
Hence, the fraction of time in which the network is found in state s0 is given by:
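Since s0 is the state in which no failure group is faulty, the general expression reduces to:

```latex
\pi_{s_0} = \prod_{f \in F} A_f
```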