Introduction

Like any engineering system, a network (or a part of it) can fail for a number of reasons: faulty hardware, software bugs, breakage of the physical medium (e.g. fiber cables), or power outages. These are examples of failures that affect specific and, generally, separate network elements. The point is that networks are designed to be resilient (robust) to failures, i.e., they are able to carry (possibly decreased) traffic demands even when a part of the network resources is temporarily failed.

In this report, the availability of the network under different failures is analyzed.

Let G(N,E) be a network, where N is the set of nodes and E is the set of links. Nodes and links are built on underlying physical resources (e.g. conduits, fiber cables, power supplies).

In order to study the reliability of a network, these resources or elements are grouped into failure groups, also known as Shared Risk Groups (SRGs).

A failure group is defined as a set of network elements that are affected by the same failure risk; therefore, all of its elements fail at the same time. Let F be the set of SRGs within the network, and let f ∈ F denote a failure group.

Characterization of failure states

The different failure situations are specified by the availability status of the links and nodes. Each failure state (situation) s is characterized by a vector of binary node availability coefficients α_ns and a vector of binary link availability coefficients α_es, where s = 0, 1, ..., S indexes the predefined list of failure situations. Although it is common practice in the literature, no single-failure assumption is made here. Instead, we use the concept of shared risk group or failure group.

A shared risk group (SRG) is a set of network elements that are collectively impacted by a specific fault or fault type. For example, a shared risk link group (SRLG) is the union of all the links on those fibers that are routed through the same physical conduit in a fiber-span network. Besides a shared conduit, this concept covers other types of shared risk, such as a shared fiber cable or a shared power supply.

Let F be the set of failure groups defined within the network, and let f ∈ F denote a failure group. Nodes and links can be attached to none, one or several failure groups. Given a node n, we denote F(n) as the set of failure groups associated with that node. Likewise, F(e) denotes the set of failure groups associated with link e. When a node or link is not attached to any failure group (e.g. F(n) = ∅), it is assumed that it will never suffer a fault. Otherwise, a node or link is down if any of its associated failure groups is faulty.

Example: In the network depicted in the following figure, the links traverse several conduits on their way from the origin node to the destination node. If a conduit is cut, all the links traversing it are broken. To model this, we assume that every conduit is a failure group; hence, each link is associated with the failure groups of the conduits it traverses: F(e12) = {f1, f3}, F(e13) = {f1, f2}, F(e23) = ∅.
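The rule above can be illustrated with a minimal Python sketch (the link names, failure groups and faulty state below are the hypothetical values of this example, not Net2Plan code):

    # Illustrative sketch: a link is down if any of its associated failure groups is faulty
    F_e = {
        "e12": {"f1", "f3"},   # link 1-2 traverses conduits f1 and f3
        "e13": {"f1", "f2"},   # link 1-3 traverses conduits f1 and f2
        "e23": set(),          # link 2-3 is not attached to any failure group
    }

    faulty = {"f1"}  # example state: conduit f1 is cut

    down_links = {e for e, groups in F_e.items() if groups & faulty}
    print(down_links)  # {'e12', 'e13'}; e23 never fails since F(e23) is empty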

Given a failure group f, we denote MTTF_f and MTTR_f as the Mean Time To Fail and Mean Time To Repair of the group, respectively. The availability A_f of a group f is defined as the fraction of time in which the group is up:
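A minimal expression consistent with this definition is the standard steady-state relation between MTTF and MTTR:

    A_f = MTTF_f / (MTTF_f + MTTR_f)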

We denote S as the set of possible states in which the network can be found. A network state s is defined by the set of affected failure groups. We denote s0 as the normal operation state in which no resource is faulty (s0=∅).

Given a path p, we denote F(p) as the set of failure groups associated with that path. Typically, this is the union of the failure group sets of the nodes and links within the path.
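In symbols, this convention reads:

    F(p) = ⋃_{n ∈ p} F(n)  ∪  ⋃_{e ∈ p} F(e)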

In the previous example, the path p = {e31, e12} is attached to failure groups F(p) = {f1, f2, f3}.

Likewise, given a demand d, we denote F(d) as the set of failure groups associated with any path carrying traffic of that demand. Finally, given a demand set D, we denote F(D) as the set of failure groups traversed by any demand in D.

As an assumption, when the network is in the state with no failures (s0), every demand carries 100% of its traffic. In contrast, when the network is found in a given state s ≠ s0, it is likely to lose a fraction of the traffic, depending on the fault and on the protection/restoration mechanism defined within the network.

Given a route p_d, a demand d, or a set of demands D, we define the concept of availability in a twofold manner: under a single failure state, and under a set of failure states; both computations are detailed in the corresponding sections below.

This report does not discuss the sequence of events by which network failures are monitored, detected, and mitigated. For more details on this aspect, see [1]. Likewise, the repair process following a failure is out of scope here.

Re-establishment mechanisms

An important feature of the mechanisms assuring network resilience is the way in which resources are re-established in case of failure, i.e., protection versus restoration. The term protection describes mechanisms that take action to restore connections before a failure happens, while restoration refers to mechanisms that take such actions after the failure. In fact, with protection no action may be needed at all and the network can still be resilient, e.g., in networks designed with bifurcated routing (also known as path diversity).

We classify re-establishment techniques into three different categories [2]: path, link, and subpath re-establishment.

In path re-establishment, when an element of a path fails, a backup route between the ingress node and the egress node of each active connection traversing the affected element must be found. In link re-establishment, a backup route between the nodes adjacent to the affected element must be found. Finally, in subpath re-establishment, a backup route between a node preceding the failure and a node after the failure must be found.

Options for re-establishment approaches

This section provides a discussion of the principles and options of re-establishment. The options are presented as atomic terms that may be combined to specify recovery approaches. Although some options are attractive, for the sake of simplicity they are not considered within Net2Plan.

Availability computation under a failure state

Given a network G(N, E), a traffic demand set D, a routing, a protection/restoration policy, and a network state s with its associated affected failure group set F(s), we can compute the blocked traffic L_s(p_d), L_s(d), L_s(D) as the fraction of the planned traffic for path p_d, for demand d or for demand set D, respectively, which was carried under state s0 but is not carried under state s.

From these values, we can compute the availability of paths, demands and the whole network in the following way:
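As a sketch of the relation implied by these definitions, the availability under state s is the fraction of planned traffic that remains carried, i.e. the complement of the blocked traffic:

    A_s(p_d) = 1 − L_s(p_d),    A_s(d) = 1 − L_s(d),    A_s(D) = 1 − L_s(D)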

Availability computation under a failure state set

Suppose that the network can be found in a set of states S. S always contains the state s0, and may contain other failure states to be considered; for example, every single failure (exactly one faulty failure group), or every single and double failure. By doing so, we are assuming that the influence of the remaining network states (e.g. triple failures) is negligible.

Given a state s ≠ s0, the set of faulty failure groups is denoted by F(s). Assuming that faults in failure groups happen in a statistically independent way, the fraction of time π_s in which the network can be found in state s is given by:
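Under that independence assumption, each group f is down a fraction 1 − A_f of the time and up a fraction A_f, so:

    π_s = ∏_{f ∈ F(s)} (1 − A_f) · ∏_{f ∈ F \ F(s)} A_f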

Hence, the fraction of time in which the network can be found under the state s0 is given by:
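Particularizing the previous expression to F(s0) = ∅:

    π_s0 = ∏_{f ∈ F} A_f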

where F is the set of all failure groups within the network. We can compute the fraction of time π_excess in which the network is in a state not considered in S as follows:
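Since the probabilities of all possible network states sum to one, this fraction is simply the remainder:

    π_excess = 1 − Σ_{s ∈ S} π_s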
If π_excess is small, the error made by not considering every possible network state is also small. From these values we can compute the availability of routes, demands and the whole network. Note that the states not considered in S are counted within the "no failure" state s0; that is, we obtain an optimistic estimation of the availability. A pessimistic estimation is obtained by assuming that, during the fraction of time π_excess, the network is faulty and its availability is zero.
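A sketch of the two estimations, where X stands for a route p_d, a demand d or the demand set D, and A_s(X) is the availability under state s as computed in the previous section:

    A_optimistic(X)  = Σ_{s ∈ S} π_s · A_s(X) + π_excess

    A_pessimistic(X) = Σ_{s ∈ S} π_s · A_s(X)

The following minimal Python sketch puts the pieces together for a single demand; the SRG availabilities, the considered states and the per-state carried fractions are hypothetical values chosen for illustration, not Net2Plan data or API calls:

    # Hypothetical SRG availabilities A_f
    A = {"f1": 0.999, "f2": 0.998, "f3": 0.9995}

    # Hypothetical per-state carried fraction 1 - L_s(d) for one demand d,
    # indexed by the set of faulty failure groups F(s)
    carried = {
        frozenset(): 1.0,          # s0: no failure, all planned traffic carried
        frozenset({"f1"}): 0.5,    # e.g. half of the traffic survives a cut of f1
        frozenset({"f2"}): 1.0,
        frozenset({"f3"}): 0.8,
    }

    # Considered state set S: s0 plus every single failure
    states = [frozenset()] + [frozenset({f}) for f in A]

    def pi(state):
        # Fraction of time spent exactly in this state (independent faults):
        # faulty groups weigh 1 - A_f, non-faulty groups weigh A_f
        p = 1.0
        for f, af in A.items():
            p *= (1.0 - af) if f in state else af
        return p

    pi_s = {s: pi(s) for s in states}
    pi_excess = 1.0 - sum(pi_s.values())

    # Pessimistic: time not covered by S counts as zero availability
    avail_pes = sum(pi_s[s] * carried[s] for s in states)
    # Optimistic: time not covered by S is lumped into s0 (fully available)
    avail_opt = avail_pes + pi_excess

    print("pi_excess = %.3g" % pi_excess)
    print("demand availability in [%.6f, %.6f]" % (avail_pes, avail_opt))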

Information tables

Per-demand availability

#perDemandInfo#

Network-wide availability

#networkInfo#

References

[1] V. Sharma and F. Hellstrand (Eds.), "Framework for Multi-Protocol Label Switching (MPLS)-based Recovery", RFC 3469, 2003.
[2] J. Wang, L. Sahasrabuddhe and B. Mukherjee, "Path vs. Subpath vs. Link Restoration for Fault Management in IP-over-WDM Networks: Performance Comparisons Using GMPLS Control Signaling", IEEE Communications Magazine, vol. 40, no. 11, pp. 80-87, 2002.