Designing Robust Systems through Synthesis of Parametric Markov Chains

Radu Calinescu, Milan Ceska, Simos Gerasimou, Marta Kwiatkowska, Nicola Paoletti,

We present a method for the synthesis of system designs that satisfy strict quality requirements, are Pareto-optimal with respect to a set of quality optimisation criteria, and are robust to variations in the system parameters. To this end, we model the design space of the system under development as a parametric continuous-time Markov chain (pCTMC) with discrete and continuous parameters that correspond to alternative system architectures and to the ranges of possible values for configuration parameters, respectively. Given this pCTMC and required tolerance levels for the configuration parameters, our method produces a sensitivity-aware Pareto- optimal set of designs, which allows the modeller to inspect the ranges of quality attributes induced by these tolerances, thus enabling the effective selection of robust designs. Through application to two systems from different domains, we demonstrate the ability of our method to synthesise robust designs with a wide spectrum of useful trade-offs between quality attributes and sensitivity.

The main contributions of this work are:

RODES: Robust Design Synthesis Method

System Modelling

A parametric continuous-time Markov chain (pCTMC) defines the design space of the system under development.
approach
A candidate design C(P',q) defines a fixed discrete parameter valuation and continuous parameter values from a small region.
approach
A quality attribute over pCTMCs-defined design spaces is defined using continuous stochastic logic (CSL).
approach
A quality requirement of a system with design space given by a pCTMC C(P,Q) can be:
    • An objective function fi, i ∈ I, that corresponds to a quality system attribute and defined in terms of a CSL path formula φi, i ∈ I, such that:
      approach
    • A boolean constraint cj, j ∈ J, that corresponds to a quality system attribute and defined in terms of a CSL path formula φi, i ∈ I, and a threshold ~j rj such that:
      approach
  • Parametric Markov Chain Synthesis Problem

    Consider a system with design space C(P,Q), quality requirements given by objective functions {fi}i ∈ I and constraints {cj}j ∈ J, and designer-specified tolerances k}k ∈ K  for the continuous parameters of the system.
    The set of feasible designs F for the system, i.e., of candidate designs that meet the tolerances k}k ∈ K  and satisfy the constraints {cj}j ∈ J is given by:
    approach
    The parametric Markov chain synthesis problem consists of finding the Pareto-optimal set PS of candidate designs (i.e. pCTMCs) with tolerances k}k ∈ K that satisfy the constraints {cj}j ∈ J  and are non-dominated with respect to the objective functions {fi}i ∈ I:
    approach
    The sensitivity-aware dominance relation '≺' between two candidate designs is defined as follows.
    approach

    Parametric CTMC Synthesis Method

    Computing the Pareto-optimal design set PS is typically unfeasible, as the design space C(P,Q) is infinite due to its real-valued parameters. Moreover, every candidate design C(P′, q) corresponds to an infinite set of CTMCs that cannot all be analysed to establish its quality and sensitivity.
    Our pCTMC synthesis method, whose high-level steps are shown in Algorithm 1 below, combines search-based software engineering (SBSE) techniques with techniques for the effective analysis of pCTMCs, and produces a close approximation of the Pareto-optimal design set.
    approach

    Function AnalyseDesign

    This function establishes the quality attributes and sensitivity of candidate designs by employing precise parameter synthesis techniques to compute safe enclosures of the satisfaction probability of CSL formulae over pCTMCs.
    Given a pCTMC C(P′,q) and a CSL path formula φ these techniques provide a safe under-approximation ∧qmin and a safe over-approximation ∧qmax of the minimal and maximal probability that C(P′,q) satisfies φ:
    approach
    This allows us to safely approximate the bounds {fi ⊥, fi ⊤}i ∈ I  of the objective functions, and the constraints {cj}j ∈ J.

    Function GetCandidateDesigns

    This function uses established multiobjective optimisation genetic algorithms (MOGAs), e.g., NSGA-II or SPEA2, and encodes a candidate design C(P′,q) that satisfies a fixed set of tolerances k}k ∈ K   by the gene tuple (p, q), where p ∈ P is the centre point of the continuous parameter region P′.
    Given a population of candidate designs, several GetCandidateDesigns executions evolve the population into 'fitter' designs using MOGA selection, crossover and mutation. The evolution of the design population terminates after a fixed number of design evaluations or when a predetermined number of successive iterations generate populations with no significantly fitter designs.

    Software Systems Used to Evaluate RODES

    Google File System

    The Google File System (GFS) represents the replicated file system used by Google’s search engine. GFS partitions files into chunks of equal size, and stores copies of each chunk on multiple chunk servers. A master server monitors the locations of these copies and the chunk servers, replicating the chunks as needed. During normal operation, GFS stores CMAX copies of each chunk. However, as servers fail and are repaired, the number c of copies for a chunk may vary from 0 to CMAX.
    We adapted the CTMC model of the lifecycle of a GFS chunk by considering several continuous and discrete parameters that a designer of the system needs to decide. To evaluate our design synthesis method, we assume that GFS designers must select the hardware failure and repair rates cHardFail and cHardRepair of the chunk servers, and the maximum number of chunks NC stored on a chunk server within the designer-specified ranges. These parameters reflect the fact that designers can choose from a range of physical servers, can select different levels of service offered by a hardware repair workshop, and can decide a maximum workload for chunk servers.
    The GFS model is encoded in the PRISM modelling language extended with the evolve constructs.
    approach
    We formulate a pCTMC synthesis problem for quality requirements given by two maximising objective functions and one constraint:
    ID Description
    f1 φ11 = ¬SL1 U[10,60] SL1, where SL1 = M_up ∧ c > 0 holds in states where service level 1 (master up and at least one chunk copy available) is provided
    f2 φ2 = C≤60, where a reward of 1 is assigned to the states where the number of running chunk servers is at least 0.5M.
    c1 ψ1 = C≤60 with threshold ~1 r1 ≡ '≤ 5', where a transition reward of 1 is assigned to each chunk replication transition.

    Workstation Cluster

    We extend the CTMC models of a cluster availability management system. This CTMC models a system comprising two sub-clusters, each with N workstations and a switch that connects the workstations to a central backbone. For each component, we consider failure, inspection and repair rates (where repairs are initiated only after an inspection detects failures), and we assume that designers must decide these rates for workstations, i.e., the real-valued parameters wsFail, wsFail and wsRepair for our pCTMC, respectively. Additionally, we assume that designers must select the sub-cluster size N, and must choose between an expensive repair implementation (i.e., pCTMC module) with a 100% success probability and a cheaper repair module with 50% success probability.
    The cluster model is encoded in the PRISM modelling language extended with the evolve constructs.
    approach
    We formulate a pCTMC synthesis problem for quality requirements given by two maximising objective functions and one constraint:
    ID Description
    f1 φ1 = ¬premium U[20,100] premium, where premium denotes a system service where at least 1.25N workstations are connected and operating
    f2 φ2 = C≤100, where the reward 1 is assigned only to the states where the number of operation clusters is between 1.2N and 1.6N.
    c1 ψ1 = C≤100 with threshold ~1 r1 ≡ '≤ 80', where transition rewards are associated with repair actions of the workstations, switches and backbone.

    Evaluation

    We used RODES to generate Pareto-optimal design sets for the Google File system and the cluster management system, with tolerances γ ∈ {0.01, 0.02, 0.05} for the evolvable parameters of the two pCTMCs, and parameters ε1 = ε2 = ε ∈ {0, 0.05, 0.1}.

    Evaluating RODES on the Google File System

    The figure below shows the Pareto fronts obtained using the "lower bound" definition for the objective functions f1 (x-axis) and f2 (y-axis) over candidate designs, and parameters ε1 = ε2 = ε ∈ {0, 0.05, 0.1}. Boxes represent quality-attribute regions, coloured by sensitivity (red: sensitive, blue: robust). Red-bordered boxes and arrows indicate the sub-optimal robust designs. For each front, we report mean sensitivity and mean volume.
    Sensitivity-aware Pareto fronts for the GFS model over objectives f1 (x-axis) and f2 (y-axis).
    The figure below shows the corresponding synthesised sensitivity-aware Pareto-optimal designs set. Rectangles in x-y plane correspond to the continuous parameter regions (cHWF: hardware failure rate; cHWR: hardware repair rate) induced by the tolerance γ. The box heights show the value of the discrete parameter NC. Boxes are coloured by sensitivity.
    Synthesised Pareto-optimal designs for the GFS model (cHWF: hardware failure rate; cHWR: hardware repair rate; NC: number of chunks).
    Statistical Analysis
    We also analysed the performance of our NSGA-II-based RODES against a tool variant that uses random search (RS). For each tool variant and combination of γ ∈ {0.01, 0.02} and ε ∈ {0, 0.05, 0.1} we carried out 30 independent runs. Then, we compared the Pareto fronts achieved by each variant against this reference front by using the following metrics
    • M1 = w x Iεnorm+ (1−w) x sensnorm
    • M2 = w x IIGD;norm+ (1−w) x sensnorm
    which use a weight w ∈ [0, 1] to combine normalised versions of the normalised design sensitivity with the following established (but sensitivity-agnostic) Pareto-front quality indicators:
    • Unary additive epsilon (Iε), which measures the minimum additive term by which the elements of the objective vectors from a Pareto front approximation must be adjusted so as to dominate the objective vectors from the reference front. Smaller values denote Pareto front approximations with better convergence.
    • Inverted Generational Distance (IIGD), which provides an "error measure" as the Euclidean distance in the objective space between the reference front and the Pareto front approximation. Smaller values denote Pareto front approximations with better convergence and diversity.
    The figure below compares RODES and RS across our ε, γ combinations using metrics M1 and M2 with w = 0.5.
    RODES vs. random search (RS) comparison for combinations of γ ∈ {0.01, 0.02} and ε ∈ {0, 0.05, 0.1}, over 30 independent runs. For both metrics M1(left) and M2(right) smaller is better.

    Evaluating RODES on the Cluster Availability Management System

    As for GFS, the figure below shows the Pareto fronts obtained using the "lower bound" definition for the objective functions f1 (x-axis) and f2 (y- axis) over candidate designs, and parameters ε1 = ε2 = ε ∈ {0, 0.05, 0.1}. Boxes represent quality-attribute regions, coloured by sensitivity (red: sensitive, blue: robust). Red-bordered boxes and arrows indicate the sub- optimal robust designs. For each front, we report mean sensitivity and mean volume.
    Sensitivity-aware Pareto fronts for the cluster model over objectives f1 (x-axis) and f2 (y-axis).
    The figure below shows the corresponding synthesised sensitivity-aware Pareto-optimal designs set. Rectangles in x-y plane correspond to the continuous parameter regions (cHWF: hardware failure rate; cHWR: hardware repair rate) induced by the tolerance γ. The box heights show the value of the discrete parameter NC. Boxes are coloured by sensitivity.
    The "projection" of the quality-attribute regions on the f1-N plane corresponding to synthesised sensitivity-aware designs. Synthesised Pareto-optimal designs for the GFS model (N: number of workstation in each sub-cluster).
    The table below shows the average design sensitivity for two variants of the synthesis problem corresponding to different ranges of possible values for the parameter N, for several combinations of $γ and ε. The results show that sensitivity-aware designs (i.e. where ε > 0) for N ∈ {10..15} have lower sensitivity than for N ∈ {11..14}, validating our observation that more robust designs are obtained for N=10 or N=15.