
Effective exchange of information in multi-robot systems is one of the grand challenges of today's robotics. Here, we address the problem of simultaneously maximizing (i) the resilience to faults and (ii) the area coverage of dynamic multi-robot topologies. We want to avoid the onset of single points of failure, i.e., situations in which the loss of a single robot compromises the connectivity or the task of the entire team.


Introduction
In this paper, we consider the problem of achieving resilience in a system composed of multiple robots that use a wireless network to exchange data and coordinate towards a common goal. Resilience was defined in [1] as a property "about systems that can bend without breaking. Resilient systems are self-aware and self-regulating, and can recover from large-scale disruptions". In this paper, we consider the effect of single robots' failures and of unreliable communication on the overall performance of the multi-robot system: resilience is then represented by how gracefully the performance of the overall system degrades in the presence of such failures. A key ingredient for achieving resilience to failures is redundancy: the presence of multiple entities that can achieve a task leads to the possibility of success even when a limited number of such entities fail. This is a trait of swarms and multi-robot systems, where the overall capability of the system is achieved as the combination of the capabilities of the single robots. However, having a large number of robots does not, per se, imply redundancy (and thus resilience). In fact, situations may exist in which even a single robot has a critical role, and its failure renders the completion of a task impossible.
For instance, in groups of heterogeneous robots, each robot has different capabilities (with regard to sensing, mobility, actuation, etc.). If all the instances of a critical capability reside in a single robot, its failure becomes one of the system's most serious hazards.
The rest of this paper is organized as follows: Section 2 contextualizes our work among several other related contributions from recent years; we present background theory regarding network properties in Section 3; we discuss the multi-robot system model under evaluation in Section 4; and Section 5 outlines the proposed control architecture. Then, its integration with an online optimization strategy is described in Section 6, and we discuss our simulation results. In Section 7, we introduce our real-life robotic set-up and our experimental campaign, and we comment on the obtained results. Finally, Section 8 concludes the article.
Appendices A and B discuss our choice of a scalarizing function and two fault-injection modes, respectively.

Related work
Swarm robotics is a research field that lies at the intersection of robotics and multi-agent systems and deals with large collections of relatively simple, mostly homogeneous, autonomous robots. Swarm intelligence [13], in particular, investigates the coordinated behaviours of these multi-agent systems, while swarm engineering [14] provides tools and methodologies to mimic them. Krupke et al. [30] proposed a heuristic, multi-component control law allowing robots to follow multiple leaders without breaking the robotic network. Panerati et al. [31] described the recursive creation of robotic chains using situated communication and a distance gradient. Banfi et al. [32] used Integer Linear Programming to optimally redeploy "a team of mobile robots acting as communication relays". The work of Majcherczyk et al. [33] aimed at constructing a logical tree topology and compared the performance of its outwards and inwards creation. Much of the work described up to this point, however, implements either centralized or heuristic, best-effort approaches. For the sake of scalability and theoretical soundness, we base this work on algebraic connectivity instead. In spectral graph theory, algebraic connectivity is a proxy measure for the connectedness of a network. Despite representing a global property of a graph, algebraic connectivity can be estimated in a distributed fashion using the Laplacian matrix of the graph. Bertrand and Moonen [34] showed that the distributed computation of the second smallest eigenvalue of a Laplacian (i.e., the algebraic connectivity, λ_2, or just λ) and of the associated eigenvector (i.e., the Fiedler vector) can be achieved using the power iteration method and a normalization based on "cooperative diffusion".
Sahai et al. [35] proposed a "wave propagation"-based approach and local fast Fourier transforms to compute all the eigenvalues and the local components of each eigenvector of a graph. Di Lorenzo and Barbarossa [36] presented a stochastic power iteration method that allows each node to estimate the algebraic connectivity and use it to adapt its own transmission power. Poonawala and Spong [7] studied the decentralized estimation of algebraic connectivity in strongly connected networks. Finally, Khateri et al.
[9] compared local connectivity maintenance approaches (preserving all the initial links) with global connectivity maintenance approaches (preserving algebraic connectivity), concluding that the former can be quicker and simpler, yet the latter allow covering larger workspaces.
Several approaches to improve inter-robot communication effectively exploit control strategies that maximize algebraic connectivity as a way to preserve connectedness. Ji and Egerstedt [8] proposed multiple nonlinear feedback laws based on the Laplacian of a graph to solve the rendez-vous and formation-control problems while ensuring connectedness. De Gennaro and Jadbabaie [37] used an exponential decay model to characterize communication links and a potential-based control law that maximizes λ_2 through the supergradient method. Similarly, Yang et al. [3] and Sabattini et al. [4] implemented decentralized, power iteration-based estimation of λ_2 and gradient-based control. Robuffo Giordano et al. [38] enriched this class of control methodologies with the collision avoidance of static obstacles. Gasparri et al. [6] brought it to real-life experimentation with up to four robots. Yet, most of these works overlook certain subtleties required for robust real-world implementations, e.g., the presence of hard and soft errors, or adversarial behaviors.
In their review of fault tolerance for robot swarms, Winfield and Nembrini [39] pointed out (i) motor failures and (ii) communication failures as hazard types number one and two (in a list of six). Robotic hardware failures and unreliable communication are, in fact, the non-idealities that we inject in our experiments (see Appendix B). Spanos and Murray [40] originally proposed a locally computable robustness metric called "the geometric connectivity robustness", and their work mostly revolved around modelling it within a purely mathematical framework. Cheng and Wang [41] proposed a hierarchy-based method to "re-organize robot teams that require connectivity when robots fail". Using hierarchical graphs, however, can increase the approach's fragility towards the leaders' failures. Hollinger and Singh [42] took a completely different road and proposed a methodology that does not enforce continual connectivity but, rather, only periodic connectivity. Despite the real-world experiments and encouraging performance, this problem still turns out to be NP-hard. In another alternative approach to a similar problem, Caccamo et al. [43] proposed a communication-aware motion planner "with autonomous repair of wireless connectivity". Yet, this work relies on the existence of fixed-location access points.

The contribution of this article stems from the theoretical work in [45,10,46] about the simultaneous control of connectivity (through λ_2) and robustness (of the multi-robot network towards faults). In [11,47], we originally validated the control law on real robots and in the presence of faults, through the manual screening of many combinations of control gains. In [12], we showed that the selection of these control gains can be delegated to autonomous, online optimization. Yet, we did not investigate the interplay of this level of autonomy with the error models of [11]; in this work, we finally fill the gap. We do so by carrying the control laws and algorithms presented in [12] into the real robotic setup of [11], including two types of fault injection.

Background on network properties

Consider an undirected graph G, where V(G) and E(G) ⊂ V(G) × V(G) are the vertex set and the edge set, respectively. Moreover, let W ∈ R^(N×N) be the weight matrix: each element w_ij is a positive number if an edge exists between nodes i and j, and zero otherwise. Since G is undirected, w_ij = w_ji. Let L ∈ R^(N×N) be the Laplacian matrix of graph G and D = diag{k_i} be the degree matrix, where k_i is the degree of the i-th node of the graph, i.e., k_i = Σ_j w_ij. The (weighted) Laplacian matrix of the graph is then defined as:

L = D − W.    (1)

The Laplacian matrix of an undirected graph G exhibits some remarkable properties regarding its connectivity [48]. Let λ_i, i = 1, . . . , N, be the eigenvalues of the Laplacian matrix; then:

• The eigenvalues are real and can be ordered such that 0 = λ_1 ≤ λ_2 ≤ . . . ≤ λ_N.
• Define now λ = λ_2. Then, λ > 0 if and only if the graph is connected.
Therefore, λ is defined as the algebraic connectivity of the graph: in a weighted graph, λ is a non-decreasing function of each edge weight. The algebraic connectivity is a good estimator of how well a graph is connected. While global connectivity is a Boolean property of a graph, larger values of λ indicate that the removal of more edges can be tolerated before a disconnection occurs. However, λ cannot express the robustness of the graph topology to failures of its elements with regard to connectivity maintenance, i.e., how much a graph can tolerate losing edges or vertices without fragmenting.
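As a concrete illustration (our own sketch, not part of the paper's distributed pipeline), the algebraic connectivity of a small weighted graph can be computed directly from its Laplacian; the example graph below is hypothetical:

```python
import numpy as np

def algebraic_connectivity(W: np.ndarray) -> float:
    """Return lambda_2, the second-smallest eigenvalue of the weighted Laplacian."""
    D = np.diag(W.sum(axis=1))       # degree matrix, k_i = sum_j w_ij
    L = D - W                        # weighted Laplacian L = D - W
    eigvals = np.linalg.eigvalsh(L)  # real and sorted ascending, since L is symmetric
    return float(eigvals[1])

# Hypothetical example: a path graph 1-2-3 with unit weights.
W = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
lam = algebraic_connectivity(W)      # lambda_2 > 0, hence the graph is connected
```

In a centralized setting this is a few lines of linear algebra; the distributed estimation discussed above (power iteration with cooperative diffusion) is what makes the same quantity usable on board each robot.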
The robustness to failures is related to some topological properties of the interconnection graph, mainly its degree distribution. Some nodes play important roles in the formation of the topology: they are called central nodes. These nodes are crucial to the network communication, and their failure will likely have a significant effect on the overall network connectivity. Therefore, evaluating the impact of central node failures on the network connectivity provides a means to assess its robustness to failures.
In this direction, the robustness level proposed in [45] relies on the concept of betweenness centrality (BC) [49] for evaluating the network robustness. BC assigns higher scores to nodes that are contained in most of the shortest paths between every pair of nodes in the network. Thus, removing nodes according to their BC ranking, from highest to lowest values, might quickly lead to network fragmentation. The robustness level is defined as follows:

Definition 1 (Robustness level [45]). Consider a graph G with N nodes. Let [v_1, . . . , v_N] be the list of nodes ordered by descending value of BC. Let ϕ < N be the minimum index i ∈ [1, . . . , N] such that removing nodes [v_1, . . . , v_i] leads to fragmentation, that is, the graph including only nodes v_(ϕ+1), . . . , v_N being disconnected. Then, the network robustness level of G is defined as Θ(G) = ϕ/N.

Larger values of Θ(G) indicate a network that is more robust to failures. We observe that Θ(G) is only an estimate of how far the network is from getting disconnected w.r.t. the fraction of nodes removed. In fact, different orderings of nodes with the same BC might produce different values of Θ(G).
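A minimal, centralized sketch of Definition 1 (our own illustration, not the authors' implementation) computes betweenness centrality with Brandes' algorithm and removes nodes in descending BC order until the residual graph fragments:

```python
from collections import deque

def betweenness(adj):
    """Brandes' betweenness centrality for an unweighted graph (adjacency lists)."""
    n = len(adj)
    bc = [0.0] * n
    for s in range(n):
        stack, pred = [], [[] for _ in range(n)]
        sigma = [0] * n; sigma[s] = 1      # number of shortest paths from s
        dist = [-1] * n; dist[s] = 0
        q = deque([s])
        while q:
            v = q.popleft(); stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1; q.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]; pred[w].append(v)
        delta = [0.0] * n
        while stack:                        # back-propagate path dependencies
            w = stack.pop()
            for v in pred[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return bc

def is_connected(adj, alive):
    """BFS connectivity check restricted to the surviving nodes."""
    nodes = [v for v in range(len(adj)) if alive[v]]
    if len(nodes) <= 1:
        return True
    seen = {nodes[0]}
    q = deque([nodes[0]])
    while q:
        v = q.popleft()
        for w in adj[v]:
            if alive[w] and w not in seen:
                seen.add(w); q.append(w)
    return len(seen) == len(nodes)

def robustness_level(adj):
    """Theta(G): fraction of highest-BC removals needed to fragment the graph."""
    n = len(adj)
    bc = betweenness(adj)
    order = sorted(range(n), key=lambda v: -bc[v])
    alive = [True] * n
    for i, v in enumerate(order, start=1):
        alive[v] = False
        if not is_connected(adj, alive):
            return i / n
    return 1.0

# Hypothetical example: path graph 0-1-2-3-4; removing the central node fragments it.
adj = [[1], [0, 2], [1, 3], [2, 4], [3]]
theta = robustness_level(adj)  # phi = 1, N = 5, so Theta = 0.2
```

Ties in BC are broken arbitrarily by the sort, which mirrors the remark above that different orderings of equally central nodes may yield different values of Θ(G).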
From a local perspective of robustness assessment, a heuristic can estimate the vulnerability of a node by means of the information acquired from its 1-hop and 2-hop neighborhoods. We summarize this vulnerability assessment as follows: let d(v, u) be the length of the shortest path between nodes v and u, i.e., the minimum number of edges connecting them. Define Π(v) as the set of nodes that are at a minimum distance of at most 2 hops from v:

Π(v) = {u ∈ V(G) : 0 < d(v, u) ≤ 2}.    (2)

Moreover, let |Π(v)| be the cardinality of Π(v), and let Π_2(v) ⊆ Π(v) be the set of 2-hop neighbors of v, comprising only the nodes whose shortest path from v has length exactly 2:

Π_2(v) = {u ∈ Π(v) : d(v, u) = 2}.    (3)

Larger values of d would lead to exponentially larger computational requirements that cannot be justified for an approximate approach.¹ Now let Path_β(v) ⊆ Π_2(v) be the set of v's 2-hop neighbors that are reachable through at most β paths:

Path_β(v) = {u ∈ Π_2(v) : u is reachable from v through at most β distinct 2-hop paths}.    (4)

Hence, the value of |Path_β(v)| is an indicator of the magnitude of a node's fragility w.r.t. connectivity, and the vulnerability level of a node with regard to failures is given by P_θ(v) ∈ (0, 1):

P_θ(v) = |Path_β(v)| / |Π(v)|.    (5)

We will hereafter use β = 1, in order to identify 2-hop neighbors that are connected by a single path, which represents a critical situation for network connectivity in scenarios of failures. A larger value of P_θ(v) increases the probability that a robot sets itself as vulnerable, thus improving the network's robustness.
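The 2-hop vulnerability assessment can be sketched as follows (our illustration; counting shortest 2-hop paths as the membership rule for Path_β is our reading of the definitions above):

```python
from collections import deque

def vulnerability(adj, v, beta=1):
    """Estimate P_theta(v) = |Path_beta(v)| / |Pi(v)| from v's 2-hop neighborhood."""
    # Breadth-first search truncated at depth 2, counting shortest paths.
    dist = {v: 0}
    paths = {v: 1}                     # number of shortest paths from v
    q = deque([v])
    while q:
        u = q.popleft()
        if dist[u] == 2:               # do not expand beyond 2 hops
            continue
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                paths[w] = paths[u]
                q.append(w)
            elif dist[w] == dist[u] + 1:
                paths[w] += paths[u]
    pi = [u for u in dist if u != v]                   # Pi(v): nodes within 2 hops
    pi2 = [u for u in pi if dist[u] == 2]              # Pi_2(v): exactly 2 hops away
    path_beta = [u for u in pi2 if paths[u] <= beta]   # weakly reachable 2-hop nodes
    return len(path_beta) / len(pi) if pi else 0.0

# Hypothetical example: path 0-1-2; node 2 is reachable from 0 through a single path.
adj = {0: [1], 1: [0, 2], 2: [1]}
p_theta = vulnerability(adj, 0)  # |{2}| / |{1, 2}| = 0.5
```

Because only the 2-hop neighborhood is explored, each robot can run this assessment with purely local information, which is what makes the heuristic usable in a decentralized controller.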

System model and problem formulation
We assume a team of N mobile robots that are able to communicate with each other within a communication radius R, resulting in a communication topology represented by an undirected graph G.
Let the state of each robot be its position p_i ∈ R^m, and let p = [p_1^T . . . p_N^T]^T ∈ R^(Nm) be the state vector of the multi-robot system. Let each robot be modeled as a single-integrator system whose velocity can be directly controlled:

ṗ_i = u_i,    (6)

where u_i ∈ R^m is the control input.

For each robot, the control input has to be designed so that a global objective can be accomplished. As a proof of concept, in the rest of the paper we will refer to a scenario in which the robots are controlled to spread over a given area while avoiding collisions. However, the proposed methodology can be easily extended to other coordinated control objectives [47]. Note that connectivity is only guaranteed in fault-free environments, because failures have an unpredictable nature and cascading failures can seriously damage the system's connectivity. On the other hand, the mechanism for resilience improvement was demonstrated to be able to postpone or avoid network fragmentation, including cases in which failures are concentrated over short time spans [10].

Overview of the control architecture
Referring to the kinematic model in Equation (6), in the following we consider each robot to be controlled by means of a control input defined as the superposition of three different terms, that is:

u_i = σ_i u_i^c + ζ_i u_i^r + ψ_i u_i^d,    (7)

The components of the control input are defined as follows:

• The term u_i^c ∈ R^m represents the connectivity preservation control input. The role of this control input is to enforce that, if the communication graph is initially connected, then it will remain connected as the system evolves.

• The term u_i^r ∈ R^m represents the topology resilience improvement control input. Its role is to reduce the vulnerability of the network topology to robot failures.

• The term u_i^d ∈ R^m represents the desired control action. This encodes the coordinated objective that the multi-robot system needs to achieve.
In this paper, we consider the objective to be the uniform coverage of a given area.
The gains σ_i, ψ_i, ζ_i ∈ R^+ define the relative importance of the separate control laws.

It is worth noting that the overall behavior of the multi-robot system is determined by the way in which each individual control action is defined and by how they are combined. Indeed, a different choice of the linear combination gains leads to a different behavior of the multi-robot system.
In the following subsections, we introduce the individual control actions which are considered for implementation in the rest of the paper.

Connectivity preservation
The connectivity preservation control term u_i^c is designed, as in [4], to ensure that the value of the algebraic connectivity λ never goes below a given threshold ε > 0. As in [4], the following energy function can be used for generating the decentralized connectivity maintenance control strategy:

V(λ) = coth(λ − ε),    (8)

defined for λ > ε. The control law is designed to drive the robots to perform a gradient descent of V(·), which ensures the preservation of the graph connectivity. Considering the robot model introduced in (6), the control law is defined as follows:

u_i^c = −∂V(λ)/∂p_i = csch²(λ − ε) · ∂λ/∂p_i.    (9)

We observe that the connectivity preservation framework can be enhanced to also consider additional objectives. In particular, as shown in [38], the concept of generalized connectivity can be utilized to simultaneously guarantee connectivity maintenance and collision avoidance, both with environmental obstacles and among the robots.
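As an illustration of the gradient step (a centralized sketch with hypothetical exponential edge weights w_ij = exp(−‖p_i − p_j‖²/2σ²), not the paper's decentralized estimator), the gradient of λ w.r.t. a robot's position can be obtained from the Fiedler vector ν via the known spectral identity ∂λ/∂p_i = Σ_j (∂w_ij/∂p_i)(ν_i − ν_j)²:

```python
import numpy as np

def fiedler(W):
    """Return (lambda_2, Fiedler vector) of the weighted Laplacian of W."""
    L = np.diag(W.sum(axis=1)) - W
    vals, vecs = np.linalg.eigh(L)      # ascending eigenvalues, orthonormal vectors
    return vals[1], vecs[:, 1]

def connectivity_gradient(p, sigma=1.0):
    """d lambda_2 / d p_i for weights w_ij = exp(-||p_i - p_j||^2 / (2 sigma^2))."""
    n = p.shape[0]
    d2 = ((p[:, None, :] - p[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma**2))
    np.fill_diagonal(W, 0.0)
    lam, nu = fiedler(W)
    grad = np.zeros_like(p)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # d w_ij / d p_i = -w_ij * (p_i - p_j) / sigma^2
            dw = -W[i, j] * (p[i] - p[j]) / sigma**2
            grad[i] += dw * (nu[i] - nu[j]) ** 2
    return lam, grad

# Hypothetical example: three robots on a line; the gradient pulls the outer robots inward.
p = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
lam, grad = connectivity_gradient(p)
```

Following this gradient (scaled by the csch² factor of the energy function) moves each robot so as to increase λ, which is the ascent/descent mechanism the control term exploits.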

Topology resilience improvement
The topology resilience improvement control term u_i^r is designed, in accordance with the methodology defined in [46,10], to drive the robots toward an improved, less vulnerable topology. Considering the robot model introduced in (6), the control law is defined as follows:

u_i^r = ξ_i α (x̄_i^β − p_i) / ‖x̄_i^β − p_i‖,    (10)

where x̄_i^β ∈ R^m is the barycenter of the positions of the robots in Path_β(i) (see Equation (4) for its computation) and α ∈ R^+ is a scalar coefficient setting the velocity magnitude of each robot².

Parameter ξ_i takes into account the vulnerability state of a node i, i.e., ξ_i = 1 if node i identifies itself as vulnerable, and ξ_i = 0 otherwise. As in [46,10], we set as vulnerable those robots i exhibiting high values of P_θ(i): ξ_i is then defined as follows:

ξ_i = 1 if P_θ(i) > r, and ξ_i = 0 otherwise,    (11)

where r ∈ (0, 1) is a random number drawn from a uniform distribution, i.e., if P_θ(i) > r, then the i-th robot considers itself vulnerable. It is worth remarking that, according to (5), each robot can evaluate its vulnerability level in a decentralized manner.

² Pathological situations may exist in which (10) is not well defined, namely when p_i = x̄_i^β. However, this corresponds to the case where the i-th robot is exactly at the barycenter of its weakly connected 2-hop neighbors: in practice, this never happens when a robot detects itself as vulnerable.
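The stochastic vulnerability switch and the resilience control step can be sketched as follows (our illustration; the unit-vector form of the control toward the barycenter is our reading of the text, and the gain value is hypothetical):

```python
import random
import numpy as np

def resilience_control(p_i, neighbors_path_beta, p_theta, alpha=0.1, rng=random):
    """u_i^r: move toward the barycenter of the weakly connected 2-hop neighbors."""
    xi = 1 if p_theta > rng.random() else 0    # stochastic vulnerability switch
    if xi == 0 or not neighbors_path_beta:
        return np.zeros_like(p_i)
    bary = np.mean(neighbors_path_beta, axis=0)
    direction = bary - p_i
    norm = np.linalg.norm(direction)
    if norm == 0.0:                            # pathological case noted in footnote 2
        return np.zeros_like(p_i)
    return alpha * direction / norm            # constant-speed step of magnitude alpha

# Hypothetical example: an always-vulnerable robot (P_theta = 1) moves toward (2, 0).
u = resilience_control(np.array([0.0, 0.0]), [np.array([2.0, 0.0])], p_theta=1.0)
```

Drawing a fresh uniform r at every step means a robot with a high P_θ is *likely*, but not guaranteed, to act on its vulnerability in any given step, which keeps the behavior probabilistic and fully local.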

Area coverage and collision avoidance
To control the robots to evenly spread over a given area while avoiding collisions, we propose to use the well-known control strategy based on the Lennard-Jones potential [14]. At distance x from its origin, the potential equation is:

V(x) = ι (δ^a / x^a − δ^b / x^b).    (12)

When considering robot i and its multiple neighboring robots j ∈ N(i), the desired control action can be written as:

u_i^d = −Σ_(j∈N(i)) ∇_(p_i) V(x_ij),    (13)

where parameters ι and δ define the shape of the potential function and x_ij is the inter-robot distance between i and j. Exponents a and b are set to 4 and 2, respectively. For the sake of collision avoidance, we set δ to be larger than the communication range of the robots.
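The Lennard-Jones-based coverage term can be sketched as follows (our illustration, assuming the generalized form V(x) = ι(δ^a/x^a − δ^b/x^b) with a = 4, b = 2; the numeric gains are hypothetical):

```python
import numpy as np

def lj_force(x, iota=1.0, delta=1.0, a=4, b=2):
    """Magnitude of -dV/dx for V(x) = iota * (delta^a / x^a - delta^b / x^b)."""
    return iota * (a * delta**a / x**(a + 1) - b * delta**b / x**(b + 1))

def coverage_control(p_i, neighbor_positions, **kw):
    """u_i^d: sum of Lennard-Jones forces along the directions to the neighbors."""
    u = np.zeros_like(p_i)
    for p_j in neighbor_positions:
        diff = p_i - p_j
        x = np.linalg.norm(diff)
        u += lj_force(x, **kw) * diff / x   # repulsive when close, attractive when far
    return u

# The force vanishes at the equilibrium distance x* = delta * (a/b)^(1/(a-b)) = delta*sqrt(2).
x_star = np.sqrt(2.0)
f_eq = lj_force(x_star)
```

The single equilibrium distance is what makes this potential attractive for coverage: pairs of robots repel below x* (collision avoidance) and attract above it (cohesion), so the team relaxes toward a roughly uniform spacing.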

Optimized control strategy
This section presents the methodology that we used to perform the online optimization of the control gains σ_i, ψ_i, ζ_i introduced in Equation (7). The goal is to allow each robot to identify the most appropriate set of parameters as the system evolves.

The ideal performance is defined starting from the desired global behaviour, that is, achieving the largest area coverage while keeping a high level of connectivity. For this multi-objective problem, we define the following scalarizing function:

f_obj(t) = λ_2(t) · A(t),    (14)

where λ_2(t) is the algebraic connectivity of the communication graph at time t, and A(t) is the value of the covered area at time t (see also Appendix A for a discussion of alternative implementations of Equation (14)).
The choice of this scalarizing function is motivated by the fact that we are dealing with a multi-objective problem comprising two performance metrics with different domains and ranges; a straightforward way to avoid an adaptive normalization scheme is to consider the metrics' product [50]. We also observe that the intent of our work is to optimize the control law for the algebraic connectivity λ and the area coverage A only. The robustness component u^r does not represent an objective per se but rather a hint to the multi-robot system, making it more robust and resilient once faults (imperfect communication and robotic failures) are injected.

We are now left with the task of selecting an optimization methodology that allows us to find the ideal combination of the gains σ, ψ, ζ such that the objective function introduced in (14) is maximized. We observe that the scalarizing function we selected (according to the considerations made in Appendix A) is the product of nonlinear functions, that is, the algebraic connectivity and the covered area. Consequently, we searched among optimization methods that are not too computationally expensive but are also well suited for such nonlinear problems. Whenever an optimization method requires the gradient of (14), numerical differentiation is exploited.
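The generate-and-evaluate pattern underlying all the optimizers can be sketched as follows (our illustration; plain random search stands in for the paper's optimizers, and the objective is a toy surrogate rather than the true λ_2 · A):

```python
import random

def objective(gains, state):
    """Toy stand-in for f_obj = lambda_2 * A after one predicted control step."""
    sigma, psi, zeta = gains
    # Hypothetical smooth surrogate, maximized at sigma = psi = zeta = 1.
    return -(sigma - 1) ** 2 - (psi - 1) ** 2 - (zeta - 1) ** 2

def optimize_gains(state, g_p=250, rng=random):
    """Sample up to G_p gain tuples and keep the best-scoring one."""
    best, best_val = None, float("-inf")
    for _ in range(g_p):
        gains = tuple(rng.uniform(0.0, 2.0) for _ in range(3))
        val = objective(gains, state)
        if val > best_val:
            best, best_val = gains, val
    return best

gains = optimize_gains(state=None, g_p=500)
```

In the actual system, `objective` would integrate the predicted team dynamics for one step and score the result, so its cost, multiplied by G_p candidates, is exactly what drives the on-board runtime trade-offs discussed in Section 7.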

Implementation and evaluation
We want to compare the optimization algorithms from Subsection 6.1. Since we operate in a non-fully-connected network, we use a consensus mechanism, i.e., virtual stigmergy [19]. Using this shared knowledge, each robot computes the components of the control input of every robot in the team (u_i^c, u_i^r, and u_i^d in Equation (7)).
We define O_p ∈ Z^+ as the optimization period and G_p ∈ Z^+ as the number of generated points. Every O_p control steps, every robot optimizes and updates its own set of gains to be used in (7) as follows:

1. A maximum of G_p gain tuples ⟨ψ, σ, ζ⟩ are generated by the optimization algorithm (one of those described in Subsection 6.1).

2. For each tuple, the robot (i) predicts the positions of all other robots at the subsequent time step by integrating (7), and (ii) evaluates the objective function introduced in Equation (14).

These steps were implemented using the Buzz scripting language [16], and simulations were run using the multi-physics environment of ARGoS [15]. We evaluated the performance of the three optimization methodologies in a network of eight two-wheeled robots, and we compared it against the same robot team using constant gains. We screened the Cartesian product (i.e., all combinations) of the following gain assignments:

The results of these simulations are summarized in Table 1 and Figure 1, which presents the evolution of the objective function (14).

This is expected, since a larger number of robots leads to an inevitable increase in the covered area, while the absolute value of the algebraic connectivity is not significantly affected by the number of robots for teams of this size. Nonetheless, Table 2 shows how the optimizer performs as intended, independently of the number of robots in the team and of its hyper-parameters.

Transitioning from simulation to real robots can be challenging and results in performance degradation, especially with resource-constrained hardware [11]. To demonstrate the portability of the proposed online optimization, and to analyze how hardware limitations affect the choice of the optimization parameters (i.e., the number of generated points G_p and the optimization period O_p), we used an actual distributed multi-robot system to test our methodology. The robot team consists of eight two-wheeled, differential-drive K-Team Khepera IV robots, shown in Figure 5.
Each robot is equipped with an 800 MHz ARM Cortex-A8 and the Linux-based Yocto operating system⁴.
A camera-based tracking system consisting of four OptiTrack⁵ Prime 13 cameras (see Figure 5) and the blabbermouth⁶ communication software are combined to emulate range-and-bearing sensors for each robot. The communication infrastructure is based on traditional Wi-Fi and, by integrating the information from the camera-based positioning system, we emulate communication ranges up to a fixed distance R = 60 cm (analogous to the setup used in [11]). All information on board each robot is in local coordinates: OptiTrack sends to every robot the positions of its neighbors in the robot's own local coordinates. The messages that robots send to each other also use the robots' own coordinate systems (and, thus, they have to be transformed on board each receiving robot). The optimization procedure described in Section 6.2 is embedded into the Khepera IV-specific virtual machine bzzkh4⁷ that is used to execute the Buzz byte code of each robotic controller. Using the parameters studied in simulation as a starting point, we determined the optimization times ∆_t for the on-board processing as G_p varies. We obtained ∆_t's of 8'41", 46'47", and 84'23" as runtimes for 400, 2200, and 4000 generated points G_p, respectively. That is, ∆_t increases linearly with G_p and ranges from minutes to hours.
Considering these computational demands, it is sensible to run the online optimization on the Khepera IV every O_p = 50 steps with a G_p of 250 points (∆_t ∼ 2'). Simulations and experimental validation iterate over a fixed number of control steps. The duration of each experiment is set to 500 and 300 such iterations, respectively, and every experiment was repeated starting from four randomly selected initial poses. Due to the potentially varying processing times on each robot, the team of Khepera IVs operates asynchronously. The results obtained by combining the robotic set-up described in Section 7 and the two fault-injection protocols presented in Appendix B are shown in Figure 6.
Among the three columns of Figure 6, the most notable result certainly comes from the central column.
Here, we clearly see how static gains [11] and online optimization produce very different results. In [11], we had noted that the presence of faulty communication could lead the robots to favour λ over A, resulting in more compact formations. In Figure 6, this behaviour is remarkably absent and, albeit deteriorated w.r.t. the fault-free scenario, both λ and A increase over time. In fact, the online adjustment of the control gains appears to facilitate the balance between the two objectives. Finally, in the rightmost column, we observe that, in the presence of hardware failures, A is predictably and inevitably weakened. Yet, both λ and Θ can be driven up by the proposed approach (note that the larger absolute values are justified by the fact that they refer to progressively smaller networks, with fewer than eight robots). Table 3 summarizes the results of nine two-tailed, paired t-tests between the initial and final distributions of the metrics f_obj, λ, and A using the data from

Conclusions

In this article, we experimentally evaluated the methodology proposed in [12] (i.e., the online optimization of resilient multi-robot networks) against faults.

A. Choice of a scalarizing function

To evaluate the impact and effectiveness of our choice of scalarizing function on the overall system performance, we also ran multiple simulations using the following arithmetic combinations of the two performance metrics:

1. The product of the performance metrics λ_2 and A:

f_obj = λ_2(t) · A(t);    (17)

2. The sum of the performance metrics λ_2 and A:

f_obj = λ_2(t) + A(t);    (18)

3. The normalized sum of the performance metrics (with λ_2-tar. and A_tar. set to 2.0 and 5.0, respectively, after preliminary evaluation):

f_obj = λ_2(t)/λ_2-tar. + A(t)/A_tar.    (19)

We remark that this list is clearly non-exhaustive: one could, for example, introduce many more sophisticated scalarizing functions, such as one that evaluates as a step function for λ_2 and linearly (or quadratically) for A. The normalized sum also adds a layer of complexity, as it requires preliminary experiments to estimate the values of λ_2-tar. and A_tar.

As our final goal is to evaluate the performance of autonomous, online optimization in the presence of faults, we opted to use Equation 17 as our preferred scalarizing function, following Dijkstra's opinion that complexity can pose a risk to reliability [52].

B. Fault injection
In this section, we outline the models and procedures that we used to inject faults within our simulations and experimental setup. Faults are meant to reproduce the manufacturing imperfections and other non-idealities afflicting the physical world [53]. We use fault injection to pose a more difficult challenge to the proposed control and optimization methodology.

We model the drops of messages as independent phenomena happening on each communication link at a given rate. The likelihood of a message being dropped is described by a Bernoulli trial with probability mass function:

pmf_Bern(k; p) = p if k = 1 (drop), and 1 − p if k = 0.

Table 4 reports the values of p in different phases of simulations lasting 500 iterations, while Table 5 reports those used in the experiments.

Table 4: Values of the packet drop rate over the development of each simulation
t: 0"-320" | 320"-640" | 640"-960" | 960"-1280" | >1280"
p: 0.0 | 0.2 | 0.4 | 0.6 | 0.8

Robotic hardware failures are characterized by their mean time to failure (MTTF) [55]. In our simulations and experiments, as we did in [11], we use an exponential cumulative distribution function:

CDF_exp(x; β) = 1 − e^(−x/β).

Hence, the MTTF equals the expected value E[X] = β. In practice, the injection of robotic failures was implemented as follows: an initial grace period is granted to all robots; after the grace period ends, each robot's lifetime is regulated by an independent exponential distribution with an MTTF of 300 iterations for simulations and of ∼16' for experiments (60% of the simulation/experiment duration).

The occurrences of the failures, kept unaltered through all the experiments for each initial configuration, are summarized in Table 6 for simulations and in Table 7 for experiments. After a hard failure, a robot stops moving and communicating (in the simulation case) or it is physically removed from the arena (in the experimental case).

Failure time: 7'14" | 8'20" | 9'45" | 15'26" | 18'29"
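The two fault-injection modes can be sketched as follows (our illustration; the rates, grace period, and seed are hypothetical):

```python
import random

def drop_message(p, rng):
    """Bernoulli packet-drop trial on one link: True means the message is lost."""
    return rng.random() < p

def failure_time(mttf, grace, rng):
    """Robot lifetime: a grace period plus an exponential draw with mean MTTF."""
    return grace + rng.expovariate(1.0 / mttf)

rng = random.Random(42)
# Empirical drop rate over many trials approaches the configured p.
drops = sum(drop_message(0.2, rng) for _ in range(10_000)) / 10_000
# A failure instant sampled for one robot (in control steps).
t_fail = failure_time(mttf=300, grace=60, rng=rng)
```

Sampling every robot's failure instant once, before the run, and reusing it across all compared conditions mirrors the "kept unaltered through all the experiments" protocol above and makes the comparisons paired rather than confounded by the fault draw.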
The occurrences of the failures-kept unaltered through all the experiments for each initial configuration-are summarized in Table 6 for simulations and 825 in Table 7 for experiments. After a hard failure, a robot stops moving and communicating, in the simulation case, or it is physically removed from the arena in the experimental case.  Failure time: 7'14" 8'20" 9'45" 15'26" 18'29"