Conference Papers

Enes V, Moreno CB, Almeida PS, Leitão J.  2017.  Borrowing an Identity for a Distributed Counter. PaPoC '17 Proceedings of the 3rd Workshop on the Principles and Practice of Consistency for Distributed Data.
Younes G, Almeida PS, Moreno CB.  2017.  Compact Resettable Counters through Causal Stability. PaPoC '17 Proceedings of the 3rd Workshop on the Principles and Practice of Consistency for Distributed Data.
Gonçalves R, Almeida PS, Moreno CB, Fonte V.  2017.  DottedDB: Anti-Entropy without Merkle Trees, Deletes without Tombstones. 36th IEEE International Symposium on Reliable Distributed Systems.

To achieve high availability in the face of network partitions, many distributed databases adopt eventual consistency, allow temporary conflicts due to concurrent writes, and use some form of per-key logical clock to detect and resolve such conflicts. Furthermore, nodes synchronize periodically to ensure replica convergence, in a process called anti-entropy, normally using Merkle Trees. We present the design of DottedDB, a Dynamo-like key-value store which uses a novel node-wide logical clock framework, overcoming three fundamental limitations of the state of the art: (1) it minimizes the metadata per key necessary to track causality, avoiding its growth even in the face of node churn; (2) it correctly and durably deletes keys, with no need for tombstones; (3) it offers a lightweight anti-entropy mechanism to converge replicated data, avoiding the need for Merkle Trees. We evaluate DottedDB against MerkleDB, an otherwise identical database that uses per-key logical clocks and Merkle Trees for anti-entropy, to precisely measure the impact of the novel approach. Results show that: causality metadata per object rapidly converges to a single id-counter pair; distributed deletes are correctly achieved without global coordination and with constant metadata; and divergent nodes are synchronized faster, with a smaller memory footprint and less communication overhead than with Merkle Trees.

Kassam Z, Shoker A, Almeida PS, Moreno CB.  2017.  Aggregation Protocols in Light of Reliable Communication. The 16th IEEE International Symposium on Network Computing and Applications (NCA 2017).
Moreno CB, Almeida PS, Lerche C.  2016.  The problem with embedded CRDT counters and a solution. PaPoC '16 Proceedings of the 2nd Workshop on the Principles and Practice of Consistency for Distributed Data.
Almeida PS, Shoker A, Moreno CB.  2015.  Efficient State-based CRDTs by Delta-Mutation. In the Proceedings of the International Conference on Networked Systems (NETYS). 9466.

CRDTs are distributed data types that make eventual consistency of a distributed object possible and non-ad-hoc. Specifically, state-based CRDTs ensure convergence by disseminating the entire state, which may be large, and merging it into other replicas; whereas operation-based CRDTs disseminate operations (i.e., small states) assuming an exactly-once reliable dissemination layer. We introduce Delta State Conflict-Free Replicated Datatypes (δ-CRDT), which can achieve the best of both worlds: small messages with an incremental nature, as in operation-based CRDTs, disseminated over unreliable communication channels, as in traditional state-based CRDTs. This is achieved by defining δ-mutators that return a delta-state, typically much smaller than the full state, which is joined to both the local and remote states. We introduce the δ-CRDT framework and explain it by establishing a correspondence to current state-based CRDTs. In addition, we present an anti-entropy algorithm that ensures causal consistency, and we introduce two δ-CRDT specifications of well-known replicated datatypes.
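
As an illustration of the idea (a minimal sketch, not the paper's code), a grow-only counter can be given a δ-mutator that returns only the modified entry; the resulting delta is small, and joining it is idempotent and order-insensitive, so it can travel over unreliable channels:

```python
def join(a, b):
    """Least upper bound of two GCounter states (pointwise max)."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}

def inc_delta(state, replica_id):
    """Delta-mutator: returns only the changed entry, not the full state."""
    return {replica_id: state.get(replica_id, 0) + 1}

def value(state):
    return sum(state.values())

ra = {}                                  # replica "A" increments twice
d1 = inc_delta(ra, "A"); ra = join(ra, d1)
d2 = inc_delta(ra, "A"); ra = join(ra, d2)

rb = join(join({}, d2), d1)              # deltas delivered out of order
rb = join(rb, d1)                        # duplicate delivery is harmless
assert value(ra) == value(rb) == 2
```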

Almeida PS, Moreno CB, Fonte V, Gonçalves R.  2015.  Concise Server-Wide Causality Management for Eventually Consistent Data Stores. DAIS - IFIP International Conference on Distributed Applications and Interoperable Systems.

Large-scale distributed data stores rely on optimistic replication to scale and remain highly available in the face of network partitions. Managing data without coordination results in eventually consistent data stores that allow concurrent data updates. These systems often use anti-entropy mechanisms (like Merkle Trees) to detect and repair divergent data versions across nodes. However, in practice hash-based data structures are too expensive for large amounts of data and create too many false conflicts. Another aspect of eventual consistency is detecting write conflicts. Logical clocks are often used to track data causality, necessary to detect causally concurrent writes on the same key. However, there is a non-negligible metadata overhead per key, which also keeps growing with time, proportionally to the node churn rate. Another challenge is deleting keys while respecting causality: while the values can be deleted, per-key metadata cannot be permanently removed without coordination. We introduce a new causality management framework for eventually consistent data stores that leverages node logical clocks (Bitmapped Version Vectors) and a new key logical clock (Dotted Causal Container) to provide advantages on multiple fronts: 1) a new efficient and lightweight anti-entropy mechanism; 2) greatly reduced per-key causality metadata size; 3) accurate key deletes without permanent metadata.

Almeida PS, Moreno CB, Gonçalves R, Preguiça N, Fonte V.  2014.  Scalable and Accurate Causality Tracking for Eventually Consistent Stores. Distributed Applications and Interoperable Systems. 8460.

In cloud computing environments, data storage systems often rely on optimistic replication to provide good performance and availability even in the presence of failures or network partitions. In this scenario, it is important to be able to accurately and efficiently identify updates executed concurrently. Current approaches to causality tracking in optimistic replication have problems with concurrent updates: they either (1) do not scale, as they require replicas to maintain information that grows linearly with the number of writes or unique clients; or (2) lose information about causality, either by removing entries from client-id-based version vectors or by using server-id-based version vectors, which cause false conflicts. We propose a new logical clock mechanism and a logical clock framework that together support a traditional key-value store API, while capturing causality in an accurate and scalable way, avoiding false conflicts. It maintains concise information per data replica, only linear in the number of replica servers, and allows data replicas to be compared and merged in time linear in the number of replica servers and versions.
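
The separation of a version's identifier (a "dot") from its causal past can be sketched as follows; all names here are illustrative, not the paper's API:

```python
def descends(v2, v1):
    """True iff v2 causally dominates v1: v1's dot is covered by v2's past."""
    (s, n), _ = v1
    _, past = v2
    return past.get(s, 0) >= n           # O(1), no full vector comparison

def concurrent(v1, v2):
    return not descends(v1, v2) and not descends(v2, v1)

# a version is (dot, causal_past): the dot (server_id, counter) names the
# version; the version vector records only what it causally supersedes
v1 = (("A", 1), {})          # first write, handled by server A
v2 = (("A", 2), {"A": 1})    # a write whose read context included v1
v3 = (("B", 1), {})          # a blind write handled by server B

assert descends(v2, v1)      # v2 overwrites v1
assert concurrent(v2, v3)    # true siblings: both must be kept
```

Entries are per server id, so metadata stays bounded by the replication degree rather than the number of clients.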

Moreno CB, Almeida PS, Shoker A.  2014.  Making Operation-Based CRDTs Operation-Based. The International Conference on Distributed Applications and Interoperable Systems - DAIS. 8460:126–140. Best paper (joint).

Conflict-free Replicated Datatypes (CRDT) can simplify the design of eventually consistent systems. They can be classified into state-based or operation-based. Operation-based designs have the potential to allow very compact solutions in both the sent messages and the object state size. Unfortunately, current approaches are still far from this objective. In this paper, we introduce a new ‘pure’ operation-based framework that makes the design and implementation of these CRDTs simpler and more efficient. We show how to leverage the metadata of the messaging middleware to design very compact CRDTs, while disseminating only operation names and their optional arguments.
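
A minimal sketch of the ‘pure’ flavor (our illustration, under the assumed reliable causal-delivery middleware, simulated here by a trivial broadcast loop): replica state is a bare integer and only operation names cross the network:

```python
class PureOpCounter:
    """Replica state is a single integer; only operation names are shipped."""
    def __init__(self):
        self.value = 0

    def effect(self, op):                # applied on delivery at each replica
        self.value += 1 if op == "inc" else -1

def broadcast(op, replicas):
    """Stand-in for the assumed exactly-once causal-delivery middleware."""
    for r in replicas:
        r.effect(op)

r1, r2 = PureOpCounter(), PureOpCounter()
for op in ("inc", "inc", "dec"):
    broadcast(op, (r1, r2))
assert r1.value == r2.value == 1
```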

Moreno CB, Almeida PS, Borges M, Jesus P.  2012.  Spectra: Robust estimation of distribution functions in networks. Distributed Applications and Interoperable Systems. 7272:96–103.

The distributed aggregation of simple aggregates such as minima/maxima, counts, sums and averages has been studied in the past and is an important tool for distributed algorithms and network coordination. Nonetheless, these kinds of aggregates may not be comprehensive enough to characterize biased data distributions or the presence of outliers, making the case for richer estimates.
This work presents Spectra, a distributed algorithm for the estimation of distribution functions over large-scale networks. The estimate is available at all nodes and the technique exhibits important properties: robustness when exposed to high levels of message loss, fast convergence and fine precision in the estimate. It can also dynamically cope with changes of the sampled local property and with churn, without requiring restarts. The proposed approach is experimentally evaluated and contrasted with a competing state-of-the-art distribution aggregation technique.

Moreno CB, Almeida PS, Fonte V, Preguiça N, Gonçalves R.  2012.  Brief announcement: efficient causality tracking in distributed storage systems with dotted version vectors. Proceedings of the symposium on Principles of distributed computing - PODC. :335–336.

Version vectors (VV) are used pervasively to track dependencies between replica versions in multi-version distributed storage systems. In these systems, VV tend to have a dual functionality: identifying a version and encoding causal dependencies. In this paper, we show that by maintaining the identifier of the version separate from the causal past, it is possible to verify causality in constant time (instead of O(n) for VV) and to track causality precisely, with metadata whose size is bounded by the degree of replication rather than by the number of concurrent writers.

Almeida PS, Moreno CB, Cunha A.  2012.  A Fast Distributed Computation of Distances in Networks. IEEE 51st Annual Conference on Decision and Control - CDC. :5215–5220.

This paper presents a distributed algorithm to simultaneously compute the diameter, radius and node eccentricity at all nodes of a synchronous network. Such topological information may be useful as input for configuring other algorithms. Previous approaches have been modular, progressing in sequential phases using building blocks such as BFS tree construction, thus incurring longer executions than strictly required. We present an algorithm that, by timely propagation of available estimations, achieves faster convergence to the correct values. We show local criteria for detecting convergence in each node. The algorithm avoids the creation of BFS trees and simply manipulates sets of node ids and hop counts. For the worst scenario of variable start times, each node i with eccentricity ecc(i) can compute: its eccentricity in diam(G) + ecc(i) + 2 rounds; the diameter in 2·diam(G) + ecc(i) + 2 rounds; and the radius in diam(G) + ecc(i) + 2·radius(G) rounds.
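
A toy synchronous simulation of the underlying idea (an illustrative sketch, not the paper's algorithm): nodes flood (origin, hop-count) pairs, keep the minimum per origin, and the converged distances yield eccentricity, diameter and radius:

```python
def distances(adj):
    """Per-node map origin -> hops, via synchronous min-merge flooding."""
    dist = {u: {u: 0} for u in adj}          # each node knows itself at 0 hops
    for _ in range(len(adj)):                # enough rounds for any diameter
        new = {u: dict(dist[u]) for u in adj}
        for u in adj:
            for v in adj[u]:                 # u hears v's current estimates
                for origin, h in dist[v].items():
                    if h + 1 < new[u].get(origin, float("inf")):
                        new[u][origin] = h + 1
        dist = new
    return dist

adj = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}     # a path of 4 nodes
dist = distances(adj)
ecc = {u: max(dist[u].values()) for u in adj}    # eccentricity per node
assert ecc == {1: 3, 2: 2, 3: 2, 4: 3}
assert max(ecc.values()) == 3                    # diameter
assert min(ecc.values()) == 2                    # radius
```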

Almeida PS, Moreno CB, Farach-Colton M, Jesus P, Mosteiro M.  2011.  Fault-Tolerant Aggregation: Flow-Updating Meets Mass-Distribution. Principles of Distributed Systems. 7109:513–527.

Flow-Updating (FU) is a fault-tolerant technique that has proved to be efficient in practice for the distributed computation of aggregate functions in communication networks where individual processors do not have access to global information. Previous distributed aggregation protocols, based on repeated sharing of input values (or mass) among processors, sometimes called Mass-Distribution (MD) protocols, are not resilient to communication failures (or message loss) because such failures yield a loss of mass.

In this paper, we present a protocol which we call Mass-Distribution with Flow-Updating (MDFU). We obtain MDFU by applying FU techniques to classic MD. We analyze the convergence time of MDFU showing that stochastic message loss produces low overhead. This is the first convergence proof of an FU-based algorithm. We evaluate MDFU experimentally, comparing it with previous MD and FU protocols, and verifying the behavior predicted by the analysis. Finally, given that MDFU incurs a fixed deviation proportional to the message-loss rate, we adjust the accuracy of MDFU heuristically in a new protocol called MDFU with Linear Prediction (MDFU-LP). The evaluation shows that both MDFU and MDFU-LP behave very well in practice, even under high rates of message loss and even changing the input values dynamically.

Gonçalves R, Almeida PS, Moreno CB, Fonte V, Preguiça N.  2011.  Evaluating Dotted Version Vectors in Riak. INForum - Simpósio de Informática. :474–479.

The NoSQL movement is rapidly increasing in importance, acceptance and usage in major (web) applications that need the partition tolerance and availability of the CAP theorem for scalability purposes, thus sacrificing the consistency side. With this approach, paradigms such as Eventual Consistency became more widespread. An eventually consistent system must handle data divergence and conflicts, which have to be carefully accounted for. Some systems have tried to use classic Version Vectors (VV) to track causality, but these reveal either scalability problems or loss of accuracy (when pruning is used to prevent vector growth).

Dotted Version Vectors (DVV) are a novel mechanism for dealing with data versioning in eventually consistent systems, allowing both accurate causality tracking and scalability in the number of clients and servers, while limiting vector size to the replication degree.

In this paper we briefly describe the challenges faced when incorporating DVV into Riak (a distributed key-value store), evaluate its behavior and performance, and discuss the advantages and disadvantages of this specific implementation.

Borges M, Moreno CB, Jesus P, Almeida PS.  2011.  Estimativa Contínua e Tolerante a Faltas de Funções Distribuição Cumulativa em Redes de Larga Escala [Continuous and Fault-Tolerant Estimation of Cumulative Distribution Functions in Large-Scale Networks]. INForum - Simpósio de Informática.

In large-scale decentralized environments, such as wireless sensor networks, P2P networks and others, data collection is practicable only if the data is aggregated. In the current state of the art, the strategies for addressing this problem are unsatisfactory, since they require protocols to be restarted whenever the initial values change or whenever node churn is non-zero. Moreover, no existing strategy supports the estimation of CDFs (cumulative distribution functions). The approach presented in this work shows how cumulative distribution functions can be computed in distributed networks, allowing dynamic tracking of the sampled values without requiring protocol restarts under adverse network conditions (message loss, node churn). This approach is based on a fault-tolerant averaging protocol. This work also contributes a strategy that reduces the communication cost between nodes, based on local decisions and sensitive to the rate of change of the estimates. Simulation results show the resilience of this protocol, allowing continuous estimation of CDFs in the presence of dynamism (message loss, changes of the sampled value, churn). They also show fast convergence of the CDF estimates for different network configurations.

Jesus P, Moreno CB, Almeida PS.  2010.  Fault-Tolerant Aggregation for Dynamic Networks. 29th Symposium on Reliable Distributed Systems (SRDS). :37–43.

Data aggregation is a fundamental building block of modern distributed systems. Averaging-based approaches, commonly designated gossip-based, are an important class of aggregation algorithms, as they allow all nodes to produce a result, converge to any required accuracy, and work independently of the network topology. However, existing approaches exhibit many dependability issues when used in faulty and dynamic environments. This paper extends our own technique, Flow Updating, which is immune to message loss, to operate in dynamic networks, improving its fault-tolerance characteristics. Experimental results show that the new version of Flow Updating vastly outperforms previous averaging algorithms; it self-adapts to churn without requiring any periodic restart, and supports node crashes and high levels of message loss.

Moreno CB, Almeida PS, Menezes R.  2009.  Fast estimation of aggregates in unstructured networks. Fifth International Conference on Autonomic and Autonomous Systems - ICAS. :88–93.

Aggregation of data values plays an important role in distributed computations, in particular over peer-to-peer and sensor networks, as it can provide a summary of some global system property and direct the actions of self-adaptive distributed algorithms. Examples include using estimates of the network size to dimension distributed hash tables, or estimates of the average system load to direct load balancing. Distributed aggregation using non-idempotent functions, like sums, is not trivial, as it is not easy to prevent a given value from being accounted for multiple times; this is especially the case if no centralized algorithms or global identifiers can be used. This paper introduces Extrema Propagation, a probabilistic technique for distributed estimation of the sum of positive real numbers. The technique relies on the exchange of duplicate-insensitive messages and can be applied in flood and/or epidemic settings, where multi-path routing occurs; it is tolerant of message loss; it is fast, as the number of message exchange steps equals the diameter; and it is fully distributed, with no single point of failure and the result produced at every node.
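
The estimator behind Extrema Propagation can be sketched centrally (an illustrative simulation of the statistics, not the distributed protocol): each node draws K exponential variates with rate equal to its positive value, the network combines them by pointwise minimum (idempotent and duplicate-insensitive, hence safe under multi-path routing), and (K−1)/sum(minima) estimates the sum:

```python
import random

def estimate_sum(values, K=1000, seed=42):
    """Estimate sum(values) from K pointwise minima of exponential draws."""
    rng = random.Random(seed)
    minima = [float("inf")] * K
    for v in values:                     # each node contributes K draws
        for j in range(K):
            minima[j] = min(minima[j], rng.expovariate(v))
    return (K - 1) / sum(minima)         # unbiased estimator of the sum

est = estimate_sum([1.0] * 50)           # 50 nodes holding 1 -> network size
assert abs(est - 50) / 50 < 0.15         # relative error ~1/sqrt(K) expected
```

Summing the value 1 at every node, as above, yields a network-size estimate, one of the use cases the abstract mentions.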

Jesus P, Moreno CB, Almeida PS.  2009.  Fault-Tolerant Aggregation by Flow Updating. Distributed Applications and Interoperable Systems. 5523:73–86.

Data aggregation plays an important role in the design of scalable systems, allowing the determination of meaningful system-wide properties to direct the execution of distributed applications. In the particular case of wireless sensor networks, data collection is often only practicable if aggregation is performed. Several aggregation algorithms have been proposed in the last few years, exhibiting different properties in terms of accuracy, speed and communication tradeoffs. Nonetheless, existing approaches are found lacking in terms of fault tolerance. In this paper, we introduce a novel fault-tolerant averaging based data aggregation algorithm. It tolerates substantial message loss (link failures), while competing algorithms in the same class can be affected by a single lost message. The algorithm is based on manipulating flows (in the graph theoretical sense), that are updated using idempotent messages, providing it with unique robustness capabilities. Furthermore, evaluation results obtained by comparing it with other averaging approaches have revealed that it outperforms them in terms of time and message complexity.
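
A simplified synchronous simulation of the flow idea (a hedged sketch under assumed semantics, not the paper's algorithm): each node keeps one flow per neighbor, its estimate is its input minus its outgoing flows, and messages carry flows (state) rather than mass, so a lost message merely delays convergence instead of destroying mass:

```python
def flow_updating(adj, inputs, rounds=60):
    """Synchronous rounds; returns the per-node estimates of the average."""
    flow = {u: {v: 0.0 for v in adj[u]} for u in adj}
    for _ in range(rounds):
        new_flow = {}
        for u in adj:
            # a message from v carries v's flow and estimate (idempotent state)
            f = {v: -flow[v][u] for v in adj[u]}     # adopt symmetric flows
            n_est = {v: inputs[v] - sum(flow[v].values()) for v in adj[u]}
            e_u = inputs[u] - sum(f.values())
            avg = (e_u + sum(n_est.values())) / (len(adj[u]) + 1)
            new_flow[u] = {v: f[v] + avg - n_est[v] for v in adj[u]}
        flow = new_flow
    return {u: inputs[u] - sum(flow[u].values()) for u in adj}

adj = {1: [2], 2: [1, 3], 3: [2]}                 # a 3-node path
est = flow_updating(adj, {1: 3.0, 2: 6.0, 3: 0.0})
assert all(abs(e - 3.0) < 1e-3 for e in est.values())   # average is 3
```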

Moreno CB, Almeida PS, Cardoso J.  2009.  Probabilistic estimation of network size and diameter. Fourth Latin-American Symposium on Dependable Computing - LADC. :33–40.

Determining the size of a network and its diameter are important functions in distributed systems, as there are a number of algorithms which rely on such parameters, or at least on estimates of those values. The Extrema Propagation technique allows the estimation of the size of a network in a fast, distributed and fault-tolerant manner. The technique was previously studied in a simulation setting where rounds advance synchronously and there is no message loss. This work presents two main contributions. The first is the study of the Extrema Propagation technique under asynchronous rounds, integrated in the Network Friendly Epidemic Multicast (NeEM) framework. The second is the evaluation of a diameter estimation technique associated with Extrema Propagation. This study also presents a small enhancement to Extrema Propagation in terms of communication cost and points out some other possible enhancements. Results show that there is a clear trade-off between time and communication that must be considered when configuring the protocol: a faster convergence time implies a higher communication cost. Results also show that it is possible to reduce the total communication cost by more than 18% using a simple approach. The diameter estimation technique is shown to have a relative error of less than 10% even when using a small sample of nodes.

Jesus P, Moreno CB, Almeida PS.  2009.  Using Less Links to Improve Fault-Tolerant Aggregation. 4th Latin-American Symposium on Dependable Computing - LADC. Fast abstract.

Data aggregation plays a fundamental role in the design of scalable distributed applications, allowing the determination of meaningful system-wide properties to direct the execution of the system. For instance, aggregation can be used to estimate: the size of the network, to dimension Distributed Hash Table (DHT) structures or to set a quorum in dynamic settings; the average system load, to guide local load-balancing decisions; or the total network disk space in a P2P sharing system. In the particular case of Wireless Sensor Networks (WSN), due to energy constraints, data collection is often only practicable if aggregation is performed.

Jesus P, Moreno CB, Almeida PS.  2009.  Dependability in Aggregation by Averaging. INForum - Simpósio de Informática. :457–470.

Aggregation is an important building block of modern distributed applications, allowing the determination of meaningful properties (e.g. network size, total storage capacity, average load, majorities, etc.) that are used to direct the execution of the system. In recent years, several approaches have been proposed to compute aggregation functions in distributed settings, exhibiting different characteristics in terms of accuracy, time and communication complexity. However, the majority of existing aggregation algorithms exhibit relevant dependability issues when considering their use in real application environments. In this paper, we reveal some dependability issues of aggregation algorithms based on iterative averaging techniques, giving some directions to solve them. This class of algorithms is considered robust (when compared to common tree-based approaches), being independent of the routing topology used and providing an aggregation result at all nodes. However, their robustness is strongly challenged, and their correctness often compromised, when the assumptions about their working environment are changed to more realistic ones. The correctness of this class of algorithms relies on the maintenance of a fundamental invariant, commonly designated as mass conservation. We argue that this invariant is often broken in practical settings, and that additional mechanisms and modifications are required to maintain it, incurring some degradation of the algorithms' performance. In particular, we discuss the behavior of three representative algorithms (the Push-Sum Protocol [1], the Push-Pull Gossip protocol [2] and Distributed Random Grouping [3]) under asynchronous and faulty (with message loss and node crashes) environments. More specifically, we propose and evaluate two new versions of the Push-Pull Gossip protocol, which solve its message interleaving problem (evidenced even in a synchronous operation mode).
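
The mass-conservation invariant can be seen in a toy Push-Sum simulation (illustrative code, not the algorithms analyzed in the paper): the global sum of the s components stays constant only while every message is delivered, and a single lost message loses mass for good:

```python
import random

def push_sum_round(s, w, adj, rng, loss=0.0):
    """One synchronous round: each node keeps half of (s, w), pushes half."""
    inbox = [(0.0, 0.0)] * len(s)
    for i in range(len(s)):
        s[i], w[i] = s[i] / 2, w[i] / 2          # keep half
        j = rng.choice(adj[i])
        if rng.random() >= loss:                 # the pushed half may be lost
            inbox[j] = (inbox[j][0] + s[i], inbox[j][1] + w[i])
    for i in range(len(s)):
        s[i] += inbox[i][0]
        w[i] += inbox[i][1]

rng = random.Random(1)
adj = [[1, 2], [0, 2], [0, 1]]                   # fully connected triangle
s, w = [3.0, 6.0, 0.0], [1.0, 1.0, 1.0]
for _ in range(50):
    push_sum_round(s, w, adj, rng)
assert abs(sum(s) - 9.0) < 1e-9                  # mass conserved without loss
assert all(abs(s[i] / w[i] - 3.0) < 1e-3 for i in range(3))  # average is 3

s2, w2 = [3.0, 6.0, 0.0], [1.0, 1.0, 1.0]
for _ in range(50):
    push_sum_round(s2, w2, adj, rng, loss=0.2)
assert sum(s2) < 9.0                             # lost messages lose mass
```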

Almeida PS, Moreno CB, Fonte V.  2008.  Interval tree clocks. 12th International Conference on Principles of Distributed Systems - OPODIS. 5401.

Causality tracking mechanisms, such as vector clocks and version vectors, rely on mappings from globally unique identifiers to integer counters. In a system with a well-known set of entities these ids can be preconfigured and given distinct positions in a vector or distinct names in a mapping. Id management is more problematic in dynamic systems with a large and highly variable number of entities, and is worsened when network partitions occur. Present solutions for causality tracking are not appropriate to these increasingly common scenarios. In this paper we introduce Interval Tree Clocks, a novel causality tracking mechanism that can be used in scenarios with a dynamic number of entities, allowing a completely decentralized creation of processes/replicas without the need for global identifiers or global coordination. The mechanism has a variable-size representation that adapts automatically to the number of existing entities, growing or shrinking appropriately. The representation is so compact that the mechanism can even be considered for scenarios with a fixed number of entities, which makes it a general substitute for vector clocks and version vectors.

Almeida PS, Pinto JS, Vilaça M.  2007.  A Tool for Programming with Interaction Nets. Proceedings of RULE.

This paper introduces INblobs, a visual tool developed at Minho for integrated development with Interaction Nets. Most of the existing tools take as input interaction nets and interaction rules represented in a textual format. INblobs is first of all a visual editor that allows users to edit interaction systems (both interaction nets and interaction rules) graphically, and to convert them to textual notation. This can then be used as input to other tools that implement reduction of nets. INblobs also allows the user to reduce nets within the tool, and includes a mechanism that automatically selects the next active pair to be reduced, following one of the given reduction strategies. The paper also describes other features of the tool, such as the creation of rules from pre-defined templates.

Jesus P, Moreno CB, Almeida PS.  2006.  ID Generation in Mobile Environments. Conference on Mobile and Ubiquitous Systems (CSMU).

This work focuses on the ID generation problem in mobile environments. We discuss the suitability of traditional mechanisms and techniques for generating IDs in mobile environments. Based on the “Birthday Problem”, we derive formulas to evaluate ID trust, which is directly related to the number of entities in the system. Estimating the system size proves to be the main difficulty of our approach. To deal with it, we develop a recursive scheme that remains to be evaluated. Alternatively, we also design an aggregation algorithm to estimate the system size, whose results are currently being analyzed.
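
The kind of bound the abstract alludes to (our illustration, not the paper's exact formulas): for k IDs drawn uniformly from a space of size N, the birthday-problem collision probability is approximately 1 − exp(−k(k−1)/2N), so ID trust degrades with the (estimated) number of entities:

```python
import math

def collision_probability(k, N):
    """P(at least one collision) among k uniform IDs from a space of size N."""
    return -math.expm1(-k * (k - 1) / (2.0 * N))   # accurate 1 - e^(-x)

# 128-bit random IDs remain trustworthy even for billions of entities:
assert collision_probability(10**9, 2**128) < 1e-18
# 32-bit IDs already collide with high probability at ~100k entities:
assert collision_probability(10**5, 2**32) > 0.5
```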

Preguiça N, Moreno CB, Martins JL, Shapiro M, Almeida PS, Domingos H, Fonte V, Duarte S, et al.  2005.  FEW: File management for portable devices. International Workshop on Software Support for Portable Storage (IWSSPS).

In recent years, an increasing number of portable devices with large amounts of storage have become widely used. In this paper, we present the early design of the FEW system, a system that aims to ease file management in the new mobile environment. To this end, the system will manage file replicas stored in fixed and portable storage devices. It will provide an automatic mechanism to establish new file replicas by analyzing file system activity. The system will automatically and incrementally synchronize all file replicas exploring the available network connectivity and the availability of portable storage devices. To merge concurrent updates, operational transformation techniques will be used.