Journal Articles

Jesus P, Moreno CB, Almeida PS.  2015.  Flow updating: Fault-tolerant aggregation for dynamic networks. Journal of Parallel and Distributed Computing. 78:53-64. Abstractjpdcfu.pdf

Data aggregation is a fundamental building block of modern distributed systems. Averaging based pproaches, commonly designated gossip-based, are an important class of aggregation algorithms as they allow all nodes to produce a result, converge to any required accuracy, and work independently from the network topology. However, existing approaches exhibit many dependability issues when used in faulty and dynamic environments. This paper describes and evaluates a fault tolerant distributed aggregation technique, Flow Updating, which overcomes the problems in previous averaging approaches and is able to operate on faulty dynamic networks. Experimental results show that this novel approach outperforms previous averaging algorithms; it self-adapts to churn and input value changes without requiring any periodic restart, supporting node crashes and high levels of message loss, and works in asynchronous networks. Realistic concerns have been taken into account in evaluating Flow Updating, like the use of unreliable failure detectors and asynchrony, targeting its application to realistic environments.

Moreno CB, Jesus P, Almeida PS.  2015.  A Survey of Distributed Data Aggregation Algorithms. IEEE Communications Surveys and Tutorials. 17(1):381-404. Abstract1110.0725.pdf

Distributed data aggregation is an important task, allowing the decentralized determination of meaningful global properties, which can then be used to direct the execution of other applications. The resulting values are derived by the distributed computation of functions like Count, Sum, and Average. Some application examples deal with the determination of the network size, total storage capacity, average load, majorities and many others. In the last decade, many different approaches have been proposed, with different trade-offs in terms of accuracy, reliability, message and time complexity. Due to the considerable amount and variety of aggregation algorithms, it can be difficult and time consuming to determine which techniques will be more appropriate to use in specific settings, justifying the existence of a survey to aid in this task. This work reviews the state of the art on distributed data aggregation algorithms, providing three main contributions. First, it formally defines the concept of aggregation, characterizing the different types of aggregation functions. Second, it succinctly describes the main aggregation techniques, organizing them in a taxonomy. Finally, it provides some guidelines toward the selection and use of the most relevant techniques, summarizing their principal characteristics.

D H, D J, D L, Moreno CB.  2014.  Link Community Detection Using Generative Model and Nonnegative Matrix Factorization. 9(1) Abstractjournal.pone_.0086899.pdf

Discovery of communities in complex networks is a fundamental data analysis problem with applications in various domains. While most of the existing approaches have focused on discovering communities of nodes, recent studies have shown the advantages and uses of link community discovery in networks. Generative models provide a promising class of techniques for the identification of modular structures in networks, but most generative models mainly focus on the detection of node communities rather than link communities. In this work, we propose a generative model, which is based on the importance of each node when forming links in each community, to describe the structure of link communities. We proceed to fit the model parameters by taking it as an optimization problem, and solve it using nonnegative matrix factorization. Thereafter, in order to automatically determine the number of communities, we extend the above method by introducing a strategy of iterative bipartition. This extended method not only finds the number of communities all by itself, but also obtains high efficiency, and thus it is more suitable to deal with large and unexplored real networks. We test this approach on both synthetic benchmarks and real-world networks including an application on a large biological network, and compare it with two highly related methods. Results demonstrate the superior performance of our approach over competing methods for the detection of link communities.

Jin D, He D, Yang B, Moreno CB, Hu Q.  2013.  Extending a configuration moedel to find communities in complex networks. Journal of Statistical Mechanics: Theory and Experiment. 2013(9):P09013. Abstract2013.08.11.doc.pdf

Discovery of communities in complex networks is a fundamental data analysis task in various domains. Generative models are a promising class of techniques for identifying modular properties from networks, which has been actively discussed recently. However, most of them cannot preserve the degree sequence of networks, which will distort the community detection results. Rather than using a blockmodel as most current works do, here we generalize a configuration model, namely, a null model of modularity, to solve this problem. Towards decomposing and combining sub-graphs according to the soft community memberships, our model incorporates the ability to describe community structures, something the original model does not have. Also, it has the property, as with the original model, that it fixes the expected degree sequence to be the same as that of the observed network. We combine both the community property and degree sequence preserving into a single unified model, which gives better community results compared with other models. Thereafter, we learn the model using a technique of nonnegative matrix factorization and determine the number of communities by applying consensus clustering. We test this approach both on synthetic benchmarks and on real-world networks, and compare it with two similar methods. The experimental results demonstrate the superior performance of our method over competing methods in detecting both disjoint and overlapping communities.

Liu D, Jin D, Moreno CB, He D, Yang B, Yu Q.  2013.  Genetic Algorithm with a Local Search Strategy for Discovering Communities in Complex Networks. International Journal of Computational Intelligence Systems. 6(2):354-369. Abstract1303.5909.pdf

In order to further improve the performance of current genetic algorithms aiming at discovering communities, a local search based genetic algorithm GALS is here proposed. The core of GALS is a local search based mutation technique. In order to overcome the drawbacks of traditional mutation methods, the paper develops the concept of marginal gene and then the local monotonicity of modularity function Q is deduced from each nodes local view. Based on these two elements, a new mutation method combined with a local search strategy is presented. GALS has been evaluated on both synthetic benchmarks and several real networks, and compared with some presently competing algorithms. Experimental results show that GALS is highly effective and efficient for discovering community.

Moreno CB, Almeida PS, Menezes R, Jesus P.  2012.  Extrema propagation: Fast Distributed Estimation of Sums and Network Sizes. IEEE Transactions on Parallel and Distributed Systems. 23(4):668–675. Abstract10.1.1.65.6474.pdf

Aggregation of data values plays an important role on distributed computations, in particular, over peer-to-peer and sensor networks, as it can provide a summary of some global system property and direct the actions of self-adaptive distributed algorithms. Examples include using estimates of the network size to dimension distributed hash tables or estimates of the average system load to direct load balancing. Distributed aggregation using nonidempotent functions, like sums, is not trivial as it is not easy to prevent a given value from being accounted for multiple times; this is especially the case if no centralized algorithms or global identifiers can be used. This paper introduces Extrema Propagation, a probabilistic technique for distributed estimation of the sum of positive real numbers. The technique relies on the exchange of duplicate insensitive messages and can be applied in flood and/or epidemic settings, where multipath routing occurs; it is tolerant of message loss; it is fast, as the number of message exchange steps can be made just slightly above the theoretical minimum; and it is fully distributed, with no single point of failure and the result produced at every node.

Jin D, Yang B, Moreno CB, Liu D, He D, Liu J.  2011.  A Markov random walk under constraint for discovering overlapping communities in complex networks. Journal of Statistical Mechanics: Theory and Experiment. 2011(P05031):21. Abstract1303.5675.pdf

Detection of overlapping communities in complex networks has motivated recent research in the relevant fields. Aiming this problem, we propose a Markov dynamics based algorithm, called UEOC, which means, 'unfold and extract overlapping communities'. In UEOC, when identifying each natural community that overlaps, a Markov random walk method combined with a constraint strategy, which is based on the corresponding annealed network (degree conserving random network), is performed to unfold the community. Then, a cutoff criterion with the aid of a local community function, called conductance, which can be thought of as the ratio between the number of edges inside the community and those leaving it, is presented to extract this emerged community from the entire network. The UEOC algorithm depends on only one parameter whose value can be easily set, and it requires no prior knowledge on the hidden community structures. The proposed UEOC has been evaluated both on synthetic benchmarks and on some real-world networks, and was compared with a set of competing algorithms. Experimental result has shown that UEOC is highly effective and efficient for discovering overlapping communities.

Shapiro M, Preguiça N, Moreno CB, Zawirsky M, Fatourou P, others.  2011.  Convergent and Commutative Replicated Data Types. Bulletin of the European Association for Theorical Computer Science. :67–88. Abstract120-477-1-pb.pdf

Eventual consistency aims to ensure that replicas of some mutable shared object converge without foreground synchronisation. Previous approaches to eventual consistency are ad-hoc and error-prone. We study a principled approach: to base the design of shared data types on some simple formal conditions that are sufficient to guarantee eventual consistency. We call these types Convergent or Commutative Replicated Data Types (CRDTs). This paper formalises asynchronous object replication, either state based or operation based, and provides a sufficient condition appropriate for each case. It describes several useful CRDTs, including container data types supporting both add and remove operations with clean semantics, and more complex types such as graphs and monotonic DAGs. It discusses some properties needed to implement non-trivial CRDTs.

Lopes N, Moreno CB.  2010.  Building Inverted Indexes Using Balanced Trees Over DHT Systems. EuroSys 2006 Poster. Abstract10.1.1.163.9683.pdf

Distributed Hash Table (DHT) systems are scalable and efficient data structures for object storage and location using a simple put/get interface. These systems place objects over a very large set of hosts using a multitude of algorithms in order to distribute objects uniformly among hosts using logarithmic (or lower) costs for routing table sizes and message hops [1, 2]. However, these systems assume that object size (storage load) and popularity (communication load) follow an uniform distribution. When unbalanced data is used on a DHT, hotspots are created at some specific (random) hosts. Although one might argue that storage is not a critical resource, due to the current trend on secondary storage capacity, storing such large objects creates network bottlenecks, which in turn may limit data availability.

Almeida PS, Moreno CB, Preguiça N, Hutchison D.  2007.  Scalable bloom filters. Information Processing Letters. 101:255–261. Abstractdbloom.pdf

Bloom Filters provide space-efficient storage of sets at the cost of a probability of false positives on membership queries. The size of the filter must be defined a priori based on the number of elements to store and the desired false positive probability, being impossible to store extra elements without increasing the false positive probability. This leads typically to a conservative assumption regarding maximum set size, possibly by orders of magnitude, and a consequent space waste. This paper proposes Scalable Bloom Filters, a variant of Bloom Filters that can adapt dynamically to the number of elements stored, while assuring a maximum false positive probability.

Almeida PS, Almeida PS, Moreno CB.  2004.  Bounded version vectors. Distributed Computing - Lecture Notes in Computer Science. 3274:102–116. Abstractdisc04.pdf

Version vectors play a central role in update tracking under optimistic distributed systems, allowing the detection of obsolete or inconsistent versions of replicated data. Version vectors do not have a bounded representation; they are based on integer counters that grow indefinitely as updates occur. Existing approaches to this problem are scarce; the mechanisms proposed are either unbounded or operate only under specific settings. This paper examines version vectors as a mechanism for data causality tracking and clarifies their role with respect to vector clocks. Then, it introduces bounded stamps and proves them to be a correct alternative to integer counters in version vectors. The resulting mechanism, bounded version vectors, represents the first bounded solution to data causality tracking between replicas subject to local updates and pairwise symmetrical synchronization.

Moreno CB, Lopes N.  2003.  Towards peer-to-peer content indexing. Operating Systems Review. 37:90–96. Abstractp2pcontent.pdf

Distributed Hash Tables are the core technology on a significant share of system designs for Peer-to-Peer information sharing. Typically, a location mechanism is provided and object identifiers act as keys in the index of object locations. When introducing a search mechanism, where single words are used as keys, the key image cardinality will be driven by the word popularity and most of the present designs will be unable to load balance the index among the nodes. We present two contributions: A design that allows participating nodes to load balance the indexing of popular keys and avoid content hot-spots on single nodes; A distributed mechanism for probabilistic filtering of popular keys (with low search relevance) that paves the way for scalable full content indexing.

Moreno CB, Moura F.  1999.  Using structural characteristics for autonomous operation. Operating Systems Review. 33:90–96. Abstractstructural.pdf

The majority of current mobile computing systems operate either in conjunction with a central network by some form of weak connectivity or tend to operate in total isolation and perform sporadic synchronization with a backup or a central network. These configurations miss an additional and very useful pattern of operation --- mobile to mobile interaction. Recent mobile devices have the capacity for direct communication among them, but this option is essentially neglected by the application software.In order to address this pattern of operation we believe that there is a need to support re-usable peer-to-peer synchronization mechanisms that both respects data ownership and enables some level of state reconciliation.Naming this operation pattern as autonomous operation, we can observe that this pattern is already found on many legacy applications deployed in distributed systems. For example, personal information managers, Mail/News readers and Web browsers, often store persistent state in local files, but tacitly assume a single copy. Noticing that these separate copies are in fact replicas of a distributed entity, leads to the creation of semantically knowledgeable file synchronizers that strive to restore an unified state from these replicas.Evolution from static distributed systems to mobile platforms raises a demand for applications that, not only are adapted to user mobility but, take advantage of it. It is clear that despite continuous improvements on connectivity support for mobile environments, the cost and coverage limits still imply a major share of disconnected operation. When connectivity does exist it usually interposes wide area networks between communication peers, when one party is on the road, leading to lower channel quality. On the other hand, user mobility is likely to conduce to, normally unforeseen, physical proximity of the user's mobile computer with other mobile or fixed systems. This occurrence is likely to increase as the installed population of mobile devices increases.In this work we show that without imposing restrictions on availability, which is a crucial factor for personal applications, it is possible to enable some data sharing among autonomous mobile applications. This sharing would take advantage of any pairwise encounters of replica holders.To determine the level of sharing that is compatible with permanent availability, we model general purpose data types that provide the necessary reconciliation guarantees. These guarantees are obtained by placing restrictions on the allowed behavior in order to avoid the occurrence of conflicting concurrent operations that would prevent reconciliations. Among other uses, these data types should help to identify sharable segments of data on classes of applications that traditionally support no sharing at all, and identify which parts of the state can be effectively shared.In the next section we present some examples of sharable data that motivates the modeling of a more generic and higher level description. This description is presented in the third section together with a framework of convergent components. Section four builds on this framework and gives a general presentation of a Java implementation for a component hierarchy. Before presenting the conclusions we show how these tools where used to build a merger for pairs of bookmark files, giving some insight on how to combine the components to create concrete applications.

Sousa AL, Coutinho A, Moreno CB, Moura F, Oliveira JP, Pereira JO.  1999.  Broms : Gestão Uniforme de um Parque Computacional Multi-Plataforma. Ingenium. Abstractbroms.pdf

O crescimento dos parques de máquinas pessoais levanta consideráveis problemas de administração, contrastando com o que o ocorre com recursos centralizados. Nenhuma das soluções existentes para o efeito apresenta um compromisso aceitável entre a liberdade de configuração que se espera de uma máquina pessoal e o controlo eficiente dos recursos resultante de uma gestão centralizada. Neste contexto propõe se uma solução deste dilema através da coordenação de um sistema de boot remoto avançado com um conjunto de serviços de rede. A aplicação deste sistema à gestão e manutenção de laboratórios pedagógicos demonstrou que se pode assim criar um ambiente de ensino muito mais fiável e flexível do que o tradicional.

Moreno CB, Moura F.  1998.  Improving causality logging in mobile computing networks. Mobile Computing and Communications Review. 2:62–66. Abstract10.1.1.41.8136.pdf

This paper builds on previous techniques for efficient causality logging in mobile networks and presents a lighter logging mechanism. The technique is based on a particular partial order that is generated by the interleaving of events on mobile hosts that are mediated by the same support station. I. Introduction Mobile computing systems are frequently designed as a network of fixed nodes, mobile support stations (MSS), that give connectivity to a set of mobile hosts (MH) [1]. The MSSs are interconnected by a high bandwidth wire-line network where communication costs are inexpensive. In contrast, the MHs always communicate with the mediation of a hosting MSS using a low-bandwidth wireless or phone channel where costs are at issue. This class of mobile computing systems, although excluding direct communication among MHs, models a vast range of existing systems that include wireless networks of MHs bound to local cells, and nomadic MHs that bind to different MSSs as access points to a wide area network. Distributed applications that build on this class of mobile computing systems are often modeled as a set of concurrent activities distributed among different MHs. Tracking the causal relationships among these concurrent activities is a basic mechanism for the analysis and debugging of distributed applications and a step towards the design of message delivery and replica consistency policies. It is well established [7] that in a distributed system the causal dependency can be fully characterized by the use of vector clocks [7, 2, 5]. However vector clocks are very sensitive to scalability issues since the vector size is bound to the number of activities, which are here bound to the number of MHs. Clearly, a dependency tracking mechanism that is bound to the number of MSSs an.

Moreno CB, Moura F.  1994.  Concurrency annotations in C++. ACM SigPlan Notices. 29:61–67. Abstract10.1.1.41.4693.pdf

This paper describes CA/C++, Concurrency Annotations in C++, a language extension that regulates method invocations from multiple threads of execution in a shared-memory multiprocessor system. This system provides threads as an orthogonal element to the language, allowing them to travel through more than one object. Statically type-ckecked synchronous and asynchronous method invocations are supported, with return values from asynchronous invocations accessed through first class future-like objects. Method invocations are regulated with synchronization code defined in a separate class hierarchy, allowing separate definition and inheritance of synchronization mechanisms. Each method is protected by an access flag that can be switched in pre and post-actions, and by a predicate. Both must evaluate to true in order to enable a thread to animate the method code. Flags and method predicates are independently redefinable along the inheritance chain, thus avoiding the inheritance anomaly.