Correia A, Pereira JO, Rodrigues L, Oliveira R, Carvalho N.  2010.  Practical Database Replication. Replication: Theory and Practice. Abstract

This chapter illustrates how the concepts and algorithms described earlier in this book can be used to build practical database replication systems. This is achieved first by addressing architectural challenges on how required functionality is provided by generally available software componentes and then how different components can be efficiently integrated. A second set of practical challenges arises from experience on how performance assumptions map to actual environments and real workloads. The result is a generic architecture for replicated database management systems, focusing on the interfaces between key components, and then on how different algorithmic and practical optimization options map to real world gains. This shows how consistent database replication is achievable in the current state of the art.

Leitão J, Pereira JO, Rodrigues L.  2010.  Gossip-based broadcast. Handbook of Peer-to-Peer Networking. Abstract

Gossip, or epidemic, protocols have emerged as a powerful strategy to implement highly scalable and resilient reliable broadcast primitives on large scale peer-to-peer networks. Epidemic protocols are scalable because they distribute the load among all nodes in the system and resilient because they have an intrinsic level of redundancy that masks node and network failures. This chapter provides an introduction to gossip-based broadcast on large-scale unstructured peer-to-peer overlay networks: it surveys the main results in the field, discusses techniques to build and maintain the overlays that support efficient dissemination strategies, and provides an in-depth discussion and experimental evaluation of two concrete protocols, named HyParView and Plumtree.

Leitão J, Carvalho N, Pereira JO, Oliveira R, Rodrigues L.  2010.  On Adding Structure to Unstructured Overlay Networks. Handbook of Peer-to-Peer Networking. Abstract

Unstructured peer-to-peer overlay networks are very resilient to churn and topology changes, while requiring little maintenance cost. Therefore, they are an infrastructure to build highly scalable large-scale services in dynamic networks. Typically, the overlay topology is defined by a peer sampling service that aims at maintaining, in each process, a random partial view of peers in the system. The resulting random unstructured topology is suboptimal when a specific performance metric is considered. On the other hand, structured approaches (for instance, a spanning tree) may optimize a given target performance metric but are highly fragile. In fact, the cost for maintaining structures with strong constraints may easily become prohibitive in highly dynamic networks. This chapter discusses different techniques that aim at combining the advantages of unstructured and structured networks. Namely we focus on two distinct approaches, one based on optimizing the overlay and another based on optimizing the gossip mechanism itself.

Carvalho N, Oliveira JP, Pereira JO.  2009.  Evaluating Throughput Stability of Protocols for Distributed Middleware. On the Move to Meaningful Internet Systems: OTM 2009. :600–613. Abstract

Communication of large data volumes is a core functionality of distributed systems middleware, namely, for interconnecting components, for distributed computation and for fault tolerance. This common functionality is how- ever achieved in different middleware platforms with various combinations of operating system and application level protocols, both standardized and ad hoc, and including implementations on managed runtime environments such as Java.
In this paper, in contrast with most previous work that focus on performance, we point out that architectural and implementation decisions have an impact in throughput stability when the system is heavily loaded, precisely when such sta- bility is most important. In detail, we present an experimental evaluation of several communication protocol components under stress conditions and conclude on the relative merits of several architectural options.

Pereira JO, Rodrigues L, Oliveira R.  2002.  Semantically Reliable Broadcast: Sustaining High Throughput in Reliable Distributed Systems. Concurrency in Dependable Computing. Abstract

Replicated services are often required to sustain high loads of multiple concurrent requests. This requirement is hard to balance with strong consistency. Typically, to ensure inter-replica consistency, all replicas should receive all updates. Unfortunately, in this case, a single slow replica may degrade the performance of the whole system. This paper proposes a novel reliable broadcast primitive that uses semantic knowledge to weaken reliable delivery guarantees while, at the same time, ensuring strong consistency at the semantic level. By allowing some obsolete messages to be dropped, the protocol that implements this primitive is able to sustain a higher throughput than a fully reliable broadcast protocol. The usefulness of the primitive and the performance of the protocol are illustrated through a concrete example.

Coelho F, Paulo J, Vilaça R, Pereira JO, Oliveira R.  2017.  HTAPBench: Hybrid Transactional and Analytical Processing Benchmark. Proceedings of the 8th ACM/SPEC on International Conference on Performance Engineering. :293–304. Abstract
Gonçalves RC, Pereira JO, Jimenez-Peris R.  2016.  An RDMA Middleware for Asynchronous Multi-stage Shuffling in Analytical Processing. DAIS '16: Proceedings of the 16th IFIP International Conference on Distributed Applications and Interoperable Systems. Abstract

A key component in large scale distributed analytical processing is shuffling, the distribution of data to multiple nodes such that the computation can be done in parallel. In this paper we describe the design and implementation of a communication middleware to support data shuffling for executing multi-stage analytical processing operations in parallel. The middleware relies on RDMA (Remote Direct Memory Access) to provide basic operations to asynchronously exchange data among multiple machines. Experimental results show that the RDMA-based middleware developed can provide a 75 % reduction of the costs of communication operations on parallel analytical processing tasks, when compared with a sockets middleware.

Cruz F, Maia F, Matos M, Oliveira R, Paulo J, Pereira JO, Vilaça R.  2016.  Resource Usage Prediction in Distributed Key-Value Datastores. Distributed Applications and Interoperable Systems: 16th IFIP WG 6.1 International Conference, DAIS 2016, Held as Part of the 11th International Federated Conference on Distributed Computing Techniques, DisCoTec 2016, Heraklion, Crete, Greece, June 6-9, 2. :144–159. Abstract


Kolev B, Bondiombouy C, Valduriez P, Jimenez-Peris R, Pau R, Pereira JO.  2016.  The CloudMdsQL Multistore System. Proceedings of the 2016 International Conference on Management of Data. :2113–2116. Abstract11.pdf


Kolev B, Bondiombouy C, Levchenko O, Valduriez P, Jimenez-Peris R, Pau R, Pereira JO.  2016.  Design and Implementation of the CloudMdsQL Multistore System. Proceedings of the 6th International Conference on Cloud Computing and Services Science. :352-359. Abstract10.pdf


Jorge T, Maia F, Matos M, Pereira JO, Oliveira R.  2015.  Practical Evaluation of Large Scale Applications. Proceedings of the 15th IFIP International Conference on Distributed Applications and Interoperable Systems (DAIS). :124–137. Abstractdais15minhajsr.pdf

Designing and implementing distributed systems is a hard endeavor, both at an abstract level when designing the system, and at a concrete level when implementing, debugging and evaluating it. This stems not only from the inherent complexity of writing and reasoning about distributed software, but also from the lack of tools for testing and evaluating it under realistic conditions. Moreover, the gap between the protocols’ specifications found on research papers and their implementations on real code is huge, leading to inconsistencies that often result in the implementation no longer following the specification. As an example, the specification of the popular Chord DHT comprises a few dozens of lines, while its Java implementation, OpenChord, is close to twenty thousand lines, excluding libraries. This makes it hard and error prone to change the implementation to reflect changes in the specification, regardless of programmers’ skill. Besides, critical behavior due to the unpredictable interleaving of operations and network uncertainty, can only be observed on a realistic setting, limiting the usefulness of simulation tools. We believe that being able to write an algorithm implementation very close to its specification, and evaluating it in a real environment is a big step in the direction of building better distributed systems. Our approach leverages the MINHA platform to offer a set of built in primitives that allows one to program very close to pseudo-code. This high level implementation can interact with off-the-shelf existing middleware and can be gradually replaced by a production-ready Java implementation. In this paper, we present the system design and showcase it using a well-known algorithm from the literature.

Guimarães P, Pereira JO.  2015.  X-Ray: Monitoring and analysis of distributed database queries. Distributed Applications and Interoperable Systems (DAIS, with DisCoTec). Abstractgp15.pdf

The integration of multiple database technologies, including both SQL and NoSQL, allows using the best tool for each aspect of a complex problem and is increasingly sought in practice. Unfortunately, this makes it difficult for database developers and administrators to obtain a clear view of the resulting composite data processing paths, as they combine operations chosen by different query optimisers, implemented by different software packages, and partitioned across distributed systems.

This work addresses this challenge with the X-Ray framework, that allows monitoring code to be added to a Java-based distributed system by manipulating its bytecode at runtime. The resulting information is collected in a NoSQL database and then processed to visualise data processing paths as required for optimising integrated database systems. This proposal is demonstrated with a distributed query over a federation of Apache Derby database servers and its performance evaluated with the standard TPC-C benchmark workload.

Neves F, Pereira JO, Vilaça R, Oliveira R.  2015.  Otimização do HBase para dados estruturados. INForum 2015. :235-247. Abstractpscan.pdf

Os sistemas NoSQL escalam melhor que os tradicionais sistemas relacionais, motivando a migração de inúmeras aplicações para sistemas NoSQL mesmo quando não se tira partido da estrutura de dados dinâmica por eles fornecida. Porém, a consulta destes dados estruturados tem um custo adicional que deriva da flexibilidade dos sistemas NoSQL. Neste documento, analisa-se esse custo no Apache HBase e apresenta-se o Prepared Scan, uma operação adicional de consulta que visa tirar partido do conhecimento da estrutura de dados por parte da aplicação, diminuindo assim o custo associado à consulta de dados estruturados. Recorrendo à ferramenta de benchmarking YCSB, verifica-se que esta solução tem um aumento no débito de aproximadamente 29%.

Matos M, Mercier H, Felber P, Oliveira R, Pereira JO.  2015.  EpTO: An Epidemic Total Order Algorithm for Large-Scale Distributed Systems. Proceedings of the 16th Annual Middleware Conference. :100–111. Abstractp100-matos.pdf


Paulo J, Pereira JO.  2014.  Distributed Exact Deduplication for Primary Storage Infrastructures. In proceedings of Distributed Applications and Interoperable Systems - IFIP. Abstractpp14a.pdf

Deduplication of primary storage volumes in a cloud computing environment is increasingly desirable, as the resulting space savings contribute to the cost effectiveness of a large scale multi-tenant infrastructure. However, traditional archival and backup deduplication systems impose prohibitive overhead for latency-sensitive applications deployed at these infrastructures while, current primary deduplication systems rely on special cluster filesystems, centralized components, or restrictive workload assumptions.

We present DEDIS, a fully-distributed and dependable system that performs exact and cluster-wide background deduplication of primary storage. DEDIS does not depend on data locality and works on top of any unsophisticated storage backend, centralized or distributed, that exports a basic shared block device interface. The evaluation of an open-source prototype shows that DEDIS scales out and adds negligible overhead even when deduplication and intensive storage I/O run simultaneously.