
Distributed Computing Concepts for Data Engineers

 
November 15, 2019 by Karandeep Kaur
Categories: Big Data, Data Science and Business Analytics

1.1 The Traditional Client–Server Processing Pattern

In the traditional client–server pattern, data is pulled from the server to the client for processing. This works well for small-to-medium data sets, but it breaks down at scale: fetching 1 TB of data over the network can take longer than an hour (a rough back-of-the-envelope calculation is sketched below).
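
As a rough illustration (the 1 Gbps link speed assumed here is not stated in the original), transfer time alone already exceeds two hours before any processing starts:

```python
# Back-of-the-envelope: time to pull 1 TB to a single client.
# The 1 Gbps link speed is an assumption used only for illustration.
data_size_bytes = 1 * 10**12          # 1 TB
link_speed_bytes_per_s = 1e9 / 8      # 1 Gbps ~= 125 MB/s

transfer_seconds = data_size_bytes / link_speed_bytes_per_s
print(f"~{transfer_seconds / 3600:.1f} hours")  # ~2.2 hours, before any processing
```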

1.2 Enter Distributed Computing

We need a new approach: distributed computing, where processing is done on a cluster of computers.

Distributed computing, however, introduces challenges of its own: process (software) and hardware failures, data consistency, high availability of the system, and the need for distributed processing optimizations.

1.3 Data Physics

Distributed data processing is governed by inherent "data physics", which has two main aspects:

  • Data Locality (a.k.a. Distributed Computing Economics)

  • The CAP Theorem outcomes

1.4 Data Locality (Distributed Computing Economics)

  • To improve distributed system performance, some computational frameworks (e.g. Hadoop’s MapReduce) start a computation task on the node where the data to be worked on is stored; this is the data locality principle. Bringing the computation to the data avoids unnecessary network transfers. Massively parallel processing is achieved by splitting the workload into units of work (independent computation tasks) and intelligently distributing them across the available worker nodes. “Data locality” (collocation of data with the compute node) underlies the principles of Distributed Computing Economics; this model is also referred to as “intelligent job scheduling”. A minimal scheduling sketch follows below.
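
As a minimal sketch of locality-aware scheduling (the block IDs, node names, and placement map below are illustrative assumptions, not part of any specific framework), a scheduler can prefer a worker that already stores a task's input block and fall back to any free worker otherwise:

```python
# Toy locality-aware scheduler: prefer a node that already holds the block.
# Block IDs, node names and the placement map are illustrative assumptions.
block_locations = {
    "block-1": {"node-a", "node-b"},   # nodes holding replicas of block-1
    "block-2": {"node-b", "node-c"},
    "block-3": {"node-a", "node-c"},
}
free_nodes = {"node-a", "node-b", "node-c"}

def assign(block_id):
    """Pick a node for the task reading block_id, preferring local nodes."""
    local = block_locations[block_id] & free_nodes
    node = next(iter(local or free_nodes))    # fall back to any free node
    free_nodes.discard(node)
    return node

for block in block_locations:
    print(block, "->", assign(block))
```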

1.5 The CAP Theorem

The CAP theorem was formulated by Eric Brewer (see his PODC keynote: http://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf).

It states that any distributed computer system can have at most two of three desirable properties:

  • C – consistency is equivalent to having a single up-to-date copy of the data; all clients always have the same view of the data.

  • A – high availability of the data; all the clients can always read and write.

  • P – tolerance to network partitions (in distributed system deployments).

1.6 Mechanisms to Guarantee a Single CAP Property

Some of the mechanisms used to guarantee a single CAP property are listed below (a sketch of optimistic locking follows the list):

  • Consistency – pessimistic locking

  • Availability – optimistic locking

  • Partition tolerance – two-phase commit
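
As a minimal sketch of the optimistic-locking idea (the record layout and version field below are illustrative assumptions), a writer reads a version number, does its work, and commits only if the version has not changed in the meantime:

```python
# Toy optimistic locking: a commit succeeds only if the record's version
# is unchanged since it was read. The record layout is an assumption.
class VersionConflict(Exception):
    pass

store = {"account-42": {"value": 100, "version": 1}}

def read(key):
    rec = store[key]
    return rec["value"], rec["version"]

def commit(key, new_value, expected_version):
    rec = store[key]
    if rec["version"] != expected_version:     # someone else wrote first
        raise VersionConflict(key)
    rec["value"] = new_value
    rec["version"] += 1

value, version = read("account-42")
commit("account-42", value + 50, version)      # succeeds
try:
    commit("account-42", 0, version)           # stale version -> rejected
except VersionConflict:
    print("retry with a fresh read")
```

Pessimistic locking, by contrast, would block the second writer up front instead of rejecting it at commit time.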

Big Data systems almost always partition data, leaving designers with a choice between data consistency and availability.

1.7 Eventual Consistency

Availability is often achieved with data caching or similar mechanisms, which may result in a stale or inconsistent view of the data. Strong consistency is a critical property of financial transactional systems, where any data inconsistency is automatically treated as an error. In many other business scenarios, companies prioritize availability over data consistency, letting consistency take its course at a later time; this is the case of eventual consistency (a small sketch follows below).
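
As a minimal sketch of eventual consistency (the two-replica, last-write-wins scheme below is an illustrative assumption, not any particular product's protocol), a replica may serve stale reads until an anti-entropy sync brings it back in line:

```python
# Toy last-write-wins replication: a write lands on one replica first,
# so reads from the other replica may be stale until sync() runs.
replica_a, replica_b = {}, {}

def write(replica, key, value, ts):
    replica[key] = (value, ts)                 # store the value with its timestamp

def sync(r1, r2):
    """Anti-entropy pass: both replicas keep the newest value per key."""
    for key in set(r1) | set(r2):
        newest = max(r1.get(key, (None, -1)), r2.get(key, (None, -1)),
                     key=lambda pair: pair[1])
        r1[key] = r2[key] = newest

write(replica_a, "cart", "2 items", ts=1)
print(replica_b.get("cart"))   # None: replica B has not seen the write yet
sync(replica_a, replica_b)
print(replica_b.get("cart"))   # ('2 items', 1): replicas have converged
```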

1.8 Summary

In this tutorial, we discussed the following topics:

  • The need for distributed computing

  • Data physics

  • Eventual consistency
