How to Use Resilience4j to Implement Circuit Breaker?

This tutorial is adapted from the Web Age course Mastering Microservices with Spring Boot and Spring Cloud.

The circuit breaker is a design pattern where you stop executing some code when the previous attempt(s) have failed. For example, calls to web services/REST APIs and databases can fail if the backend isn’t up and running or if a performance threshold isn’t met. The Resilience4j CircuitBreaker is implemented as a finite state machine with three normal states: CLOSED, OPEN, and HALF_OPEN.

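Below is a minimal sketch of the pattern using the Resilience4j CircuitBreaker API. The threshold values, the “backendService” breaker name, and the callBackend() method are illustrative assumptions, not values prescribed by the course:

import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;

import java.time.Duration;

public class CircuitBreakerExample {
    public static void main(String[] args) {
        // Trip the breaker when 50% of the last 10 calls fail, stay OPEN for
        // 30 seconds, then allow trial calls in the HALF_OPEN state.
        CircuitBreakerConfig config = CircuitBreakerConfig.custom()
                .failureRateThreshold(50)
                .slidingWindowSize(10)
                .waitDurationInOpenState(Duration.ofSeconds(30))
                .build();

        CircuitBreaker breaker = CircuitBreakerRegistry.of(config)
                .circuitBreaker("backendService");   // hypothetical breaker name

        String result;
        try {
            // While CLOSED, calls pass through; once OPEN, executeSupplier fails
            // fast with CallNotPermittedException instead of calling the backend.
            result = breaker.executeSupplier(CircuitBreakerExample::callBackend);
        } catch (Exception e) {
            result = "fallback-response";            // degrade gracefully
        }
        System.out.println(result);
    }

    // Hypothetical placeholder for the real REST or database call.
    private static String callBackend() {
        return "live-response";
    }
}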

Building Data Pipelines in Kafka

This tutorial is adapted from the Web Age course Kafka for Application Developers Training.

1.1 Building Data Pipelines

Data pipelines can involve various use cases:

  • Building a data pipeline where Apache Kafka is one of the two endpoints. For example, getting data from Kafka to S3 or getting data from MongoDB into Kafka. 
  • Building a pipeline between two different systems but using Kafka as an intermediary. For example, getting data from Twitter to Elasticsearch by sending the data first from Twitter to Kafka and then from Kafka to Elasticsearch. 

The main value Kafka provides to data pipelines is its ability to serve as a very large, reliable buffer between various stages in the pipeline, effectively decoupling producers and consumers of data within the pipeline. This decoupling, combined with reliability, security, and efficiency, makes Kafka a good fit for most data pipelines.

1.2 Considerations When Building Data Pipelines

  • Timeliness
  • Reliability
  • High and varying throughput
  • Data formats
  • Transformations
  • Security
  • Failure handling
  • Coupling and agility

1.3 Timeliness

Good data integration systems can support different timeliness requirements for different pipelines, and Kafka makes migrating between different timetables easier as business requirements change. Kafka is a scalable and reliable streaming data platform that can be used to support anything from near-real-time pipelines to hourly batches. Producers can write to Kafka as frequently as needed, and consumers can read and deliver the latest events as they arrive. Consumers can also work in batches when required, for example running every hour, connecting to Kafka, and reading the events that accumulated during the previous hour. Kafka acts as a buffer that decouples the time-sensitivity requirements of producers and consumers: producers can write events in real time while consumers process batches of events, or vice versa. The consumption rate is driven entirely by the consumers.
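As a minimal sketch of that batch style, the following consumer (standard Kafka Java client; the broker address, group id, and “orders” topic are assumptions) drains whatever accumulated since its last committed offsets and then exits, leaving the hourly scheduling to something like cron:

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class HourlyBatchConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
        props.put("group.id", "hourly-batch");               // committed offsets mark where the last run stopped
        props.put("enable.auto.commit", "false");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("orders"));   // hypothetical topic

            ConsumerRecords<String, String> batch;
            do {
                // Read what accumulated since the previous run, then stop.
                batch = consumer.poll(Duration.ofSeconds(5));
                for (ConsumerRecord<String, String> record : batch) {
                    System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
                }
                consumer.commitSync();
            } while (!batch.isEmpty());
        }
    }
}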

1.4 Reliability

System failures lasting more than a few seconds can be hugely disruptive, especially when the timeliness requirement is closer to the few-milliseconds end of the spectrum. Data integration systems should avoid single points of failure and allow for fast, automatic recovery from all sorts of failure events. Data pipelines are often the way data arrives in business-critical systems. Another important consideration for reliability is delivery guarantees, and Kafka offers reliable, guaranteed delivery.

1.5 High and Varying Throughput

Data pipelines should be able to scale to very high throughput and adapt if throughput suddenly increases or decreases. With Kafka acting as a buffer between producers and consumers, we no longer need to couple consumer throughput to producer throughput. If producer throughput exceeds that of the consumer, data accumulates in Kafka until the consumer can catch up. Kafka’s ability to add consumers or producers independently allows us to scale either side of the pipeline dynamically to match changing requirements. Kafka is a high-throughput distributed system capable of processing hundreds of megabytes per second on even modest clusters. Kafka also focuses on parallelizing the work, not just scaling it out: data sources and sinks can split the work between multiple threads of execution and use the available CPU resources even when running on a single machine. Kafka also supports several types of compression, allowing users and admins to control the use of network and storage resources as throughput requirements increase.
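Throughput-related knobs such as compression and batching are plain producer configuration. The sketch below uses the standard Java producer; the broker address, the “metrics” topic, and the specific values are assumptions to adjust for your environment:

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

public class CompressedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        // Trade a little CPU for less network and disk usage as throughput grows.
        props.put("compression.type", "lz4");   // also: gzip, snappy, zstd
        props.put("batch.size", "65536");       // batch up to 64 KB per partition
        props.put("linger.ms", "20");           // wait up to 20 ms to fill a batch

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("metrics", "host-1", "cpu=0.42"));   // hypothetical topic
        }
    }
}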

1.6 Data Formats

A good data integration platform allows for and reconciles different data formats and data types. The data types supported vary among different databases and other storage systems. For example, you may be loading XML and relational data into Kafka and then need to convert the data to JSON when writing it out. Kafka itself and the Connect APIs are completely agnostic when it comes to data formats: producers and consumers can use any serializer to represent data in any format that works for you. Kafka Connect has its own in-memory objects that include data types and schemas, but it allows pluggable converters so these records can be stored in any format. Many sources and sinks have a schema; we can read the schema from the source along with the data, store it, and use it to validate compatibility or even update the schema in the sink database. For example, if someone adds a column in MySQL, a good pipeline will make sure the column gets added to Hive too as new data is loaded into it. When writing data from Kafka to external systems, sink connectors are responsible for the format in which the data is written to the external system, and some connectors make this format pluggable. For example, the HDFS connector allows a choice between Avro and Parquet formats.
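To illustrate how format-agnostic the clients are, here is a sketch of a custom JSON serializer built on Jackson (the OrderEvent type and its fields are hypothetical); it would be wired in through the producer’s value.serializer setting, and a matching Deserializer would be implemented the same way on the consumer side:

import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.kafka.common.serialization.Serializer;

// Hypothetical event type; any POJO works because Kafka only sees bytes.
class OrderEvent {
    public String orderId;
    public double amount;
}

// Kafka does not care about the format: plugging in a serializer is all that is
// needed to write JSON (or Avro, Protobuf, XML, ...) into a topic.
public class JsonOrderSerializer implements Serializer<OrderEvent> {
    private final ObjectMapper mapper = new ObjectMapper();

    @Override
    public byte[] serialize(String topic, OrderEvent event) {
        try {
            return mapper.writeValueAsBytes(event);
        } catch (Exception e) {
            throw new RuntimeException("Failed to serialize OrderEvent", e);
        }
    }
}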

1.7 Transformations

There are generally two schools of building data pipelines:

  • ETL (Extract-Transform-Load)
  • ELT (Extract-Load-Transform)

ETL – the data pipeline is responsible for making modifications to the data as it passes through. It has the perceived benefit of saving time and storage because you don’t need to store the data, modify it, and store it again. It shifts the burden of computation and storage onto the data pipeline itself, which may or may not be desirable. The transformations that happen in the pipeline also tie the hands of those who wish to process the data farther down the pipe: if users later require access to fields that were dropped, the pipeline needs to be rebuilt and historical data will require reprocessing (assuming it is available).

ELT – the data pipeline does only minimal transformation (mostly around data type conversion), with the goal of making sure the data that arrives at the target is as similar as possible to the source data. These are also called high-fidelity pipelines or data-lake architectures. In these systems, the target collects the raw data and all required processing is done at the target system, so users of the target system have access to all the data. These systems also tend to be easier to troubleshoot, since all data processing is limited to one system rather than split between the pipeline and additional applications. The trade-off is that the transformations consume CPU and storage resources at the target system.

1.8 Security

In terms of data pipelines, the main security concerns are:

  • Encryption – the data going through the pipe should be encrypted. This is mainly a concern for data pipelines that cross datacenter boundaries.
  • Authorization – Who is allowed to make modifications to the pipelines?
  • Authentication – If the data pipeline needs to read or write from access-controlled locations, can it authenticate properly?

Kafka allows encrypting data on the wire as it is piped from sources to Kafka and from Kafka to sinks. It also supports authentication (via SASL) and authorization. Together, these security features ensure that sensitive data can’t be piped into less secure systems by someone unauthorized. Kafka also provides an audit log to track access, both authorized and unauthorized. With some extra coding, it is also possible to track where the events in each topic came from and who modified them, so you can provide the entire lineage for each record.
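These concerns map to ordinary client configuration. A sketch of a SASL/TLS client setup follows; the broker address, mechanism, credentials, and trust store path are placeholders rather than values from the course:

import java.util.Properties;

public class SecureClientConfig {
    public static Properties secureProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker.example.com:9093");   // assumed TLS listener

        // Encrypt data on the wire and authenticate the client via SASL.
        props.put("security.protocol", "SASL_SSL");
        props.put("sasl.mechanism", "SCRAM-SHA-512");
        props.put("sasl.jaas.config",
                "org.apache.kafka.common.security.scram.ScramLoginModule required "
              + "username=\"pipeline-user\" password=\"change-me\";");   // placeholder credentials

        // Trust store used to verify the broker's TLS certificate.
        props.put("ssl.truststore.location", "/etc/kafka/client.truststore.jks");
        props.put("ssl.truststore.password", "change-me");
        return props;
    }
}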

1.9 Failure Handling

It is important to plan for failure handling in advance, such as:

  • Can we prevent faulty records from ever making it into the pipeline?
  • Can we recover from records that cannot be parsed?
  • Can bad records get fixed (perhaps by a human) and reprocessed?
  • What if the bad event looks exactly like a normal event and you only discover the problem a few days later?

Because Kafka stores all events for long periods of time, it is possible to go back in time and recover from errors when needed.
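Because topics retain events, a consumer can be rewound to an earlier point and the data reprocessed. Below is a minimal sketch using the standard consumer API; it assumes the consumer is already assigned the partition in question and that replaying the last three days is the desired recovery window:

import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.common.TopicPartition;

import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

public class ReplayFromTimestamp {

    // Rewind a consumer to the offsets that were current three days ago, so
    // records written since then can be reprocessed after a bad deploy.
    public static void rewind(KafkaConsumer<String, String> consumer, TopicPartition partition) {
        long threeDaysAgo = Instant.now().minus(Duration.ofDays(3)).toEpochMilli();

        Map<TopicPartition, Long> query = new HashMap<>();
        query.put(partition, threeDaysAgo);

        Map<TopicPartition, OffsetAndTimestamp> offsets = consumer.offsetsForTimes(query);
        OffsetAndTimestamp target = offsets.get(partition);
        if (target != null) {
            consumer.seek(partition, target.offset());   // the next poll() starts here
        }
    }
}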

1.10 Coupling and Agility

One of the most important goals of data pipelines is to decouple the data sources and data targets.

There are multiple ways accidental coupling can happen:

  • Ad-hoc pipelines
  • Loss of metadata
  • Extreme processing

1.11 Ad-hoc Pipelines

Some companies end up building a custom pipeline for each pair of applications they want to connect.

For example:

  • Use Logstash to dump logs to Elasticsearch
  • Use Flume to dump logs to HDFS
  • Use GoldenGate to get data from Oracle to HDFS
  • Use Informatica to get data from MySQL and XMLs to Oracle

This tightly couples the data pipeline to the specific endpoints and creates a mess of integration points that requires significant effort to deploy, maintain, and monitor. Data pipelines should only be built where they are really required.

1.12 Loss of Metadata

If the data pipeline doesn’t preserve schema metadata and does not allow for schema evolution, you end up tightly coupling the software producing the data at the source to the software that uses it at the destination. Without schema information, both software products need to include information on how to parse and interpret the data. For example, if data flows from Oracle to HDFS and a DBA adds a new field in Oracle without the pipeline preserving schema information and allowing schema evolution, either every app that reads data from HDFS will break or all the developers will need to upgrade their applications at the same time. Neither option is agile. With support for schema evolution in the pipeline, each team can modify their applications at their own pace without worrying that things will break down the line.
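As an illustration of why defaults matter for evolution, the following sketch uses Apache Avro’s compatibility checker (Avro is an assumption here, not mandated by the course) to show that adding a field with a default keeps the new reader schema compatible with data written under the old schema:

import org.apache.avro.Schema;
import org.apache.avro.SchemaCompatibility;

public class SchemaEvolutionCheck {
    public static void main(String[] args) {
        // Original schema: just an id.
        Schema v1 = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Customer\",\"fields\":["
          + "{\"name\":\"id\",\"type\":\"long\"}]}");

        // v2 adds a field WITH a default, so old data remains readable.
        Schema v2 = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Customer\",\"fields\":["
          + "{\"name\":\"id\",\"type\":\"long\"},"
          + "{\"name\":\"email\",\"type\":[\"null\",\"string\"],\"default\":null}]}");

        // Check whether a reader using v2 can consume records written with v1.
        SchemaCompatibility.SchemaPairCompatibility result =
            SchemaCompatibility.checkReaderWriterCompatibility(v2, v1);
        System.out.println(result.getType());   // prints COMPATIBLE
    }
}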

1.13 Extreme Processing

Some processing/transformation of data is inherent to data pipelines. Too much processing, however, ties all the downstream systems to decisions made when building the pipelines: which fields to preserve, how to aggregate data, and so on. This often leads to constant changes to the pipeline as the requirements of downstream applications change, which isn’t agile, efficient, or safe. The more agile way is to preserve as much of the raw data as possible and allow downstream apps to make their own decisions regarding data processing and aggregation.

1.15 Kafka Connect Versus Producer and Consumer

When writing to Kafka or reading from Kafka, you have the choice between using traditional producer and consumer clients and using the Connect APIs and connectors. Use the Kafka clients when you can modify the code of the application that you want to connect to Kafka and when you want to either push data into Kafka or pull data from Kafka. Use Connect to connect Kafka to datastores that you did not write and whose code you cannot or will not modify; Connect pulls data from the external datastore into Kafka or pushes data from Kafka to an external store. For datastores where a connector already exists, Connect can be used by non-developers, who only need to configure the connectors. Connect is recommended because it provides out-of-the-box features like configuration management, offset storage, parallelization, error handling, support for different data types, and standard management REST APIs. If you need to connect Kafka to a datastore and a connector does not exist yet, you can choose between writing an app using the Kafka clients or using the Connect API. Writing a small app that connects Kafka to a datastore sounds simple, but there are many little details you will need to handle, such as data types and configuration, that make the task non-trivial. Kafka Connect handles most of this for you, allowing you to focus on transporting data to and from the external stores.

1.16 Summary

  • Kafka can be used to implement data pipelines
  • When designing data pipelines, factors such as timeliness, reliability, throughput, data formats, security, failure handling, and coupling should be considered.
  • One of Kafka’s most important features is its ability to deliver messages reliably even in the face of failures.

Security in Microservices

This tutorial is adapted from the Web Age course Architecting Microservices with Kubernetes, Docker, and Continuous Integration Training.

1.1 Why Microservice Security?

Security is important in all systems, and it is more complicated in a distributed system. We can no longer put our application behind a firewall and assume that nothing will break through.

1.2 Security Testing in Microservices

Testing the security of microservices is required and may vary slightly depending on how you implement security. If individual services just do token validation, we have to test at the individual service level. If services rely on another service or a library to validate tokens, they should be tested in isolation, with those interactions tested and validated in a production-like environment.

1.3 Security Topology


Collections of microservices are grouped as applications and surface external APIs as endpoints through a gateway. One obvious difference between a monolithic and a microservice topology is the number of moving parts; another is how many more requests there are. In a microservice enterprise architecture, you need to be aware that the applications may make or respond to many millions of requests and responses.

1.4 Authorization and Authentication

Authentication

  • Establishing the user’s identity is performed by a dedicated, centralized service or an API gateway.
  • The central service could then further delegate user authentication to a third party.

Authorization

  • When establishing a user’s authority or permission to access a secured resource in a microservices environment, keep group or role definitions coarse-grained in common, cross-cutting services.
  • Allow individual services to maintain their own fine-grained controls.
  • This is different from typical monolithic applications, where fine-grained roles are stored in the application database or some centralized mechanism. For microservices, that approach can be an antipattern, because the central resource has to be modified whenever services are deployed or changed.

1.5 J2EE Security Refresh

Each Java EE server/service must be trusted by all of the other services. Users receive security tokens after authentication (in WebSphere Application Server, this is an LTPA token). User tokens must be valid on all servers. Trust stores are defined so that every server has the certificates of every other server. All services must use the same finite set of keys, and all J2EE servers for the same application(s) must be in the same security realm.

1.6 Role-based Access Control in a Nutshell

To implement role-based access control in code, suppose our system has an action for creating and updating customers that we want the ‘Sales’ role to have access to, so we write code like this:

[Authorize(Roles = "Sales")]
public ActionResult CreateUpdateCustomer()
{
    return View();
}

Later, we realize that sometimes people in the ‘Operations’ role should also be able to create and update customers, so we update the action method as:

[Authorize(Roles = "Sales,Operations")]
public ActionResult CreateUpdateCustomer()
{
    return View();
}

Later again, we realize that some of the Operations people should not be able to perform this action, but it is not possible to assign a different role to just those people. So we are forced to allow all Operations people to create and update customers.

Any time we change which roles should be allowed to create and update customers, we have to update the Authorize attributes on all of our MVC action methods, then build, deploy, test, and release our application.

Finally, we realize that ‘Operations’ was the wrong additional group for this role and we actually needed ‘Marketing’; now we’re updating code again across a bunch of services and re-releasing.

1.7 Claim-based Access Control in a Nutshell

We define a set of claims like this:

“AllowCreateCustomer”, “AllowUpdateCustomer”, “AllowEditCustomer”

Now, we can decorate our Action Method like this:

[ClaimAuthorize(Permission = "CanCreateCustomer")]
public ActionResult CreateCustomer()
{
    return View();
}

We can see that the CreateCustomer action method will always need the ‘CanCreateCustomer’ permission, and that is very unlikely to change. In our data store, we create the set of permissions (claims) and relate users to those permissions. From our administration interface, we can then set the permission (claim) for each user who may perform an action or operation; for example, we assign the ‘CanCreateCustomer’ permission (claim) to anyone who should be permitted to create customers. Users will only be able to do what has been assigned to them from a permissions perspective. Claim information is usually passed between requests in a cookie to avoid database access on every request, again with a view to security and performance.

1.8 Sharing Sessions

When we split authentication off from a “monolith” application, we have two challenges to contend with:

  • Sharing cookies between the auth server(s) and application server(s). On one server on one domain, this was not an issue; with multiple servers on multiple domains, it is. We’ll address this challenge by running all servers under one domain and proxying to the various servers.
  • Sharing a session store across server(s). With a single monolith, we can write sessions to disk, store them in memory, or write them to a database running on the same container. This won’t work if we want to scale our application server to many instances, as they will not share memory or a local filesystem. We address this challenge by externalizing our session store and sharing it across instances, as in the sketch below.
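Here is a minimal sketch of externalizing the session store with Spring Session and Redis; it assumes the spring-session-data-redis and Spring Data Redis dependencies are on the classpath, and the Redis host is a placeholder:

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory;
import org.springframework.session.data.redis.config.annotation.web.http.EnableRedisHttpSession;

// Sessions are written to Redis instead of local memory or disk, so any
// application instance behind the proxy can serve any request.
@Configuration
@EnableRedisHttpSession
public class SessionConfig {

    @Bean
    public LettuceConnectionFactory redisConnectionFactory() {
        return new LettuceConnectionFactory("redis.example.com", 6379);   // placeholder host
    }
}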

1.9 Session Cookie


JSON Web Tokens need to be transient. Creating long-lived JWTs isn’t practical, as they’re self-contained and there’s no way to revoke them.

Upon successful authentication, a PersistentJwtTokenBasedRememberMeServices:

  • creates a persistent Session object
  • saves it to the database
  • converts the Session into a JWT token
  • persists the Session to a cookie on the client’s side (Set-Cookie)
  • creates a transient JWT

The JWT is meant to be used throughout the lifetime of the single-page front end and passed in the HTTP header (X-Set-Authorization-Bearer).

1.10 JSON Web Token (JWT)


JSON Web Token (JWT) is an open standard (RFC 7519) that defines a compact and self-contained way to securely transmit information between parties as a JSON object. This information can be verified and trusted because it is digitally signed. A JWT contains a header and a payload (plus a signature over both when the token is signed). The first part is the token’s header, which identifies the token’s type and the algorithm used to sign the token. The second part is the payload, or its claims. There is a distinction between these two: a payload can be an arbitrary set of data, even plain text or another (nested) JWT, while claims are a standard set of fields.

https://jwt.io
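A small sketch of issuing and verifying a signed token follows, assuming the jjwt library (io.jsonwebtoken, 0.11.x API); the subject, custom claim, and lifetime are illustrative:

import io.jsonwebtoken.Claims;
import io.jsonwebtoken.Jwts;
import io.jsonwebtoken.SignatureAlgorithm;
import io.jsonwebtoken.security.Keys;

import javax.crypto.SecretKey;
import java.util.Date;

public class JwtExample {
    public static void main(String[] args) {
        // Signing key; in a real system this is shared and managed, not generated per run.
        SecretKey key = Keys.secretKeyFor(SignatureAlgorithm.HS256);

        // The header (alg/typ) is added automatically; we set the claims (payload).
        String token = Jwts.builder()
                .setSubject("jdoe")                                            // standard "sub" claim
                .claim("role", "sales")                                        // custom claim
                .setExpiration(new Date(System.currentTimeMillis() + 15 * 60 * 1000))
                .signWith(key)
                .compact();

        // Verification throws an exception if the signature or expiry is invalid.
        Claims claims = Jwts.parserBuilder()
                .setSigningKey(key)
                .build()
                .parseClaimsJws(token)
                .getBody();

        System.out.println(claims.getSubject() + " / " + claims.get("role"));
    }
}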


1.11 Spring Security

Spring Security is a powerful and highly customizable authentication and access-control framework. It is the de facto standard for securing Spring-based applications. A minimal configuration sketch follows the feature list below.

Spring Security features include:

  • Comprehensive and extensible support for both Authentication and Authorization
  • Protection against attacks like session fixation, clickjacking, cross-site request forgery
  • Servlet API integration
  • Integration with Spring Web MVC
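The sketch below assumes a Spring Boot 3 / Spring Security 6 application; the /public/** matcher and the HTTP Basic choice are illustrative placeholders rather than a recommended setup:

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.security.config.Customizer;
import org.springframework.security.config.annotation.web.builders.HttpSecurity;
import org.springframework.security.web.SecurityFilterChain;

@Configuration
public class SecurityConfig {

    // Require authentication for every endpoint except a hypothetical public area.
    @Bean
    public SecurityFilterChain filterChain(HttpSecurity http) throws Exception {
        http
            .authorizeHttpRequests(auth -> auth
                .requestMatchers("/public/**").permitAll()
                .anyRequest().authenticated())
            .httpBasic(Customizer.withDefaults());
        return http.build();
    }
}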

1.12 Summary

  • Microservice security requires different techniques from those used in monolithic applications
  • JSON Web Tokens can be created and manipulated by multiple frameworks
  • Role-based and claim-based security allows teams to secure applications