What is AWS Glue?
Before we dive into the AWS Glue tutorial, let’s briefly answer a common question: what actually IS AWS Glue?
AWS Glue is a serverless service for extracting, transforming, and loading data, a process referred to as ETL.
ETL refers to three steps common to most data analytics and machine learning workflows: extraction, transformation, and loading. You extract data from a source, transform it into the shape your applications need, and then load it into a data warehouse, all in the cloud. AWS Glue helps make that happen.
AWS Glue is also a fully managed service, which means we as users don’t have to manage any cloud infrastructure; Amazon takes care of it all. The AWS console UI offers a straightforward way to carry out the whole task end to end, with no extra scripting required.
AWS Glue runs serverlessly, meaning there is no infrastructure to manage, provision, configure, or scale. You pay only for the resources used while a job is running.
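To make the ETL flow concrete, here is a minimal sketch of a Glue PySpark job (not part of the original tutorial) that extracts a table from the Data Catalog, applies a simple transformation, and loads the result to S3. The database, table, column, and S3 path names are placeholders.
from pyspark.context import SparkContext
from awsglue.context import GlueContext

glueContext = GlueContext(SparkContext.getOrCreate())
# Extract: read a table registered in the Glue Data Catalog (placeholder names)
dyf = glueContext.create_dynamic_frame.from_catalog(database="my_db", table_name="my_table")
# Transform: here, simply drop a column we don't need
dyf = dyf.drop_fields(['internal_notes'])
# Load: write the result out as CSV files in S3 (placeholder path)
glueContext.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/output/"},
    format="csv")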
AWS Glue Tutorial: AWS Glue PySpark Extensions
1.1 AWS Glue and Spark
1.2 The DynamicFrame Object
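A DynamicFrame is Glue's schema-flexible counterpart to a Spark DataFrame, and you can convert between the two. A minimal sketch, assuming a glueContext and a DynamicFrame named df already exist:
from awsglue.dynamicframe import DynamicFrame

spark_df = df.toDF()  # DynamicFrame -> Spark DataFrame
df_back = DynamicFrame.fromDF(spark_df, glueContext, "df_back")  # DataFrame -> DynamicFrame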
1.3 The DynamicFrame API
# Drop unneeded fields (columns)
df.drop_fields(['other_names','identifiers'])
# Rename fields; rename_field calls can be chained
df.rename_field('id', 'org_id').rename_field('name', 'org_name')
# Filter records with a predicate function (DynamicFrame.filter expects a function, not a SQL string)
partitions = df.filter(lambda row: row["type"] == "partition")
# Work with the underlying RDD via a Spark DataFrame (id_col, key, and value hold column names)
df.toDF().rdd.map(lambda row: (row[id_col], {row[key]: row[value]}))
# Project a single field, convert to a DataFrame, and show its distinct values
df.select_fields(['organization_id']).toDF().distinct().show()
1.4 The GlueContext Object
# Create a GlueContext, the entry point for reading and writing DynamicFrames
glueContext = GlueContext(SparkContext.getOrCreate())
1.5 Glue Transforms
# Join two DynamicFrames on the given key columns
dyf_joined = Join.apply(dyf_1, dyf_2, j_col_dyf_1, j_col_dyf_2)
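Join is only one of the built-in transforms. As a further illustration, here is a hedged sketch of the ApplyMapping transform, which renames and retypes fields in one pass; the field names and types below are placeholders, not values from this tutorial:
# ApplyMapping comes from awsglue.transforms, like Join
dyf_mapped = ApplyMapping.apply(
    frame=dyf_1,
    mappings=[("id", "string", "order_id", "string"),
              ("amount", "string", "amount", "double")])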
1.6 A Sample Glue PySpark Script
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
glueContext = GlueContext(SparkContext.getOrCreate())
orders = glueContext.create_dynamic_frame.from_catalog(database="sx_db",
table_name="order_csv")
# orders is of type <class 'awsglue.dynamicframe.DynamicFrame'>
# You can get the count of records in the DynamicFrame with this command: orders.count()
# Projections (select fields from the DynamicFrame, then convert to a DataFrame to display them):
# orders.select_fields(['order id', 'employee id', 'customer id', 'order summary']).toDF().show(5)
# Renaming columns (fields):
# orders.rename_field("`payment type`", "pmtt").toDF().columns
order_details = glueContext.create_dynamic_frame.from_catalog(database="sx_db",
table_name="order_details_csv")
# Joining two Glue DynamicFrames on the 'order id' column (field)
dyf_joined = Join.apply(order_details, orders, 'order id', 'order id')
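After the join you may want to sanity-check the result before using it further; a short illustrative example using the APIs shown above:
# Inspect the joined DynamicFrame
dyf_joined.printSchema()     # the combined schema of both inputs
print(dyf_joined.count())    # number of joined records
dyf_joined.toDF().show(5)    # peek at the first few rows via a Spark DataFrame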
1.7 Using PySpark
# Here is how you can access S3 using PySpark:
orders = spark.read.format("com.databricks.spark.csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .option("sep", "\t") \
    .load('s3://webage-data-sets/glue-data-sets/order.csv')
# orders object is Spark's DataFrame object, which you can convert to Glue's DynamicFrame object using this code:
from awsglue.dynamicframe import DynamicFrame
orders_dyf = DynamicFrame.fromDF(orders, glueContext, "orders_dyf")
1.8 AWS Glue PySpark SDK
import boto3
glue = boto3.client(service_name='glue', region_name='us-east-1',
                    endpoint_url='https://glue.us-east-1.amazonaws.com')
glueContext = GlueContext(SparkContext.getOrCreate())
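With the boto3 Glue client in hand, you can call the Glue service APIs directly. A small illustrative example that lists the ETL jobs and Data Catalog databases in the account:
# List the ETL jobs defined in the account
for job in glue.get_jobs()['Jobs']:
    print(job['Name'])
# List the databases in the Glue Data Catalog
for db in glue.get_databases()['DatabaseList']:
    print(db['Name'])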
Notes:
Glue client code sample
Here is an example of a Glue client packaged as a Lambda function (running on automatically provisioned servers) that starts an ETL job and passes it input parameters. The code samples are taken and adapted from this source: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-calling.html
The Lambda function code:
import boto3
from datetime import datetime, timedelta

glue_client = boto3.client('glue')

# This is the callback invoked by AWS in response to an event (e.g. a record is
# inserted into a DynamoDB NoSQL database)
def lambda_handler(event, context):
    last_hour_date_time = datetime.now() - timedelta(hours=1)
    day_partition_value = last_hour_date_time.strftime("%Y-%m-%d")
    hour_partition_value = last_hour_date_time.strftime("%-H")
    response = glue_client.start_job_run(
        JobName='my_test_Job',
        Arguments={  # a set of key-value pairs passed to the job
            '--day_partition_key': 'partition_0',
            '--hour_partition_key': 'partition_1',
            '--day_partition_value': day_partition_value,
            '--hour_partition_value': hour_partition_value})
The AWS Glue script:
import sys
from awsglue.utils import getResolvedOptions

# getResolvedOptions offers a reliable way to access values in the sys.argv list
args = getResolvedOptions(sys.argv,
                          ['JOB_NAME',  # 'my_test_Job'
                           'day_partition_key',
                           'hour_partition_key',
                           'day_partition_value',
                           'hour_partition_value'])
print("The day partition key is: ", args['day_partition_key'])
print("and the day partition value is: ", args['day_partition_value'])
Note that each of the arguments is defined as beginning with two hyphens, then referenced in the script without the hyphens. Your arguments need to follow this convention to be resolved.
Congratulations!
Seriously, there’s so much to do with AWS Glue!
AWS Glue features in several of our AWS certification training courses including the following:
Data Analytics on AWS
Building Data Lakes on AWS
Big Data on AWS
Data Science and Data Engineering for Architects
AWS Advanced Analytics for Structured Data
Have a friend who would enjoy this tutorial? Invite them to read it!