What is AWS Glue?

Before we dive into the AWS Glue tutorial, let's briefly answer the common question: what actually is AWS Glue?

AWS Glue is a serverless tool built for extracting, transforming, and loading data. This process is referred to as ETL.

ETL covers three steps common to most data analytics and machine learning pipelines: Extraction, Transformation, and Loading. You extract data from a source, transform it into the shape your applications need, and load it into the data warehouse, all in the cloud. AWS Glue helps us make that happen.

AWS Glue is also a fully managed service: as users, we don't have to manage any cloud infrastructure, because Amazon takes care of it. The AWS console UI offers a straightforward way to carry out the whole task end to end, with no extra scripting required.

Because AWS Glue runs serverlessly, there is no infrastructure for you to manage, provision, configure, or scale, and you only pay for the resources used while a job is running.







    AWS Glue Tutorial: AWS Glue PySpark Extensions

    1.1 AWS Glue and Spark

    AWS Glue is based on the Apache Spark platform, extending it with Glue-specific libraries. In this AWS Glue tutorial, we will only review Glue's support for PySpark. As of version 2.0, Glue supports Python 3, which you should use in your development.

     

    1.2 The DynamicFrame Object

    The AWS Glue API is centered around the DynamicFrame object, which is an extension of Spark's DataFrame object.

    DynamicFrame offers finer control over schema inference and some other benefits over the standard Spark DataFrame. These benefits come from the DynamicRecord object, which represents a logical record in a DynamicFrame. A DynamicRecord is similar to a row in a Spark DataFrame, except that it is self-describing: it can hold rows that do not conform to a fixed schema and requires no up-front schema definition. Glue lets developers switch programmatically between a DynamicFrame and a DataFrame using the DynamicFrame's toDF() and fromDF() methods.
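
    As a quick illustration, here is a minimal sketch of switching between the two representations; the dyf variable and the glueContext object are assumptions for this example (a DynamicFrame and a GlueContext created as shown in sections 1.4 and 1.6 below):

    # Convert a Glue DynamicFrame to a standard Spark DataFrame
    data_frame = dyf.toDF()
    data_frame.printSchema()   # inspect the inferred schema
    # Convert the (possibly transformed) DataFrame back into a DynamicFrame
    from awsglue.dynamicframe import DynamicFrame
    dyf_round_trip = DynamicFrame.fromDF(data_frame, glueContext, "dyf_round_trip")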

     

    1.3 The DynamicFrame API

    fromDF() / toDF() – convert between a Spark DataFrame and a DynamicFrame

    count() – returns the number of rows in the underlying DataFrame

    show(num_rows) – prints a specified number of rows from the underlying DataFrame

    Selected transforms with usage examples:

    drop_fields

    df.drop_fields(['other_names', 'identifiers'])

    rename_field

    df.rename_field('id', 'org_id').rename_field('name', 'org_name')

    filter

    partitions = df.filter(f=lambda rec: rec['type'] == 'partition')

    map

    df.map(f=lambda rec: {'id': rec[id_col], rec[key]: rec[value]})

    select_fields

    df.select_fields(['organization_id']).toDF().distinct().show()
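
    Pulling a few of these calls together, here is a minimal sketch of chaining DynamicFrame transforms; the orgs DynamicFrame and its field names are assumptions used purely for illustration:

    # orgs is assumed to be a DynamicFrame with 'id', 'name', 'type' and 'other_names' fields
    orgs_clean = (orgs.drop_fields(['other_names'])
                      .rename_field('id', 'org_id')
                      .rename_field('name', 'org_name'))
    # DynamicFrame.filter takes a function that is evaluated against each record
    partitions = orgs_clean.filter(f=lambda rec: rec['type'] == 'partition')
    # Inspect the result through the Spark DataFrame API
    partitions.select_fields(['org_id', 'org_name']).toDF().show(5)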

    For more details on AWS Glue PySpark extensions and transformations, see the AWS Glue developer guide.

     


    1.4 The GlueContext Object

    GlueContext is a wrapper around the SparkContext object that you need to create before you can use the Glue API.

    GlueContext creation code:

    glueContext = GlueContext(SparkContext.getOrCreate())
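
    For the one-liner above to run inside a Glue job script, the following imports are assumed (they also appear in the sample script in section 1.6):

    from pyspark.context import SparkContext
    from awsglue.context import GlueContext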

     

    1.5 Glue Transforms

    Transforms in Glue are classes that help you code logic for manipulating your data, including:

    DropFields, DropNullFields, Filter, Join, RenameField, SelectFields, and others

    Example of using the Join transform for joining two DynamicFrames:

    dyf_joined = Join.apply(dyf_1, dyf_2, j_col_dyf_1, j_col_dyf_2)
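
    In the same spirit, here is a minimal sketch of the Filter transform class from the list above; the dyf_orders DynamicFrame and its 'status' field are assumptions for illustration:

    from awsglue.transforms import Filter
    # Keep only the records whose 'status' field equals 'shipped' (hypothetical field and value)
    dyf_shipped = Filter.apply(frame=dyf_orders, f=lambda rec: rec['status'] == 'shipped')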

    For more details on the available Glue transforms, see the AWS Glue developer guide.

    1.6 A Sample Glue PySpark Script

    from awsglue.transforms import *
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job
    glueContext = GlueContext(SparkContext.getOrCreate())
    orders = glueContext.create_dynamic_frame.from_catalog(database="sx_db",
        table_name="order_csv")
    # orders is of type <class 'awsglue.dynamicframe.DynamicFrame'>
    # You can get the count of records in the DynamicFrame with this command: orders.count()
    # Projections (select_fields), with the result shown via the DataFrame API:
    # orders.select_fields(['order id', 'employee id', 'customer id', 'order summary']).toDF().show(5)
    # Renaming columns (fields): orders.rename_field("`payment type`", "pmtt").toDF().columns
    order_details = glueContext.create_dynamic_frame.from_catalog(database="sx_db",
        table_name="order_details_csv")
    # Joining two Glue DynamicFrames on the 'order id' column (field)
    dyf_joined = Join.apply(order_details, orders, 'order id', 'order id')
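
    To round the sample out, here is a minimal sketch of writing the joined result back to S3; the output path and format are assumptions, not part of the original script:

    # Write the joined DynamicFrame to S3 as CSV (hypothetical output location)
    glueContext.write_dynamic_frame.from_options(
        frame=dyf_joined,
        connection_type="s3",
        connection_options={"path": "s3://my-example-bucket/joined-orders/"},
        format="csv")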

     

    1.7 Using PySpark

    # Here is how you can access S3 using PySpark. The Spark session is exposed by the GlueContext:
    spark = glueContext.spark_session
    orders = spark.read.format("com.databricks.spark.csv") \
             .option("header", "true") \
             .option("inferSchema", "true") \
             .option("sep", "\t") \
             .load('s3://webage-data-sets/glue-data-sets/order.csv')
    # The orders object is a Spark DataFrame, which you can convert to a Glue DynamicFrame like so:

    from awsglue.dynamicframe import DynamicFrame
    orders_dyf = DynamicFrame.fromDF(orders, glueContext, "orders_dyf")

     

    1.8 AWS Glue PySpark SDK

    PySpark integrates with the AWS SDK via the boto3 module:

    import boto3
    glue = boto3.client(service_name='glue', region_name='us-east-1',
                        endpoint_url='https://glue.us-east-1.amazonaws.com')
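
    With the client in hand, you can call any Glue API operation. As a small illustrative sketch, here is how you might list the databases registered in the Glue Data Catalog:

    # List the databases in the Glue Data Catalog using the boto3 client created above
    response = glue.get_databases()
    for db in response['DatabaseList']:
        print(db['Name'])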

    Most of the AWS Glue functionality comes from the awsglue module. The facade API object awsglue.context.GlueContext wraps the Apache Spark SQLContext object, and you create it like so:

    glueContext = GlueContext(SparkContext.getOrCreate())

    AWS Glue ETL code samples can be found in the AWS Glue samples repository on GitHub.

     

    Notes:

    Glue client code sample

    Here is an example of a Glue client packaged as a Lambda function (running on servers that AWS provisions automatically) that invokes an ETL script and passes it input parameters. The code samples are taken and adapted from an external source.

    The lambda function code:

    import boto3
    from datetime import datetime, timedelta

    glue_client = boto3.client('glue')

    # This is the callback invoked by AWS in response to an event (e.g. a record is
    # inserted into a DynamoDB NoSQL database)
    def lambda_handler(event, context):
        last_hour_date_time = datetime.now() - timedelta(hours=1)
        day_partition_value = last_hour_date_time.strftime("%Y-%m-%d")
        hour_partition_value = last_hour_date_time.strftime("%-H")

        response = glue_client.start_job_run(
            JobName='my_test_Job',
            Arguments={            # a set of key-value pairs
                '--day_partition_key': 'partition_0',
                '--hour_partition_key': 'partition_1',
                '--day_partition_value': day_partition_value,
                '--hour_partition_value': hour_partition_value})
        return response

     

    The AWS Glue script:

    import sys
    from awsglue.utils import getResolvedOptions

    # getResolvedOptions offers a reliable way to access values in the sys.argv list
    args = getResolvedOptions(sys.argv,
                              ['JOB_NAME',             # 'my_test_Job'
                               'day_partition_key',
                               'hour_partition_key',
                               'day_partition_value',
                               'hour_partition_value'])
    print("The day partition key is:", args['day_partition_key'])
    print("and the day partition value is:", args['day_partition_value'])

    Note that each of the arguments is defined as beginning with two hyphens, then referenced in the script without the hyphens. Your arguments need to follow this convention to be resolved.
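
    As a final hedged sketch, here is one way the resolved partition arguments might be used inside the job: pushing the partition values down into a catalog read. The database and table names are carried over from section 1.6 as assumptions, and a glueContext created as in section 1.4 is assumed to exist:

    # Read only the partition matching the values passed in by the Lambda client
    predicate = "({} = '{}' and {} = '{}')".format(
        args['day_partition_key'], args['day_partition_value'],
        args['hour_partition_key'], args['hour_partition_value'])
    dyf = glueContext.create_dynamic_frame.from_catalog(database="sx_db",
        table_name="order_csv",
        push_down_predicate=predicate)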

    Congratulations!

    You’ve just completed this AWS Glue tutorial!

    You looked at using the AWS Glue ETL service and the AWS Glue PySpark extensions and transforms.

    Seriously, there’s so much to do with AWS Glue!

    AWS Glue features in several of our AWS certification training courses including the following:

    Data Analytics on AWS

    Building Data Lakes on AWS 

    Big Data on AWS

    Data Science and Data Engineering for Architects

    AWS Advanced Analytics for Structured Data