In our previous tutorial (Connecting S3 with Lambda on AWS CDK in Typescript), we successfully set up an S3 event notification that invokes our AWS Lambda whenever a user puts a file into the bucket.

We can now start writing unit tests for our Lambda function to make sure our code works exactly as intended when we deploy it. This step doesn’t deploy any resources or infrastructure within the AWS ecosystem, but it is good practice to start testing code as early as possible in your development cycle.

We aren’t going to cover the concept of test-driven development for now, but you can read up on it here: https://en.wikipedia.org/wiki/Test-driven_development

What is a unit test?

In layman’s terms, a unit test exercises a small piece of code, such as a single function, to verify that it behaves as intended and is ready for deployment.
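
As a quick illustration (the add function below is a toy example, not part of our pipeline), a unit test calls a piece of code with known inputs and asserts that the output matches what we expect:

def add(first, second):
    """Toy function used only to illustrate what a unit test checks."""
    return first + second

def test_add():
    # The test fails if the actual output differs from the expected value
    assert add(2, 3) == 5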

Unit tests work especially well for AWS Lambdas because Lambda functions are designed to be small compute units, each with a single responsibility. Overloading a Lambda function with multiple responsibilities is usually a design mistake.
Unit tests can cover a multitude of things, such as code complexity, how long the code takes to run, and functionality. In this tutorial, we will cover only a few items:

  • Code functionality
  • Code coverage
  • Coding standard/style guide

Writing unit tests is vital to any data engineer’s scope of work. Even data scientists should include some code testing to reduce errors.

Preparing your unit testing framework

Before we begin unit testing, we first want to isolate the Python environment from our local machine’s settings.

That’s because a shared environment can mask import errors until deployment: you could have Python packages installed on your local machine that your production environment doesn’t have, and your code would fail in production even though the unit tests passed on your local machine.

  • Run pip install pipenv to install the pipenv package.
  • Within the root directory of your project, run: pipenv install
    • It is important to take note of where you create your pipenv environment. Creating it within the wrong subfolder would render it useless.
  • Run pipenv shell to activate the virtual environment on your terminal.
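
Putting those steps together, the terminal commands look like this:

pip install pipenv   # install pipenv on your machine
pipenv install       # run from the project root to create the environment
pipenv shell         # activate the virtual environment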

Within the shell, you have a clean environment with no packages installed. This is the perfect starting point.
We will then install the following packages for testing:

  • pytest – For testing code.
  • coverage – Generates a report of how much of our code the tests have exercised.
  • pylint – Checks our code against coding standards.
  • pytest-pylint – Not necessary at this point, but I installed it to run pylint from within pytest.

I usually install such packages as dev packages in pipenv. That’s because they aren’t packages our AWS Lambda actually needs at runtime, and hence they shouldn’t bloat the deployment package.

This might be a foreign concept to those who aren’t familiar with CI/CD pipelines, but we’ll get there in another tutorial someday.
To install these packages, run: pipenv install pytest coverage pylint pytest-pylint --dev

Once installed, you can see these packages listed within the Pipfile:

[[source]]
name = "pypi"
url = "https://pypi.org/simple"
verify_ssl = true

[dev-packages]
pylint = "*"
pytest = "*"
coverage = "*"
pytest-pylint = "*"

[packages]

[requires]
python_version = "3.8"

Writing testing scripts for our AWS Lambda

As we are using an AWS CDK TypeScript template, there is already an existing test folder within the project. AWS CDK encourages developers to write test cases for infrastructure deployment.

However, we will not cover AWS CDK testing today; we’ll focus on the AWS Lambda side instead.
Within the test folder, I created a lambda folder containing a test_basic_lambda.py file.

The naming convention is especially important for pytest: it discovers files whose names have a “test_” prefix, which is how it identifies the testing scripts when we have multiple of them within the same folder.

Also, I created an input folder to store the s3_notification.json file that I created in my previous tutorial (See: Connecting S3 With Lambda on AWS CDK in Typescript). We will be using this file to simulate an incoming S3 notification to the Lambda.

You can name your files and folders however you want, but this is the structure that I will be using (see below).
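
Based on the import paths used in the test script below, the relevant parts of the project look roughly like this (the src/lambda layout comes from the previous tutorial):

src/
  lambda/
    basic_lambda/
      lambda_function.py
test/
  lambda/
    input/
      s3_notification.json
    test_basic_lambda.py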

At this juncture, our Lambda function only returns a status code payload. We are going to test against that for now; as our project grows, our unit tests will grow along with it.
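
For context, here is a minimal sketch of what such a handler could look like. The actual lambda_function.py lives in the repo linked at the end; the only behavior our tests rely on is the statusCode key in the returned payload:

def lambda_handler(event, context):
    """Minimal sketch: returns a status code payload for any event."""
    # Illustrative only; the real handler is in the linked repo
    print(event)
    return {"statusCode": 200}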

In test_basic_lambda.py:

"""
File: test_basic_lambda.py
Description: Runs a test for our 'gefyra-basic-lambda' Lambda
"""
import os
import sys
import json

# Getting to the Lambda directory
sys.path.append(os.path.join(os.path.dirname(os.path.realpath(__file__)), "../../src/lambda"))

#pylint: disable=wrong-import-position
from basic_lambda.lambda_function import lambda_handler
#pylint: enable=wrong-import-position

def test_initialization():
    """
    Testing an empty payload event to the Lambda
    """
    event = {}
    context = None

    payload = lambda_handler(event, context)
    assert payload['statusCode'] == 200

def test_s3_notification():
    """
    Testing a mock S3 notification event to the Lambda
    """
    input_dir = os.path.join(os.path.dirname(os.path.realpath(__file__)), "input")
    json_file_dir = os.path.join(input_dir, "s3_notification.json")

    # Extracts the JSON file into a dict
    with open(json_file_dir, encoding="utf-8") as json_file:
        event = json.load(json_file)
    context = None

    payload = lambda_handler(event, context)
    assert payload['statusCode'] == 200

# For direct invocation and testing on the local machine
if __name__ == '__main__':
    test_s3_notification()

Some key takeaways:

  • Import your Lambda handler into your test script and call it like any other function. That way, you can assert against the function’s expected output.
  • We are simulating the Lambda function in our local machine environment for unit testing.
  • As a best practice, dedicate one testing script to one specific resource/functionality. That way, we can use the nomenclature of the scripts to categorize and strategize our test plan (see the example below).
    • Example: If my pipeline has multiple Lambdas and other resources, it isn’t good practice to test everything under one script. Separating them out by resource and functionality with proper nomenclature allows better coverage for testing.
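
For instance (these file names are hypothetical), a pipeline with several resources might organize its tests like so:

test/lambda/test_basic_lambda.py
test/lambda/test_transform_lambda.py
test/s3/test_bucket_policy.py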

For our use case, I will be writing test code for our intended ETL pipeline.

Testing our code with Pytest

Within the test/lambda folder in your terminal, run: pytest . or pytest test_basic_lambda.py
As mentioned earlier, pytest identifies the testing scripts with the “test_” prefix.

If the code breaks at any point, or if the result does not match our test case, pytest will report an error.
The number of “test_” functions within your scripts determines the number of test cases pytest will run.

Test coverage for our AWS Lambda

We typically don’t run pytest in a vacuum, because we also need to know how much of our code the test scripts actually exercised.

Run coverage run -m pytest <test_script> to run the tests and record coverage data.

It might appear that this only ran a pytest command, but follow up immediately with: coverage report

It tells us how many statements were tested, how many were missed, and the percentage coverage for each script. By industry standards, a threshold of 90% and above is typically expected.
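
If you want to enforce that threshold automatically, coverage can return a failing exit code whenever coverage drops below a minimum, which becomes handy once a CI/CD pipeline is involved:

coverage report --fail-under=90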

If you want a better breakdown of what is missing from the testing, you can run: coverage html
It creates an htmlcov folder within your project.

Ignore all the files within, except the index.html file. In a nutshell, coverage generated an HTML report; open the index.html file in your browser.

The report contains links that show exactly which code isn’t being tested at all.

Coverage is an impressive package that can help to identify specific statements within your AWS Lambda function for testing.

Pylint For Coding Standards

Last but not least, all data and software engineers need coding standards to ensure work can be shared among peers. It would be disastrous if no one except the author could read an AWS Lambda function.

pylint isn’t perfect because it covers only the surface of coding standards, and you will need to defer to your team’s engineering style guides as well. But as a start, pylint is great, as it uses PEP 8 as its style guide: https://www.python.org/dev/peps/pep-0008/
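
If a particular check clashes with your team’s style guide, pylint can be tuned through a .pylintrc file in the project root. A minimal sketch, where the disabled check is just an example:

[MESSAGES CONTROL]
disable = missing-module-docstring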

To use pylint, run: pylint <file name>.py
It generates a report of issues for us to fix.

Some things to take note of:

  • For some reason, pylint flagged an import error within my test script even though I’m referencing my AWS Lambda function correctly. This has to do with my local machine’s environment and nothing to do with missing imports.
    • Thus, pytest is a better indicator of code errors.
  • The context variable in an AWS Lambda function is rarely used because it contains metadata about the runtime environment the function runs on. However, you should leave it in your lambda_handler even if pylint flags a warning for the unused variable (see the snippet below).
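
One common way to keep the context parameter while silencing the warning is pylint’s inline pragma; the handler body here is illustrative:

def lambda_handler(event, context):  # pylint: disable=unused-argument
    """Keeps the Lambda-required signature without an unused-argument warning."""
    return {"statusCode": 200}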

In a nutshell

With these testing tools, we can catch errors in our AWS Lambda before they surface during an ETL process. In future tutorials, we will also test within our deployments to ensure the functionality and readiness of our pipeline.

Take note that our unit tests here don’t cover everything. We have not talked about the time taken for the AWS Lambda to complete its task, or other things like mocking AWS resources for testing. However, we will revisit this topic when we come to it.

Repo for reference: https://github.com/jonathan-moo/gefyra-cdk-demo