User Defined Functions (UDF)
    Introduction

This tutorial guides you through the process of creating user-defined functions. These functions facilitate the development of custom business logic, which can be stored as an API. That API can then serve as a building block for an application or as a tool for integrating information with other enterprise systems.

    Creating A User Defined Function

    Creating from a Template

    Step 1: Open the Hamburger Menu in the upper right corner.

    Step 2: Select the Dev Panel option from the menu.

Please note that this option is only accessible to users with a developer role or higher.

    Step 3: Locate and select the User Defined Notebook API. This is where the list of user-defined functions is housed.

In a fresh environment, no saved user-defined functions are present yet. This page also provides instructions on how to access ODBC within the notebook and how to execute it.

    Step 4: Download the template file. This file, a Jupyter notebook, includes an example of creating a user-defined function.

    Saving a New UDF

    This guide explains how to download a Jupyter notebook file, modify it, and upload it as a user-defined function.

    Downloading and Accessing the Notebook File

    Step 1: Download the notebook file as a template.

    Step 2: Open the terminal and start your Jupyter notebook.

    Step 3: Navigate to the location of the downloaded file. For this example, the file is in the Downloads folder.

    Modifying the Notebook File

Step 1: After opening the template, you will notice three distinct sections.

    Section 1: Initialize Parameters

The first section initializes parameters and carries the parameters tag. These defaults can be overridden by passing new values to the API, such as date parameters or different machines. In the example below, a, b, and c are all parameters initialized in the first cell.



    Section 2: Business Logic

The next section contains the business logic. The example here is simple, but in practice the logic can span as many cells as necessary.



    Section 3: API Response

    The final cell returns the output via the API response. 
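The screenshots of these cells are not reproduced in this article. As a rough sketch only, the three sections might look like the cells below; the names a, b, and c come from the text, while the concrete values and logic are assumptions chosen so that running with no parameters returns 2:

```python
# Cell 1 - tagged "parameters": default values, overridable by passing values to the API
a = 1
b = 1
c = 0

# Cell 2 - business logic (a toy example; real logic may span many cells)
result = a + b + c

# Cell 3 - the final cell returns the output via the API response
response = result
print(response)  # prints 2 with the default parameters
```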



Step 2: Modify the template as needed.

Step 3: Execute the code to see the expected result. In this example, without passing any parameters, the API response should be 2.

Step 4: Save your notebook.



    Uploading the Modified Notebook File

    Step 1: Return to the developer page and choose the saved file from the appropriate location.

    Step 2: Provide a name for the UDF. This will be the name of the API that's being called.

    Step 3: Provide a description for the function.

    Step 4: Select "upload". A confirmation message will appear indicating a successful upload.

    The new function is now visible, complete with its description and the time it was last saved, and ready to run!

    Running Your UDF

1. After uploading, an API request can be POSTed to {environment_name}.sightmachine.io/v1/udf/task/async with the following body format (parameters are optional):


      {
          "name": <notebook name>,
          "parameters": {
              <param>: <value>,
              ...
          }
      }
2. You will receive an acknowledgement of a successful submission along with a task_id. You can then poll for your results at {environment_name}.sightmachine.io/v1/udf/task/async/{task_id}.
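The two calls above can be scripted end to end. The sketch below uses only the Python standard library; the task_id field follows the acknowledgement described above, but authentication is omitted and the status field used for polling is an assumption, so check your environment's actual response shape before relying on it:

```python
import json
import time
import urllib.request

BASE_URL = "https://{environment_name}.sightmachine.io"  # substitute your environment name


def build_body(name, parameters=None):
    """Build the POST body in the format shown above (parameters are optional)."""
    body = {"name": name}
    if parameters:
        body["parameters"] = parameters
    return body


def post_json(url, body):
    """POST a JSON body and decode the JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def run_udf(name, parameters=None, poll_interval=2.0, timeout=120.0):
    """Submit a UDF task, then poll until a result (or timeout)."""
    ack = post_json(f"{BASE_URL}/v1/udf/task/async", build_body(name, parameters))
    task_id = ack["task_id"]  # per the acknowledgement described above
    deadline = time.time() + timeout
    while time.time() < deadline:
        with urllib.request.urlopen(f"{BASE_URL}/v1/udf/task/async/{task_id}") as resp:
            result = json.load(resp)
        if result.get("status") != "pending":  # hypothetical status field
            return result
        time.sleep(poll_interval)
    raise TimeoutError(f"UDF task {task_id} did not finish within {timeout}s")
```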

    Example UDF

    Data Quality

    The following UDF analyzes a given machine_type to find fields which have recently been all null or only a single static non-null value.

    Default Parameters

    The first executable cell of the notebook, containing the parameters tag, sets the default parameter values for the API.

    API_KEY = ""
    API_SECRET = ""
    machine_type = ""
    days = 5
    include_last_non_null = False
    schema_name = ""

    Imports

    Import packages required for your UDF.

    import sqlalchemy as sqla
    from smsdk import client
    from IPython.display import JSON

    NOTE:

    Only some packages are available for User Defined Functions.  See the list below for currently available packages.  If new packages are needed, please reach out to your customer success representative.

    Available Packages

    UDFs only support the following packages:

    • asttokens==2.4.1

    • attrs==21.4.0

    • backcall==0.2.0

    • boto3==1.26.67

    • botocore==1.29.67

    • cerberus==1.3.4

    • certifi==2023.7.22

    • charset-normalizer==3.0.1

    • comm==0.2.2

    • coverage==7.1.0

    • debugpy==1.5.1

    • decorator==5.1.1

    • deepdiff==6.3.0

    • dill==0.3.8

    • entrypoints==0.4

    • exceptiongroup==1.2.2

    • executing==2.0.1

    • git+https://github.com/sightmachine/sightmachine-sdk.git@0fa400d0158cbc76b678802590abf9471c71ea0e

    • google-api-core==1.22.2

    • google-cloud-core==1.4.1

    • google-cloud-storage==1.31.2

    • google-crc32c==1.1.0

    • google-resumable-media==1.2.0

    • googleapis-common-protos==1.52.0

    • google-auth==1.21.3

    • greenlet==0.4.14

    • gunicorn==22.0.0

    • idna==3.7

    • importlib-resources==5.10.2

    • iniconfig==2.0.0

    • ipykernel==6.9.2

    • ipython==7.32.0

    • ipython-genutils==0.2.0

    • ipywidgets==8.1.1

    • jedi==0.18.2

    • joblib==1.2.0

    • jupyter-client==7.1.2

    • jupyter-core==4.11.2

    • jupyterlab-widgets==3.0.11

    • Levenshtein==0.23.0

    • matplotlib==3.7.1

    • matplotlib-inline==0.1.3

    • mock==4.0.3

    • multiprocess==0.70.16

    • nest-asyncio==1.5.4

    • numpy==1.21.6

    • openai==0.27.2

    • packaging==23.0

    • pandas==1.1.5

    • parso==0.8.3

    • pathos==0.3.2

    • pexpect==4.8.0

    • pickleshare==0.7.5

    • platformdirs==4.2.2

    • plotly==5.18.0

    • pluggy==1.0.0

    • pox==0.3.4

    • ppft==1.7.6.8

    • prompt-toolkit==3.0.28

    • psutil==5.9.4

    • psycopg2==2.9.5

    • ptyprocess==0.7.0

    • pure-eval==0.2.3

    • pyOpenSSL==24.1.0

    • pyparsing==3.1.2

    • PyYAML==6.0

    • pyzmq==25.0.0

    • Pygments==2.11.2

    • pytest==7.2.1

    • python-dateutil==2.8.2

    • pytz==2022.7.1

    • requests==2.28.2

    • ruamel.yaml==0.18.6

    • scikit-learn==1.0.2

    • scipy==1.9.3

    • six==1.16.0

    • SQLAlchemy==1.4.32

    • stack-data==0.6.3

    • swifter==1.2.0

    • tenacity==8.5.0

    • tiktoken==0.3.2

    • tomli==2.0.1

    • tornado==6.1

    • traitlets==5.1.1

    • typing-extensions==4.4.0

    • urllib3==1.26.14

    • wcwidth==0.2.5

    • widgetsnbextension==4.0.11

    • zipp==3.7.0

    Validate Parameters

Validate the parameters; in particular, confirm that the supplied machine_type exists in the environment.

    tenant = ""
    cli = client.Client(tenant)
    cli.login('apikey', secret_id=API_SECRET, key_id=API_KEY)
    cli.select_db_schema(schema_name=schema_name)
    machine_types = cli.get_machine_type_names(clean_strings_out=False)
    if machine_type not in machine_types:
        raise RuntimeError(f"machine_type not found: {machine_type}")

    Setup ODBC Connection

    Connect to the DB using the provided API Key and Secret.

    connection_string = f"postgresql+psycopg2://{API_KEY}:{API_SECRET}@pgbouncer:6432/tenant_storage"
    engine = sqla.create_engine(connection_string)

    Perform Analysis

Run the query and identify columns whose recent values are either all null or a single static value.

table_name = f"cycle_{machine_type.lower()}"
    prep = engine.dialect.identifier_preparer
    # Bind :days outside the string literal (INTERVAL ':days days' relies on
    # fragile client-side substitution inside quotes); :days * INTERVAL '1 day'
    # binds the parameter safely.
    query = sqla.text(f"""
    SELECT column_analysis(:s_name, :t_name, :days) AS column_analysis, count(*) AS total_rows_checked
    FROM {prep.quote_schema(schema_name)}.{prep.quote_identifier(table_name)}
    WHERE "Cycle End Time" > NOW() - :days * INTERVAL '1 day'
    """)
    with engine.connect() as conn:
        res = conn.execute(query, {"s_name": schema_name, "t_name": table_name, "days": days})
        data = [dict(row) for row in res]
    
    null_fields = []
    static_fields = []
    column_analysis = data[0]["column_analysis"][0]
    for column_name, distinct_count in column_analysis.items():
        if distinct_count == 0:
            null_fields.append(column_name)
        elif distinct_count == 1:
            static_fields.append(column_name)

    Format Response

    The last executable cell of the notebook contains the response of the UDF.

    response = {
        schema_name: {
            machine_type: {
                "days": days,
                "total_rows_checked": data[0]["total_rows_checked"],
                "null_field_count": len(null_fields),
                "null_fields": null_fields,
                "static_field_count": len(static_fields),
                "static_fields": static_fields
            }
        }
    }
    JSON(response)