- Print
- DarkLight
User-Defined Functions (UDF)
- Print
- DarkLight
Introduction
This tutorial guides through the process of creating user-defined functions. These functions facilitate the development of custom business logic, which can be stored as an API. This API can then serve as a building block for an application or as a tool for integrating information with other enterprise systems.
Here is a training module that may help you get familiar with UDF
https://rise.articulate.com/share/-6mDXYtJ7VjLxIvMbRKdJaa0l3eEhOFU
Understanding the Framework
By understanding these two key concepts, you'll be better equipped to create and customize bespoke applications within the Sight Machine platform.
Bespoke Analytics Framework
What is it?
This framework provides a structured way to create custom applications within the Sight Machine platform. It uses a configuration-driven approach, meaning you define the application's structure and behavior through configuration files rather than writing extensive code.
Why is it important?
The Bespoke Analytics Framework simplifies the process of building custom applications, making it more accessible to users with less technical expertise. It also ensures consistency and maintainability across different applications.
User-Defined Function
What is it?
UDFs are custom functions that you can write to perform specific calculations or data manipulations within your bespoke applications. They allow you to tailor the analytics to your exact needs.
Why is it important?
UDFs provide flexibility and customization, enabling you to create bespoke applications that are tailored to your specific use cases and data requirements.
Environments
Having an understanding of the different environment application allows you to test and iterate on new applications without disrupting the live system. This helps to ensure the stability and reliability of your applications.
Test Environment: The development or testing environment is where you can create and experiment with new applications without affecting the production environment.
Live Environment: The production or live environment is where your live applications are deployed and used by end-users.
Creating a User-Defined Function: Local Development
Local Development involves developing UDFs on your own computer using a Jupyter Notebook. This approach gives you more flexibility and control over your development environment, but it also requires setting up additional tools and managing dependencies.
⚠️ Note: This workflow is recommended for Voltron 2.
Creating from a Template
Step 1: Open the Hamburger Menu in the upper right corner.
Step 2: Select the Dev Panel option from the menu.
Please note, this option is only accessible to users with a developer role or higher.
Step 3: Locate and select the User Defined Notebook API. This is where the list of user-defined functions is housed.
In this environment, no existing or saved user-defined functions are present. It provides instructions on how to access ODBC within the notebook and how to execute it.
Step 4: Download the template file. This file, a Jupyter notebook, includes an example of creating a user-defined function.
Downloading and Accessing the Notebook File
This guide explains how to download a Jupyter notebook file, modify it, and upload it as a user-defined function.
Step 1: Download the notebook file as a template.
Step 2: Open the terminal and start your Jupyter notebook.
Step 3: Navigate to the location of the downloaded file. For this example, the file is in the Downloads folder.
Modifying the UDF Template
Step 1: After opening the template, you will notice 3 distinct sections.
Section 1: Initialize Parameters
The first section initializes parameters and contains a parameter tag. These can be modified by passing in new values to the API, such as date parameters or different machines. In the example below, a, b, and c are all parameters initialized in the first cell.
Section 2: Business Logic
The next section contains the business logic. This is a simple example and in practice this can take up as many cells as is necessary.
Section 3: API Response
The final cell returns the output via the API response.
Step 1: Modify the template as needed.
Step 2: Execute the code to see the expected result. In this example, without passing any parameters, the API response should be 2.
Step 3: Save your notebook.
Uploading the Modified Notebook File
Step 1: Return to the developer page and choose the saved file from the appropriate location.
Step 2: Provide a name for the UDF. This will be the name of the API that's being called.
Step 3: Provide a description for the function.
Step 4: Select "upload". A confirmation message will appear indicating a successful upload.
The new function is now visible, complete with its description and the time it was last saved, and ready to run!
Creating a User-Defined Function: In-Platform Development
In-platform development provides a convenient and streamlined way to create User-Defined Functions (UDFs) directly within the Sight Machine platform. This approach eliminates the need for setting up a local development environment and simplifies the process of creating and managing UDFs.
By utilizing the in-platform development tools, you can efficiently write, test, and deploy UDFs without leaving the Sight Machine platform. This can significantly reduce development time and effort, allowing you to focus on creating powerful and customized analytics for your bespoke applications.
⚠️ Note: This workflow is recommended for Voltron 1.
Creating from a MA Notebook
Step 1: Open the Hamburger Menu in the upper right corner.
Step 2: Select the Dev Panel option from the menu.
Please note, this option is only accessible to users with a developer role or higher.
Step 3: Locate and select the MA Notebooks link. This is where notebooks can be developed and saved as UDFs.
Creating a New Notebook
The MA Notebooks page displays a list of all created notebooks and their current status, a standard feature for Jupyter notebooks.
To create a new notebook, follow these steps:
Step 1: Select the New notebook option from the header.
Step 2: Select Python (udf) as the notebook kernel.
After selecting, the new notebook appears in a separate tab, offering a familiar notebook experience.
Notebook Development and Testing
As in the local development template example, your notebook should contain three distinct sections:
Section 1: Initialize Parameters
The first section initializes parameters and contains a parameter tag. These can be modified by passing in new values to the API, such as date parameters or different machines. In the example below, a, b, and c are all parameters initialized in the first cell. As an alternative to defining parameters in the first cell, you can tag any cell with the parameters
tag.
Section 2: Business Logic
The next section contains the business logic. This is a simple example and in practice this can take up as many cells as is necessary.
Section 3: API Response
The final cell returns the output via the API response.
Saving the Notebook as a UDF
This tutorial outlines the process of saving a new user-defined function (UDF) within a notebook and uploading it to the UDF section.
Step 1: Finalize the new user-defined function within the notebook and select the Save as UDF option within the notebook.
Step 2: Assign a name to the UDF, such as "My Data Quality Test".
Step 3: Input a description for the UDF.
Step 4: Select upload. A confirmation message will appear upon successful upload.
Step 5: The UDF has been saved and you can navigate to the User Defined Notebooks API section within the /dev
page for confirmation.
Running Your UDF
After uploading, an api request can be POSTed to
{environment_name}.sightmachine.io/v1/udf/task/async
with the following body format (parameters are optional):{ "name": <notebook name>, "parameters": { <param>: <value>, ... } }
You will receive acknowledgement of a successful submission along with a task_id. You can then poll for your results at
{environment_name}.sightmachine.io/v1/udf/task/async/{task_id}
Debugging Your UDF
Step 1: Right-click anywhere on the Sight Machine environment and select "Inspect" or use the keyboard shortcut Ctrl+Shift+I (Windows/Linux) or Cmd+Option+I (Mac). Then, click on the "Network" tab in the DevTools window. Step 2: Click on the last UDF packet, then select "Response". Step 3: Identify the error by checking the data string below the [state: "SUCCESS"] item. This will directly identify the exact line of code within the notebook that's generating the Python error.
Example UDF
Data Quality
The following UDF analyzes a given machine_type to find fields which have recently been all null or only a single static non-null value.
Default Parameters
The first executable cell of the notebook, or the cell containing the parameters tag, sets the default parameter values for the API.
API_KEY = ""
API_SECRET = ""
machine_type = ""
days = 5
include_last_non_null = False
schema_name = ""
Imports
Import packages required for your UDF.
import sqlalchemy as sqla
from smsdk import client
from IPython.display import JSON
NOTE:
Only some packages are available for User Defined Functions. See the list below for currently available packages. If new packages are needed, please reach out to your customer success representative.
Available Packages
UDFs only support the following packages:
asttokens==2.4.1
attrs==21.4.0
backcall==0.2.0
boto3==1.26.67
botocore==1.29.67
cerberus==1.3.4
certifi==2023.7.22
charset-normalizer==3.0.1
comm==0.2.2
coverage==7.1.0
debugpy==1.5.1
decorator==5.1.1
deepdiff==6.3.0
dill==0.3.8
entrypoints==0.4
exceptiongroup==1.2.2
executing==2.0.1
git+https://github.com/sightmachine/sightmachine-sdk.git@0fa400d0158cbc76b678802590abf9471c71ea0e
google-api-core==1.22.2
google-cloud-core==1.4.1
google-cloud-storage==1.31.2
google-crc32c==1.1.0
google-resumable-media==1.2.0
googleapis-common-protos==1.52.0
google-auth==1.21.3
greenlet==0.4.14
gunicorn==22.0.0
idna==3.7
importlib-resources==5.10.2
iniconfig==2.0.0
ipykernel==6.9.2
ipython==7.32.0
ipython-genutils==0.2.0
ipywidgets==8.1.1
jedi==0.18.2
joblib==1.2.0
jupyter-client==7.1.2
jupyter-core==4.11.2
jupyterlab-widgets==3.0.11
Levenshtein==0.23.0
matplotlib==3.7.1
matplotlib-inline==0.1.3
mock==4.0.3
multiprocess==0.70.16
nest-asyncio==1.5.4
numpy==1.21.6
openai==0.27.2
packaging==23.0
pandas==1.1.5
parso==0.8.3
pathos==0.3.2
pexpect==4.8.0
pickleshare==0.7.5
platformdirs==4.2.2
plotly==5.18.0
pluggy==1.0.0
pox==0.3.4
ppft==1.7.6.8
prompt-toolkit==3.0.28
psutil==5.9.4
psycopg2==2.9.5
ptyprocess==0.7.0
pure-eval==0.2.3
pyOpenSSL==24.1.0
pyparsing==3.1.2
PyYAML==6.0
pyzmq==25.0.0
Pygments==2.11.2
pytest==7.2.1
python-dateutil==2.8.2
pytz==2022.7.1
requests==2.28.2
ruamel.yaml==0.18.6
scikit-learn==1.0.2
scipy==1.9.3
six==1.16.0
SQLAlchemy==1.4.32
stack-data==0.6.3
swifter==1.2.0
tenacity==8.5.0
tiktoken==0.3.2
tomli==2.0.1
tornado==6.1
traitlets==5.1.1
typing-extensions==4.4.0
urllib3==1.26.14
wcwidth==0.2.5
widgetsnbextension==4.0.11
zipp==3.7.0
Validate Parameters
Validate the parameters including machine_type.
tenant = ""
cli = client.Client(tenant)
cli.login('apikey', secret_id=API_SECRET, key_id=API_KEY)
cli.select_db_schema(schema_name=schema_name)
machine_types = cli.get_machine_type_names(clean_strings_out=False)
if machine_type not in machine_types:
raise RuntimeError(f"machine_type not found: {machine_type}")
Setup ODBC Connection
Connect to the DB using the provided API Key and Secret.
connection_string = f"postgresql+psycopg2://{API_KEY}:{API_SECRET}@pgbouncer:6432/tenant_storage"
engine = sqla.create_engine(connection_string)
Perform Analysis
Run the query and identify columns with null and static fields.
table_name = f"cycle_{machine_type.lower()}"
query = sqla.text(f"""
SELECT column_analysis(:s_name, :t_name, :days) AS column_analysis, count(*) AS total_rows_checked
FROM {engine.dialect.identifier_preparer.quote_schema(schema_name)}.{engine.dialect.identifier_preparer.quote_identifier(table_name)}
WHERE "Cycle End Time" > NOW() - INTERVAL ':days days'
""")
with engine.connect() as conn:
res = conn.execute(query, s_name=schema_name, t_name=table_name, days=days)
data = [dict(row) for row in res]
null_fields = []
static_fields = []
column_analysis = data[0]["column_analysis"][0]
for column_name, distinct_count in column_analysis.items():
if distinct_count == 0:
null_fields.append(column_name)
elif distinct_count == 1:
static_fields.append(column_name)
Format Response
The last executable cell of the notebook contains the response of the UDF.
response = {
schema_name: {
machine_type: {
"days": days,
"total_rows_checked": data[0]["total_rows_checked"],
"null_field_count": len(null_fields),
"null_fields": null_fields,
"static_field_count": len(static_fields),
"static_fields": static_fields
}
}
}
JSON(response)