User Defined Functions (UDF)
    Introduction

This tutorial guides you through the process of creating user-defined functions. These functions facilitate the development of custom business logic, which can be stored as an API. That API can then serve as a building block for an application or as a tool for integrating information with other enterprise systems.

    Creating A User Defined Function

    Creating from a Template

    Step 1: Open the Hamburger Menu in the upper right corner.

    Step 2: Select the Dev Panel option from the menu.

Please note that this option is only accessible to users with a developer role or higher.

    Step 3: Locate and select the User Defined Notebook API. This is where the list of user-defined functions is housed.

In a fresh environment, no saved user-defined functions are present yet. This page also provides instructions on how to access ODBC within the notebook and how to execute it.

    Step 4: Download the template file. This file, a Jupyter notebook, includes an example of creating a user-defined function.

    Saving a New UDF

    This guide explains how to download a Jupyter notebook file, modify it, and upload it as a user-defined function.

    Downloading and Accessing the Notebook File

    Step 1: Download the notebook file as a template.

    Step 2: Open the terminal and start your Jupyter notebook.

    Step 3: Navigate to the location of the downloaded file. For this example, the file is in the Downloads folder.

    Modifying the Notebook File

Step 1: After opening the template, you will notice three distinct sections.

    Section 1: Initialize Parameters

The first section initializes parameters and carries the parameters tag. These defaults can be overridden by passing new values to the API, such as date parameters or different machines. In the example below, a, b, and c are all parameters initialized in the first cell.



    Section 2: Business Logic

The next section contains the business logic. The example here is simple, but in practice the logic can span as many cells as necessary.



    Section 3: API Response

    The final cell returns the output via the API response. 
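The screenshots of these cells are not reproduced in this article. As a rough sketch only, the three sections might look like the cells below; the names a, b, and c come from the text, while the concrete values and logic are assumptions chosen so that running with no parameters returns 2:

```python
# Cell 1 - tagged "parameters": default values, overridable by passing values to the API
a = 1
b = 1
c = 0

# Cell 2 - business logic (a toy example; real logic may span many cells)
result = a + b + c

# Cell 3 - the final cell returns the output via the API response
response = result
print(response)  # prints 2 with the default parameters
```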



Step 2: Modify the template as needed.

Step 3: Execute the code to see the expected result. In this example, without passing any parameters, the API response should be 2.

Step 4: Save your notebook.



    Uploading the Modified Notebook File

    Step 1: Return to the developer page and choose the saved file from the appropriate location.

    Step 2: Provide a name for the UDF. This will be the name of the API that's being called.

    Step 3: Provide a description for the function.

    Step 4: Select "upload". A confirmation message will appear indicating a successful upload.

    The new function is now visible, complete with its description and the time it was last saved, and ready to run!

    Running Your UDF

1. After uploading, an API request can be POSTed to {environment_name}.sightmachine.io/v1/udf/task/async with the following body format (parameters are optional):


      {
          "name": <notebook name>,
          "parameters": {
              <param>: <value>,
              ...
          }
      }
2. You will receive an acknowledgement of a successful submission along with a task_id. You can then poll for your results at {environment_name}.sightmachine.io/v1/udf/task/async/{task_id}.
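The two calls above can be scripted end to end. The sketch below uses only the Python standard library; the task_id field follows the acknowledgement described above, but authentication is omitted and the status field used for polling is an assumption, so check your environment's actual response shape before relying on it:

```python
import json
import time
import urllib.request

BASE_URL = "https://{environment_name}.sightmachine.io"  # substitute your environment name


def build_body(name, parameters=None):
    """Build the POST body in the format shown above (parameters are optional)."""
    body = {"name": name}
    if parameters:
        body["parameters"] = parameters
    return body


def post_json(url, body):
    """POST a JSON body and decode the JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def run_udf(name, parameters=None, poll_interval=2.0, timeout=120.0):
    """Submit a UDF task, then poll until a result (or timeout)."""
    ack = post_json(f"{BASE_URL}/v1/udf/task/async", build_body(name, parameters))
    task_id = ack["task_id"]  # per the acknowledgement described above
    deadline = time.time() + timeout
    while time.time() < deadline:
        with urllib.request.urlopen(f"{BASE_URL}/v1/udf/task/async/{task_id}") as resp:
            result = json.load(resp)
        if result.get("status") != "pending":  # hypothetical status field
            return result
        time.sleep(poll_interval)
    raise TimeoutError(f"UDF task {task_id} did not finish within {timeout}s")
```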

    Example UDF

    Data Quality

    The following UDF analyzes a given machine_type to find fields which have recently been all null or only a single static non-null value.

    Default Parameters

    The first executable cell of the notebook, containing the parameters tag, sets the default parameter values for the API.

    API_KEY = ""
    API_SECRET = ""
    machine_type = ""
    days = 5
    include_last_non_null = False
    schema_name = ""

    Imports

    Import packages required for your UDF.

    import sqlalchemy as sqla
    from smsdk import client
    from IPython.display import JSON

    NOTE:

    Only some packages are available for User Defined Functions.  See the list below for currently available packages.  If new packages are needed, please reach out to your customer success representative.

    Available Packages

    UDFs only support the following packages:

    • asttokens==2.4.1

    • attrs==21.4.0

    • backcall==0.2.0

    • boto3==1.26.67

    • botocore==1.29.67

    • cerberus==1.3.4

    • certifi==2023.7.22

    • charset-normalizer==3.0.1

    • comm==0.2.2

    • coverage==7.1.0

    • debugpy==1.5.1

    • decorator==5.1.1

    • deepdiff==6.3.0

    • dill==0.3.8

    • entrypoints==0.4

    • exceptiongroup==1.2.2

    • executing==2.0.1

    • git+https://github.com/sightmachine/sightmachine-sdk.git@0fa400d0158cbc76b678802590abf9471c71ea0e

    • google-api-core==1.22.2

    • google-cloud-core==1.4.1

    • google-cloud-storage==1.31.2

    • google-crc32c==1.1.0

    • google-resumable-media==1.2.0

    • googleapis-common-protos==1.52.0

    • google-auth==1.21.3

    • greenlet==0.4.14

    • gunicorn==22.0.0

    • idna==3.7

    • importlib-resources==5.10.2

    • iniconfig==2.0.0

    • ipykernel==6.9.2

    • ipython==7.32.0

    • ipython-genutils==0.2.0

    • ipywidgets==8.1.1

    • jedi==0.18.2

    • joblib==1.2.0

    • jupyter-client==7.1.2

    • jupyter-core==4.11.2

    • jupyterlab-widgets==3.0.11

    • Levenshtein==0.23.0

    • matplotlib==3.7.1

    • matplotlib-inline==0.1.3

    • mock==4.0.3

    • multiprocess==0.70.16

    • nest-asyncio==1.5.4

    • numpy==1.21.6

    • openai==0.27.2

    • packaging==23.0

    • pandas==1.1.5

    • parso==0.8.3

    • pathos==0.3.2

    • pexpect==4.8.0

    • pickleshare==0.7.5

    • platformdirs==4.2.2

    • plotly==5.18.0

    • pluggy==1.0.0

    • pox==0.3.4

    • ppft==1.7.6.8

    • prompt-toolkit==3.0.28

    • psutil==5.9.4

    • psycopg2==2.9.5

    • ptyprocess==0.7.0

    • pure-eval==0.2.3

    • pyOpenSSL==24.1.0

    • pyparsing==3.1.2

    • PyYAML==6.0

    • pyzmq==25.0.0

    • Pygments==2.11.2

    • pytest==7.2.1

    • python-dateutil==2.8.2

    • pytz==2022.7.1

    • requests==2.28.2

    • ruamel.yaml==0.18.6

    • scikit-learn==1.0.2

    • scipy==1.9.3

    • six==1.16.0

    • SQLAlchemy==1.4.32

    • stack-data==0.6.3

    • swifter==1.2.0

    • tenacity==8.5.0

    • tiktoken==0.3.2

    • tomli==2.0.1

    • tornado==6.1

    • traitlets==5.1.1

    • typing-extensions==4.4.0

    • urllib3==1.26.14

    • wcwidth==0.2.5

    • widgetsnbextension==4.0.11

    • zipp==3.7.0

    Validate Parameters

Validate the parameters; in particular, confirm that the supplied machine_type exists in the environment.

    tenant = ""
    cli = client.Client(tenant)
    cli.login('apikey', secret_id=API_SECRET, key_id=API_KEY)
    cli.select_db_schema(schema_name=schema_name)
    machine_types = cli.get_machine_type_names(clean_strings_out=False)
    if machine_type not in machine_types:
        raise RuntimeError(f"machine_type not found: {machine_type}")

    Setup ODBC Connection

    Connect to the DB using the provided API Key and Secret.

    connection_string = f"postgresql+psycopg2://{API_KEY}:{API_SECRET}@pgbouncer:6432/tenant_storage"
    engine = sqla.create_engine(connection_string)

    Perform Analysis

Run the query and identify columns whose recent values are either all null or a single static value.

table_name = f"cycle_{machine_type.lower()}"
    prep = engine.dialect.identifier_preparer
    # Bind :days outside the string literal (INTERVAL ':days days' relies on
    # fragile client-side substitution inside quotes); :days * INTERVAL '1 day'
    # binds the parameter safely.
    query = sqla.text(f"""
    SELECT column_analysis(:s_name, :t_name, :days) AS column_analysis, count(*) AS total_rows_checked
    FROM {prep.quote_schema(schema_name)}.{prep.quote_identifier(table_name)}
    WHERE "Cycle End Time" > NOW() - :days * INTERVAL '1 day'
    """)
    with engine.connect() as conn:
        res = conn.execute(query, {"s_name": schema_name, "t_name": table_name, "days": days})
        data = [dict(row) for row in res]
    
    null_fields = []
    static_fields = []
    column_analysis = data[0]["column_analysis"][0]
    for column_name, distinct_count in column_analysis.items():
        if distinct_count == 0:
            null_fields.append(column_name)
        elif distinct_count == 1:
            static_fields.append(column_name)

    Format Response

    The last executable cell of the notebook contains the response of the UDF.

    response = {
        schema_name: {
            machine_type: {
                "days": days,
                "total_rows_checked": data[0]["total_rows_checked"],
                "null_field_count": len(null_fields),
                "null_fields": null_fields,
                "static_field_count": len(static_fields),
                "static_fields": static_fields
            }
        }
    }
    JSON(response)