Custom Parser

Creating a custom parser

In FactoryTX, a FileReceiver uses a Parser to convert file contents into structured data for the transform and transmit components to process. FactoryTX has a few built-in parsers, and a list of available parsers and their configuration options can be found in the Parsers Configurations section. If none of the available parsers fulfill your needs, you’ll need to create your own.

Defining a Parser

Your custom parser should inherit from the base Parser class and define the following methods:

__init__(config, root_config): This method initializes the parser instance. It accepts two parameters:
- config: The configuration dictionary of the parser from the FactoryTX config. The clean method is used to validate and normalize this configuration, and the parser will only be initialized if there is no ValidationError.
- root_config: The root configuration dictionary, which can be used to access global settings (e.g. Sight Machine cloud credentials). This is the FactoryTX config.
clean(config, root_config): This method validates and normalizes a configuration dictionary. It validates that the configuration it receives is well-formed and will be accepted by the constructor. The method may also modify the configuration, such as inserting default values, and changes made to the config will persist in the configuration passed to the constructor. This method returns a list of ValidationMessage objects.
process(file_entry, state, local_path): This method processes a file and converts it into a pandas DataFrame. It accepts three parameters:
- file_entry: metadata about the file, e.g. its name, path on the remote server, last modification time, etc.
- state: None if the file has not been previously processed, or the state dictionary returned the last time this file was processed.
- local_path: path to a temporary file containing the file contents on the local disk. This file will be removed once the file has been parsed.

It returns a tuple of a DataFrame containing the data extracted from the file, and a JSON-serializable state dictionary.
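The three methods described above can be sketched as follows. This is a minimal illustration, not FactoryTX's own code: the Parser class below is a local stand-in for the real base class, PipeDelimitedParser, its delimiter option, and the rows_seen state key are hypothetical, clean is assumed to be a classmethod, and plain strings stand in for ValidationMessage objects.

```python
import os
import tempfile
from typing import Any, Dict, List, Optional, Tuple

import pandas as pd


class Parser:
    """Local stand-in for the FactoryTX base Parser class (assumption)."""


class PipeDelimitedParser(Parser):
    """Hypothetical parser for single-character-delimited text files."""

    def __init__(self, config: Dict[str, Any], root_config: Dict[str, Any]) -> None:
        # config is expected to have been normalized by clean() already,
        # so the default delimiter is guaranteed to be present.
        self.delimiter = config["delimiter"]
        self.root_config = root_config

    @classmethod
    def clean(cls, config: Dict[str, Any], root_config: Dict[str, Any]) -> List[str]:
        # Insert a default value; the change persists in the caller's dict.
        config.setdefault("delimiter", "|")
        messages: List[str] = []  # strings stand in for ValidationMessage objects
        delimiter = config["delimiter"]
        if not isinstance(delimiter, str) or len(delimiter) != 1:
            messages.append("delimiter must be a single character")
        return messages

    def process(
        self,
        file_entry: Any,
        state: Optional[Dict[str, Any]],
        local_path: str,
    ) -> Tuple[pd.DataFrame, Dict[str, Any]]:
        # Read the temporary local copy of the file into a DataFrame.
        frame = pd.read_csv(local_path, sep=self.delimiter)
        # The returned state must be JSON-serializable.
        new_state = {"rows_seen": len(frame)}
        return frame, new_state


def parse_sample() -> Tuple[pd.DataFrame, Dict[str, Any]]:
    """Write a small sample file, then clean, construct, and process."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.txt")
        with open(path, "w") as handle:
            handle.write("station|temp\nA|20.5\nB|21.0\n")
        config: Dict[str, Any] = {}
        assert PipeDelimitedParser.clean(config, {}) == []
        parser = PipeDelimitedParser(config, {})
        return parser.process(None, None, path)
```

Calling parse_sample() runs the same sequence the text describes: the configuration is cleaned first, the parser is constructed from the cleaned configuration, and process returns the DataFrame together with the new state.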

As mentioned in the Creating a custom transform tutorial, you’ll need to include a schema that can be applied to your parser. This schema defines what properties can be configured for your parser and is what should be used in the clean method for validation.
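To illustrate how a schema can drive the clean method, here is a simplified sketch. It does not use FactoryTX's actual schema format or validation helpers: SCHEMA is a hypothetical dictionary of allowed properties, and the ValidationMessage dataclass with its level, path, and message fields is a local stand-in whose shape is an assumption.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class ValidationMessage:
    """Local stand-in for FactoryTX's ValidationMessage (fields are assumed)."""
    level: str
    path: str
    message: str


# Hypothetical schema: each configurable property with its type and default.
SCHEMA: Dict[str, Dict[str, Any]] = {
    "delimiter": {"type": str, "default": "|"},
    "skip_rows": {"type": int, "default": 0},
}


def clean(config: Dict[str, Any], root_config: Dict[str, Any]) -> List[ValidationMessage]:
    """Validate config against SCHEMA and fill in defaults in place."""
    messages: List[ValidationMessage] = []
    # Reject properties the schema does not define.
    for key in config:
        if key not in SCHEMA:
            messages.append(ValidationMessage("error", key, "unknown property"))
    # Insert defaults and check the type of every known property.
    for key, rule in SCHEMA.items():
        value = config.setdefault(key, rule["default"])
        if not isinstance(value, rule["type"]):
            expected = rule["type"].__name__
            messages.append(ValidationMessage("error", key, f"expected {expected}"))
    return messages
```

Because defaults are written into the same dictionary that is later handed to the constructor, the constructor can read every property without checking whether it is present.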

Stateful processing

When a receiver passes a file to the parser, the receiver includes the current state of the file (or None if the file hasn’t been parsed). The state is a JSON-serializable dictionary that includes file metadata such as the last time the file was processed. When the parser has finished converting the file, it passes back the file’s new state to the receiver for storage in its StateStore.

Stateful processing can be used to implement incremental parsing, which is especially handy for files that grow over time as new data is appended. For example, the Excel Parser tracks and refers to a file’s row count. If new rows have been added, the parser will only process the new data and pass it along the data pipeline.
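The row-count idea can be sketched with a small, dependency-free example. This is not the Excel Parser's implementation: the function name process_incrementally and the row_count state key are hypothetical, the input is assumed to be a CSV file with a header row, and the new rows are returned as a list of dictionaries where a real parser would return a DataFrame.

```python
import csv
import os
import tempfile
from typing import Any, Dict, List, Optional, Tuple


def process_incrementally(
    local_path: str, state: Optional[Dict[str, Any]]
) -> Tuple[List[Dict[str, str]], Dict[str, Any]]:
    """Return only the rows added since the previous call, plus the new state."""
    # state is None the first time a file is seen.
    rows_done = state["row_count"] if state else 0
    with open(local_path, newline="") as handle:
        rows = list(csv.DictReader(handle))
    # Design choice for this sketch: if the file shrank, assume it was
    # replaced and start again from the first row.
    if len(rows) < rows_done:
        rows_done = 0
    new_rows = rows[rows_done:]
    new_state = {"row_count": len(rows)}  # JSON-serializable
    return new_rows, new_state


# Usage: parse a file, append one row, then parse again with the saved state.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "readings.csv")
    with open(path, "w", newline="") as handle:
        handle.write("id,value\n1,a\n2,b\n")
    first_rows, saved_state = process_incrementally(path, None)

    with open(path, "a", newline="") as handle:
        handle.write("3,c\n")
    second_rows, saved_state = process_incrementally(path, saved_state)
```

On the second call only the appended row is returned, because the saved state records that two rows had already been processed.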
