Custom Pipelines

Pipelines are Python scripts. Each one contains a set of instructions that are executed in a fixed order, like the stages of a pipe, to perform a code analysis.

  • A pipeline is a Python class that lives in a Python module as a .py file.

  • A pipeline class always inherits from the Pipeline base class (see Pipeline Base Class), or from another existing pipeline class, such as one of the Built-in Pipelines.

  • A pipeline defines its sequence of steps, that is, the order in which the steps are executed, using the steps classmethod.

See Pipelines for more details.

Pipeline registration

Built-in pipelines are located in the scanpipe/pipelines/ directory and are registered during the ScanCode.io installation.

Custom pipelines, in contrast, are added as Python .py files in the directories defined in the SCANCODEIO_PIPELINES_DIRS setting and are registered at runtime.
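
The idea of runtime registration can be sketched with the standard library. The snippet below is not ScanCode.io's actual registration code: the Pipeline class, the pipeline_stub module name, and the find_pipeline_classes helper are hypothetical stand-ins, used only to show how pipeline classes could be discovered from the .py files in a directory such as one listed in SCANCODEIO_PIPELINES_DIRS.

```python
import importlib.util
import inspect
import sys
import tempfile
import types
from pathlib import Path


class Pipeline:
    """Hypothetical stand-in for scanpipe.pipelines.Pipeline."""


# Expose the stand-in under an importable name so the demo file can import it,
# much like a real pipeline file imports from scanpipe.pipelines.
stub = types.ModuleType("pipeline_stub")
stub.Pipeline = Pipeline
sys.modules["pipeline_stub"] = stub


def find_pipeline_classes(directory):
    """Import every .py file in `directory` and collect Pipeline subclasses."""
    found = {}
    for path in sorted(Path(directory).glob("*.py")):
        spec = importlib.util.spec_from_file_location(path.stem, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        for name, obj in inspect.getmembers(module, inspect.isclass):
            # Keep subclasses only, not the imported base class itself.
            if issubclass(obj, Pipeline) and obj is not Pipeline:
                found[name] = obj
    return found


with tempfile.TemporaryDirectory() as pipelines_dir:
    source = (
        "from pipeline_stub import Pipeline\n"
        "\n"
        "class MyPipeline(Pipeline):\n"
        "    pass\n"
    )
    Path(pipelines_dir, "my_pipeline.py").write_text(source)
    registry = find_pipeline_classes(pipelines_dir)

print(sorted(registry))
```

The temporary directory plays the role of a custom pipelines directory; in a real setup, the directory would be a persistent path named in the setting.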

Create a Pipeline

Create a new Python file, my_pipeline.py, and make sure the full path of the directory that contains it is included in the SCANCODEIO_PIPELINES_DIRS setting.

from scanpipe.pipelines import Pipeline

class MyPipeline(Pipeline):

    @classmethod
    def steps(cls):
        return (
            cls.step1,
            cls.step2,
        )

    def step1(self):
        pass

    def step2(self):
        pass
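
To see why the order returned by steps matters, here is a minimal, self-contained sketch. The Pipeline class and its run_in_order method are hypothetical stand-ins written for this illustration, not the real scanpipe.pipelines.Pipeline; the sketch assumes only that steps are called one after another in the order given.

```python
class Pipeline:
    """Hypothetical stand-in for scanpipe.pipelines.Pipeline."""

    @classmethod
    def steps(cls):
        raise NotImplementedError

    def run_in_order(self):
        # Call each step in the order returned by steps() and record its name.
        executed = []
        for step in self.steps():
            step(self)
            executed.append(step.__name__)
        return executed


class MyPipeline(Pipeline):
    @classmethod
    def steps(cls):
        return (
            cls.step1,
            cls.step2,
        )

    def step1(self):
        # State set on the instance is visible to later steps.
        self.data = ["collected"]

    def step2(self):
        self.data.append("analyzed")


pipeline = MyPipeline()
order = pipeline.run_in_order()
print(order)          # ['step1', 'step2']
print(pipeline.data)  # ['collected', 'analyzed']
```

Because step2 relies on the attribute that step1 creates, swapping the two entries in steps would make the run fail, which is the practical meaning of the execution order.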

Tip

You can view the scanpipe/pipelines/ directory for more pipeline examples.

Modify Existing Pipelines

Existing pipelines are flexible and can be reused as a base for custom pipelines. For instance, you can override existing steps, add new ones, or remove any of them.

from scanpipe.pipelines.scan_codebase import ScanCodebase

class MyCustomScan(ScanCodebase):

    @classmethod
    def steps(cls):
        return (
            # Original steps from the ScanCodebase pipeline
            cls.copy_inputs_to_codebase_directory,
            cls.extract_archives,
            cls.collect_and_create_codebase_resources,
            cls.flag_empty_files,
            cls.flag_ignored_resources,
            cls.scan_for_application_packages,
            cls.scan_for_files,

            # My extra steps
            cls.extra_step1,
            cls.extra_step2,
        )

    def extra_step1(self):
        pass

    def extra_step2(self):
        pass
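
Removing and overriding steps follow the same pattern as adding them. The sketch below uses a hypothetical, heavily reduced stand-in for ScanCodebase (three steps that just return strings) so that it runs on its own; the subclass name and the summarize step are also invented for the illustration.

```python
class ScanCodebase:
    """Hypothetical, reduced stand-in for the built-in ScanCodebase pipeline."""

    @classmethod
    def steps(cls):
        return (
            cls.copy_inputs_to_codebase_directory,
            cls.extract_archives,
            cls.scan_for_files,
        )

    def copy_inputs_to_codebase_directory(self):
        return "copied"

    def extract_archives(self):
        return "extracted"

    def scan_for_files(self):
        return "scanned"


class ScanWithoutExtraction(ScanCodebase):
    @classmethod
    def steps(cls):
        # Remove a step by leaving it out of the returned tuple.
        return (
            cls.copy_inputs_to_codebase_directory,
            cls.scan_for_files,
            cls.summarize,
        )

    def scan_for_files(self):
        # Override a step: reuse the parent's behavior and extend it.
        return super().scan_for_files() + " (custom)"

    def summarize(self):
        # A new step that only exists in this subclass.
        return "summarized"


pipeline = ScanWithoutExtraction()
step_names = [step.__name__ for step in pipeline.steps()]
results = [step(pipeline) for step in pipeline.steps()]
print(step_names)
print(results)
```

Since steps refers to cls.scan_for_files, the subclass's override is picked up automatically; no change to the tuple is needed to swap in the new behavior.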

Custom Pipeline Example

The example below shows a custom pipeline that is based on the built-in Scan Codebase pipeline with an extra reporting step.

Add the following code snippet to a Python file and register the path of the file’s directory in the SCANCODEIO_PIPELINES_DIRS setting.

from collections import defaultdict

from jinja2 import Template

from scanpipe.pipelines.scan_codebase import ScanCodebase


class ScanAndReport(ScanCodebase):
    """
    Runs the ScanCodebase built-in pipeline steps and generates a licenses report.
    """

    @classmethod
    def steps(cls):
        return ScanCodebase.steps() + (
            cls.report_licenses_with_resources,
        )

    # Set to True to recursively extract archives nested inside other archives.
    extract_recursively = False

    # See https://jinja.palletsprojects.com/en/3.0.x/templates/ for documentation
    report_template = """
    {% for matched_text, paths in resources.items() -%}
        {{ matched_text }}

        {% for path in paths -%}
            {{ path }}
        {% endfor %}

    {% endfor %}
    """

    def report_licenses_with_resources(self):
        """
        Retrieves codebase resources and generates a licenses report file using
        a Jinja template.
        """
        resources = self.project.codebaseresources.has_license_detections()

        resources_by_matched_text = defaultdict(list)
        for resource in resources:
            for detection_data in resource.license_detections:
                for match in detection_data.get("matches", []):
                    matched_text = match.get("matched_text")
                    resources_by_matched_text[matched_text].append(resource.path)

        template = Template(self.report_template, lstrip_blocks=True, trim_blocks=True)
        report_stream = template.stream(resources=resources_by_matched_text)
        report_file = self.project.get_output_file_path("license-report", "txt")
        report_stream.dump(str(report_file))
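
The grouping logic inside report_licenses_with_resources can be tried in isolation. The sketch below is an assumption-laden simplification: plain dictionaries stand in for codebase resource objects, the sample paths and matched texts are made up, and group_paths_by_matched_text is a hypothetical helper name. Only the nesting of license_detections, matches, and matched_text is taken from the pipeline code above.

```python
from collections import defaultdict


def group_paths_by_matched_text(resources):
    """Map each matched license text to the resource paths where it was found."""
    grouped = defaultdict(list)
    for resource in resources:
        for detection_data in resource["license_detections"]:
            # A detection without a "matches" key contributes nothing.
            for match in detection_data.get("matches", []):
                grouped[match.get("matched_text")].append(resource["path"])
    return dict(grouped)


# Made-up sample data in the assumed shape.
sample_resources = [
    {
        "path": "src/a.py",
        "license_detections": [{"matches": [{"matched_text": "MIT License"}]}],
    },
    {
        "path": "src/b.py",
        "license_detections": [
            {
                "matches": [
                    {"matched_text": "MIT License"},
                    {"matched_text": "Apache-2.0"},
                ]
            }
        ],
    },
    {"path": "README", "license_detections": [{}]},
]

report = group_paths_by_matched_text(sample_resources)
print(report)
```

The resulting dictionary has the same shape as the resources variable passed to the Jinja template: keys are matched texts and values are lists of paths.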

Pipeline Packaging

Once you have created a custom pipeline, you’ll want to package it as a Python module for easier distribution and reuse. See the Packaging Python Projects tutorial at PyPA for standard packaging instructions.

After you have packaged your own custom pipeline successfully, you need to specify the entry point of the pipeline in the setup.cfg file.

[options.entry_points]
scancodeio_pipelines =
    pipeline_name = pipeline_module:Pipeline_class

Note

Remember to replace pipeline_module with the name of the Python module containing your custom pipeline, and Pipeline_class with the name of your pipeline class.
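
The module:class notation used in the entry point can be explored with the standard library's importlib.metadata. This is only a sketch of how such a value resolves to a Python object; it uses collections:OrderedDict as a placeholder target because a real pipeline module is not importable here, and the entry point name is made up.

```python
from collections import OrderedDict
from importlib.metadata import EntryPoint

# Build an entry point by hand. In a real installation, this information
# comes from the metadata of the installed package.
entry_point = EntryPoint(
    name="example_name",              # made-up name for the illustration
    value="collections:OrderedDict",  # placeholder for pipeline_module:Pipeline_class
    group="scancodeio_pipelines",
)

# load() imports the module before the colon and returns the attribute after it.
loaded = entry_point.load()
print(entry_point.name, loaded is OrderedDict)
```

If the module path or class name in the value is wrong, load() raises an import or attribute error, which makes this a quick way to check an entry point string before packaging.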

Pipeline Packaging Example

The example below shows a standard pipeline packaging procedure for the custom pipeline created in Custom Pipeline Example.

A typical directory structure for the Python package would be:

.
├── CHANGELOG.rst
├── LICENSE
├── MANIFEST.in
├── pyproject.toml
├── README.rst
├── setup.cfg
├── setup.py
└── src
    └── scancodeio_scan_and_report_pipeline
        ├── __init__.py
        └── pipelines
            ├── __init__.py
            └── scan_and_report.py

Add the following code snippet to your setup.cfg file and specify the entry point to the pipeline under the [options.entry_points] section.

[metadata]
license_files =
    LICENSE
    CHANGELOG.rst

name = scancodeio_scan_and_report_pipeline
author = nexB. Inc. and others
author_email = info@aboutcode.org
license = Apache-2.0

# description must be on ONE line https://github.com/pypa/setuptools/issues/1390
description =  Generates a licenses report file from a template in ScanCode.io
long_description = file:README.rst
url = https://github.com/nexB/scancode.io
classifiers =
    Development Status :: 4 - Beta
    Intended Audience :: Developers
    Programming Language :: Python :: 3
    Programming Language :: Python :: 3 :: Only
keywords =
    scancodeio
    pipelines

[options]
package_dir=
    =src
packages=find:
include_package_data = true
zip_safe = false
python_requires = >=3.10
setup_requires = setuptools_scm[toml] >= 4

[options.packages.find]
where=src

[options.entry_points]
scancodeio_pipelines =
    pipeline_name = scancodeio_scan_and_report_pipeline.pipelines.scan_and_report:ScanAndReport

Tip

Take a look at the Google License Classifier pipeline for ScanCode.io for a complete example of packaging a custom tool as a pipeline.

Pipeline Publishing to PyPI

After successfully packaging a pipeline, you may consider distributing it as a plugin via PyPI. Make sure the directory structure is similar to the one in the Pipeline Packaging Example and that all package files are correctly configured.

Tip

See the Python packaging tutorial at PyPA for a detailed setup guide.

The next step is to generate the distribution archives for the package. Make sure you have the latest version of build installed on your system.

pip install --upgrade build

Now run the following command from within the same directory where the pyproject.toml is located:

python -m build

Once the build completes, the dist/ directory should contain two files, one with the .tar.gz extension and one with the .whl extension.

Note

Remember to create an account on PyPI before uploading your distribution archive to PyPI.

You can use twine to upload the package to PyPI. To install twine, run the following command:

pip install twine

Finally, you can upload your package to PyPI with the following command:

twine upload dist/*

Once successfully uploaded, your pipeline package should be viewable on PyPI under the name specified in the [metadata] section of your setup.cfg file.

To make your pipeline available in your instance of ScanCode.io, you need to install the package from PyPI. For example, to install the package described in the Pipeline Packaging Example, run:

bin/pip install scancodeio_scan_and_report_pipeline
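
After installation, one way to check that the entry point is visible is to list the scancodeio_pipelines group with importlib.metadata. This is a sketch, not a ScanCode.io command: list_pipeline_entry_points is a hypothetical helper, and it is assumed that the script runs with the same Python environment that bin/pip installs into.

```python
from importlib.metadata import entry_points


def list_pipeline_entry_points(group="scancodeio_pipelines"):
    """Return sorted (name, value) pairs for the entry points in `group`."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10 and later: filter with select().
        selected = eps.select(group=group)
    else:
        # Older Python versions return a dict keyed by group name.
        selected = eps.get(group, [])
    return sorted((ep.name, ep.value) for ep in selected)


for name, value in list_pipeline_entry_points():
    print(f"{name} -> {value}")
```

An empty result means that no installed package in that environment declares an entry point in the group, which usually points to the package being installed into a different environment or to a typo in the group name.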