Built-in Pipelines

Pipelines are Python scripts that perform code analysis by executing a sequence of steps. ScanCode.io offers the following built-in pipelines:

Pipeline Base Class

class scanpipe.pipelines.Pipeline

Base class for all pipelines.

__init__(run)

Load the Run and Project instances.

classmethod get_steps()

Raise a deprecation warning when the steps are defined as a tuple instead of a classmethod.

classmethod get_doc()

Get the doc string of this pipeline.

classmethod get_graph()

Return a graph of steps.
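One plausible shape for such a graph is a list with one entry per step, holding the step name and its docstring. The sketch below assumes that shape; the function name describe_steps and the two sample steps are hypothetical and not part of scanpipe.

```python
from inspect import getdoc


def describe_steps(steps):
    """Return one {"name", "doc"} entry per step, in the given order."""
    return [{"name": step.__name__, "doc": getdoc(step)} for step in steps]


# Hypothetical steps used only for illustration.
def extract_archives():
    """Extract archives."""


def scan_for_files():
    """Scan files."""


graph = describe_steps([extract_archives, scan_for_files])
```

Keeping the entries in a list preserves the execution order, which is what a reader of the graph usually needs.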

classmethod get_info()

Get a dictionary of combined information data about this pipeline.

classmethod get_summary()

Get the doc string summary.

log(message)

Log the given message to the current module logger and Run instance.

execute()

Execute each step in the order defined on this pipeline class.
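To make the ordering concrete, here is a minimal sketch of a class that declares its steps in a classmethod and runs them one after another. MiniPipeline, its step names, the log format, and the return codes are all assumptions made for illustration; this is not the scanpipe base class.

```python
class MiniPipeline:
    """Hypothetical stand-in for a pipeline; not the scanpipe base class."""

    def __init__(self):
        self.messages = []

    @classmethod
    def steps(cls):
        # Steps are yielded in the order they should run.
        yield cls.collect
        yield cls.scan

    def log(self, message):
        # A real pipeline would also write to a logger and the Run record.
        self.messages.append(message)

    def collect(self):
        self.log("collecting")

    def scan(self):
        self.log("scanning")

    def execute(self):
        # Run each step in declaration order and stop at the first failure.
        for step in self.steps():
            self.log(f"Step [{step.__name__}] starting")
            try:
                step(self)
            except Exception as error:
                self.log(f"Step [{step.__name__}] failed: {error}")
                return 1
            self.log(f"Step [{step.__name__}] completed")
        return 0
```

Calling MiniPipeline().execute() runs collect and then scan, recording a start and completion message around each one.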

add_error(error)

Create a ProjectError record on the current project.

save_errors(*exceptions)

Context manager to save specified exceptions as ProjectError in the database.

Example in a Pipeline step:

    with self.save_errors(rootfs.DistroNotFound):
        rootfs.scan_rootfs_for_system_packages(self.project, rfs)
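How such a context manager might work can be sketched with contextlib. The ErrorRecorder class, the in-memory errors list standing in for ProjectError records, and the local DistroNotFound exception are hypothetical; the actual scanpipe implementation may differ.

```python
from contextlib import contextmanager


class DistroNotFound(Exception):
    """Hypothetical exception used only for this illustration."""


class ErrorRecorder:
    def __init__(self):
        self.errors = []  # stands in for ProjectError records

    def add_error(self, error):
        self.errors.append(f"{type(error).__name__}: {error}")

    @contextmanager
    def save_errors(self, *exceptions):
        # Only the listed exception types are caught and recorded;
        # anything else propagates to the caller.
        try:
            yield
        except exceptions as error:
            self.add_error(error)


recorder = ErrorRecorder()
with recorder.save_errors(DistroNotFound):
    raise DistroNotFound("no os-release file")
```

After the with block, recorder.errors holds one entry and execution continues normally, so a step can record an expected failure without aborting the whole pipeline.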

Deploy To Develop

class scanpipe.pipelines.deploy_to_develop.DeployToDevelop

Relate deploy and develop code trees.

This pipeline is expecting 2 archive files with “from-” and “to-” filename prefixes as inputs:

- “from-[FILENAME]” archive containing the development source code
- “to-[FILENAME]” archive containing the deployment compiled code

get_inputs()

Locate the from and to archives.
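Selecting inputs by filename prefix can be sketched with pathlib. The function name split_from_to_inputs and the choice to raise FileNotFoundError when either side is missing are assumptions, not the pipeline's actual behavior.

```python
from pathlib import Path


def split_from_to_inputs(input_dir):
    """Return (from_files, to_files) based on filename prefixes."""
    input_dir = Path(input_dir)
    from_files = sorted(input_dir.glob("from-*"))
    to_files = sorted(input_dir.glob("to-*"))
    if not from_files or not to_files:
        raise FileNotFoundError("Both from- and to- input files are required.")
    return from_files, to_files
```

Files that carry neither prefix are simply left out of both lists.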

extract_inputs_to_codebase_directory()

Extract input files to the project’s codebase/ directory.

extract_archives_in_place()

Recursively extract the from* and to* archives in place with extractcode.

collect_and_create_codebase_resources()

Collect and create codebase resources.

flag_empty_and_ignored_files()

Flag empty and ignored files using names and extensions.
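A name-and-extension check of this kind can be sketched with glob-style patterns. The patterns and the status strings below are hypothetical examples chosen for illustration; they are not the values the pipeline uses.

```python
from fnmatch import fnmatchcase

# Hypothetical patterns chosen for illustration only.
IGNORED_PATTERNS = ("*.pyc", "*.log", ".DS_Store", "Thumbs.db")


def flag_for(name, size):
    """Return a status flag for a file, or None when it should be kept."""
    if size == 0:
        return "ignored-empty-file"  # hypothetical status value
    if any(fnmatchcase(name, pattern) for pattern in IGNORED_PATTERNS):
        return "ignored-pattern"  # hypothetical status value
    return None
```

fnmatchcase is used instead of fnmatch so that matching is case-sensitive on every platform.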

map_checksum()

Map using SHA1 checksum.
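The idea behind checksum mapping is that two files with the same SHA1 digest have the same content. A minimal sketch, assuming file contents are available in memory as {path: bytes} dictionaries (an assumption made to keep the example self-contained; map_by_checksum is a hypothetical name):

```python
import hashlib
from collections import defaultdict


def sha1_of(data):
    return hashlib.sha1(data).hexdigest()


def map_by_checksum(from_files, to_files):
    """Map each to/ path to the from/ paths with identical content."""
    # Index the from/ side once so each to/ file is a dictionary lookup.
    from_index = defaultdict(list)
    for path, data in from_files.items():
        from_index[sha1_of(data)].append(path)

    return {
        path: from_index[sha1_of(data)]
        for path, data in to_files.items()
        if sha1_of(data) in from_index
    }
```

Files on the to/ side with no identical counterpart are omitted from the result and remain candidates for the later mapping steps.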

find_java_packages()

Find the Java package of the .java source files.

map_java_to_class()

Map a .class compiled file to its .java source.
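One way to derive the expected source path from a compiled path relies on the Java convention that inner and anonymous classes compile to files named Outer$Inner.class or Outer$1.class. The helper below is a hypothetical sketch of that convention, not the pipeline's implementation.

```python
from pathlib import PurePosixPath


def class_to_java_path(class_path):
    """Return the expected .java path for a compiled .class path."""
    path = PurePosixPath(class_path)
    if path.suffix != ".class":
        raise ValueError(f"Not a .class file: {class_path}")
    # Inner and anonymous classes (Outer$Inner.class, Outer$1.class)
    # originate from the outer class source file, Outer.java.
    outer_name = path.stem.split("$")[0]
    return str(path.with_name(f"{outer_name}.java"))
```

The resulting relative path can then be compared against the package-qualified paths of the .java files found on the from/ side.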

flag_to_meta_inf_files()

Flag all META-INF/* files of the to/ directory as ignored.

map_jar_to_source()

Map .jar files to their related source directory.

map_javascript()

Map packed or minified JavaScript, TypeScript, CSS, and SCSS files to their sources.

match_purldb()

Match selected files by extension in PurlDB.

map_path()

Map using path similarities.
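Path similarity can be measured in several ways. The sketch below uses one simple measure, the number of trailing path segments two paths share, and picks the from/ path with the highest score. Both function names and the scoring rule are assumptions; the pipeline may use a different measure.

```python
def common_suffix_length(path_a, path_b):
    """Count the trailing path segments shared by two POSIX-style paths."""
    count = 0
    segments_a = reversed(path_a.split("/"))
    segments_b = reversed(path_b.split("/"))
    for seg_a, seg_b in zip(segments_a, segments_b):
        if seg_a != seg_b:
            break
        count += 1
    return count


def best_path_match(to_path, from_paths):
    """Return the from/ path sharing the most trailing segments, or None."""
    scored = [(common_suffix_length(to_path, p), p) for p in from_paths]
    score, match = max(scored, default=(0, None))
    return match if score > 0 else None
```

Comparing from the end of the path means the filename must match before any directory is considered, which avoids pairing unrelated files that merely live in similarly named directories.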

flag_mapped_resources_and_ignored_directories()

Flag all codebase resources that were mapped during the pipeline.

scan_mapped_from_for_files()

Scan mapped from/ files for copyrights, licenses, emails, and urls.

Docker Image Analysis

class scanpipe.pipelines.docker.Docker

Analyze Docker images.

extract_images()

Extract images from input tarballs.

extract_layers()

Extract layers from input images.

find_images_os_and_distro()

Find the operating system and distro of input images.

collect_images_information()

Collect and store image information in a project.

collect_and_create_codebase_resources()

Collect and label all image files as CodebaseResources.

collect_and_create_system_packages()

Collect installed system packages for each layer based on the distro.

tag_uninteresting_codebase_resources()

Flag files that don’t belong to any system package.

Docker Windows Image Analysis

class scanpipe.pipelines.docker_windows.DockerWindows

Analyze Windows Docker images.

tag_known_software_packages()

Flag files from known software packages by checking common install paths.

tag_uninteresting_codebase_resources()

Flag files that are known/labelled as uninteresting.

tag_program_files_dirs_as_packages()

Report the immediate subdirectories of Program Files and Program Files (x86) as packages.
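As an illustration of what "immediate subdirectories" means here, the sketch below scans a list of file paths and collects the first directory level beneath each Program Files directory. The function name, the forward-slash path layout, and the sample layout in the usage note are assumptions.

```python
PROGRAM_FILES_DIRS = ("Program Files", "Program Files (x86)")


def program_files_packages(file_paths):
    """Return sorted (program_files_dir, subdirectory) pairs from file paths."""
    found = set()
    for path in file_paths:
        segments = path.split("/")
        for index, segment in enumerate(segments):
            # A subdirectory needs at least one more segment after it
            # (the file itself), otherwise the child is a plain file.
            if segment in PROGRAM_FILES_DIRS and index + 2 < len(segments):
                found.add((segment, segments[index + 1]))
                break
    return sorted(found)
```

For a path such as "Files/Program Files/Git/bin/git.exe" this yields ("Program Files", "Git"), while a file sitting directly inside Program Files produces no entry.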

tag_data_files_with_no_clues()

Flag data files that have no clues on their origin as uninteresting.

Find Vulnerabilities

class scanpipe.pipelines.find_vulnerabilities.FindVulnerabilities

Find vulnerabilities for discovered packages in the VulnerableCode database.

Vulnerability data is stored in the extra_data field of each package.

check_vulnerablecode_service_availability()

Check if the VulnerableCode service is configured and available.

lookup_vulnerabilities()

Check for vulnerabilities on each of the project’s discovered packages.
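The lookup-and-store pattern can be sketched without a live service by passing the lookup in as a callable. Everything here is hypothetical: packages are plain dicts with a "purl" key, lookup is any function returning a list of vulnerabilities for a purl, and the "vulnerabilities" key inside extra_data is an assumed name.

```python
def attach_vulnerabilities(packages, lookup):
    """Store lookup results in each package's extra_data; return the hit count."""
    hits = 0
    for package in packages:
        vulnerabilities = lookup(package["purl"])
        if vulnerabilities:
            extra_data = package.setdefault("extra_data", {})
            extra_data["vulnerabilities"] = vulnerabilities  # hypothetical key
            hits += 1
    return hits
```

Packages with no known vulnerabilities are left untouched, so the presence of the key is itself a signal that something was found.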

Inspect Manifest

class scanpipe.pipelines.inspect_manifest.InspectManifest

Inspect one or more manifest files and resolve their packages.

Supports:

- BOM: SPDX document, CycloneDX BOM, AboutCode ABOUT file
- Python: requirements.txt, setup.py, setup.cfg, Pipfile.lock
- JavaScript: yarn.lock lockfile, npm package-lock.json lockfile
- Java: Java JAR MANIFEST.MF, Gradle build script
- Ruby: RubyGems gemspec manifest, RubyGems Bundler Gemfile.lock
- Rust: Rust Cargo.lock dependencies lockfile, Rust Cargo.toml package manifest
- PHP: PHP composer lockfile, PHP composer manifest
- NuGet: nuspec package manifest
- Dart: pubspec manifest, pubspec lockfile
- OS: FreeBSD compact package manifest, Debian installed packages database

Full list available at https://scancode-toolkit.readthedocs.io/en/doc-update-licenses/reference/available_package_parsers.html

get_manifest_inputs()

Locate all the manifest files from the project’s input/ directory.
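Recognizing manifests by filename can be sketched with a simple lookup table. The table below covers only a subset of the names from the supported list, and the function name group_manifests and the use of None for unrecognized files are assumptions; the real detection is handled by the ScanCode-toolkit package parsers.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# Subset of the supported manifest names, for illustration only.
MANIFEST_TYPES = {
    "requirements.txt": "Python",
    "setup.py": "Python",
    "setup.cfg": "Python",
    "Pipfile.lock": "Python",
    "yarn.lock": "JavaScript",
    "package-lock.json": "JavaScript",
    "MANIFEST.MF": "Java",
    "Gemfile.lock": "Ruby",
    "Cargo.lock": "Rust",
    "Cargo.toml": "Rust",
}


def group_manifests(paths):
    """Group manifest paths by ecosystem; unrecognized names go under None."""
    grouped = defaultdict(list)
    for path in paths:
        name = PurePosixPath(path).name
        grouped[MANIFEST_TYPES.get(name)].append(path)
    return dict(grouped)
```

A filename table is only a first filter: formats such as SPDX or CycloneDX documents have no fixed filename and need content inspection instead.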

get_packages_from_manifest()

Get packages data from manifest files.

create_resolved_packages()

Create the resolved packages and their dependencies in the database.

Load Inventory From Scan

class scanpipe.pipelines.load_inventory.LoadInventory

Load JSON/XLSX inventory files generated with ScanCode-toolkit or ScanCode.io.

Supported formats are ScanCode-toolkit JSON scan results, ScanCode.io JSON output, and ScanCode.io XLSX output.

An inventory is composed of packages, dependencies, resources, and relations.

get_inputs()

Locate all the supported input files from the project’s input/ directory.

build_inventory_from_scans()

Process JSON scan results files to populate packages, dependencies, and resources.

Root Filesystem Analysis

class scanpipe.pipelines.root_filesystems.RootFS

Analyze a Linux root filesystem, aka rootfs.

extract_input_files_to_codebase_directory()

Extract root filesystem input archives with extractcode.

find_root_filesystems()

Find root filesystems in the project’s codebase/.

collect_rootfs_information()

Collect and store rootfs information in the project.

collect_and_create_codebase_resources()

Collect and label all image files as CodebaseResource.

collect_and_create_system_packages()

Collect installed system packages for each rootfs based on the distro. The collection of system packages is only available for known distros.

tag_uninteresting_codebase_resources()

Flag files that don’t belong to any system package as not worth tracking.

tag_empty_files()

Flag empty files.

scan_for_application_packages()

Scan unknown resources for packages information.

match_not_analyzed_to_system_packages()

Match files with “not-yet-analyzed” status to files already belonging to system packages.

match_not_analyzed_to_application_packages()

Match files with “not-yet-analyzed” status to files already belonging to application packages.

scan_for_files()

Scan unknown resources for copyrights, licenses, emails, and urls.

analyze_scanned_files()

Analyze single file scan results for completeness.

tag_not_analyzed_codebase_resources()

Check for any leftover files for sanity; there should be none.

Scan Codebase

class scanpipe.pipelines.scan_codebase.ScanCodebase

Scan a codebase with ScanCode-toolkit.

If the codebase consists of several packages and dependencies, it will try to resolve and scan those too.

Input files are copied to the project’s codebase/ directory and are extracted in place before running the scan. Alternatively, the code can be manually copied to the project codebase/ directory.

copy_inputs_to_codebase_directory()

Copy input files to the project’s codebase/ directory. The code can also be copied there prior to running the Pipeline.

extract_archives()

Extract archives with extractcode.

collect_and_create_codebase_resources()

Collect and create codebase resources.

tag_empty_files()

Flag empty files.

scan_for_application_packages()

Scan unknown resources for packages information.

scan_for_files()

Scan unknown resources for copyrights, licenses, emails, and urls.

Scan Package

class scanpipe.pipelines.scan_package.ScanPackage

Scan a single package archive with ScanCode-toolkit.

The output is a summary of the scan results in JSON format.

get_package_archive_input()

Locate the input package archive in the project’s input/ directory.

collect_archive_information()

Collect and store information about the input archive in the project.

extract_archive_to_codebase_directory()

Extract package archive with extractcode.

run_scancode()

Scan extracted codebase/ content.

load_inventory_from_toolkit_scan()

Process JSON scan results to populate codebase resources and packages.

make_summary_from_scan_results()

Build a summary in JSON format from the generated scan results.