Built-in Pipelines

Pipelines in ScanCode.io are Python scripts that facilitate code analysis by executing a sequence of steps. The platform provides the following built-in pipelines:

Tip

If you are unsure which pipeline suits your requirements best, check out the Which pipeline should I use? section for guidance.

Pipeline Base Class

class scanpipe.pipelines.Pipeline

Alias for the ProjectPipeline class.
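
Every built-in pipeline documented below follows the same shape: a Pipeline subclass whose steps() classmethod returns the ordered sequence of step methods to run. The following is a minimal sketch of that shape, not an existing built-in pipeline; the class name, step name, and status value are hypothetical, and it assumes the steps() convention and the project.codebaseresources manager behave as in current ScanCode.io releases.

    # Minimal sketch of a pipeline; names and the status value are hypothetical.
    from scanpipe.pipelines import Pipeline


    class FlagEmptyFiles(Pipeline):
        """Flag empty files in the project codebase (illustrative example)."""

        @classmethod
        def steps(cls):
            # Steps run in the order listed; each step is a regular method.
            return (cls.flag_empty_files,)

        def flag_empty_files(self):
            """Set a status on codebase resources that have a zero size."""
            self.project.codebaseresources.filter(size=0).update(
                status="ignored-empty-file"
            )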

Analyze Docker Image

class scanpipe.pipelines.docker.Docker

Analyze Docker images.

extract_images()

Extract images from input tarballs.

extract_layers()

Extract layers from input images.

find_images_os_and_distro()

Find the operating system and distro of input images.

collect_images_information()

Collect and store image information in a project.

collect_and_create_codebase_resources()

Collect and label all image files as CodebaseResources.

collect_and_create_system_packages()

Collect installed system packages for each layer based on the distro.

flag_uninteresting_codebase_resources()

Flag files that don’t belong to any system package.

Analyze Root Filesystem or VM Image

class scanpipe.pipelines.root_filesystem.RootFS

Analyze a Linux root filesystem, also known as rootfs.

extract_input_files_to_codebase_directory()

Extract root filesystem input archives with extractcode.

find_root_filesystems()

Find root filesystems in the project’s codebase/.

collect_rootfs_information()

Collect and store rootfs information on the project.

collect_and_create_codebase_resources()

Collect and label all image files as CodebaseResource.

collect_and_create_system_packages()

Collect installed system packages for each rootfs based on the distro. The collection of system packages is only available for known distros.

flag_uninteresting_codebase_resources()

Flag files that are not worth tracking and don’t belong to any system package.

scan_for_application_packages()

Scan unknown resources for package information.

match_not_analyzed_to_system_packages()

Match files with “not-yet-analyzed” status to files already belonging to system packages.

match_not_analyzed_to_application_packages()

Match files with “not-yet-analyzed” status to files already belonging to application packages.

scan_for_files()

Scan unknown resources for copyrights, licenses, emails, and URLs.

analyze_scanned_files()

Analyze single file scan results for completeness.

flag_not_analyzed_codebase_resources()

Check for any leftover files for sanity; there should be none.

Analyze Docker Windows Image

class scanpipe.pipelines.docker_windows.DockerWindows

Analyze Windows Docker images.

flag_known_software_packages()

Flag files from known software packages by checking common install paths.

flag_uninteresting_codebase_resources()

Flag files that are known/labelled as uninteresting.

flag_program_files_dirs_as_packages()

Report the immediate subdirectories of Program Files and Program Files (x86) as packages.

flag_data_files_with_no_clues()

Flag data files that have no clues on their origin as uninteresting.

Collect strings with Xgettext (addon)

class scanpipe.pipelines.collect_strings_gettext.CollectStringsGettext

Collect source string literals with xgettext.

collect_and_store_resource_strings()

Collect source strings from codebase files using gettext and store them in the extra data field.

Collect symbols with Ctags (addon)

class scanpipe.pipelines.collect_symbols_ctags.CollectSymbolsCtags

Collect source symbols with Ctags.

collect_and_store_resource_symbols()

Collect symbols from codebase files using Ctags and store them in the extra data field.

Collect symbols, strings and comments with Pygments (addon)

class scanpipe.pipelines.collect_symbols_pygments.CollectSymbolsPygments

Collect source symbols, string literals and comments with Pygments.

collect_and_store_pygments_symbols_and_strings()

Collect symbols, strings and comments from codebase files using pygments and store them in the extra data field.

Collect symbols and strings with Tree-Sitter (addon)

class scanpipe.pipelines.collect_symbols_tree_sitter.CollectSymbolsTreeSitter

Collect source symbols and string literals with Tree-sitter.

collect_and_store_tree_sitter_symbols_and_strings()

Collect symbols and strings from codebase files using tree-sitter and store them in the extra data field.

Enrich With PurlDB (addon)

Warning

This pipeline requires access to a PurlDB service. Refer to PURLDB to configure access to PurlDB in your ScanCode.io instance.

class scanpipe.pipelines.enrich_with_purldb.EnrichWithPurlDB

Enrich the discovered packages with data available in the PurlDB.

enrich_discovered_packages_with_purldb()

Look up discovered packages in PurlDB.

Find Vulnerabilities (addon)

Warning

This pipeline requires access to a VulnerableCode database. Refer to VULNERABLECODE to configure access to VulnerableCode in your ScanCode.io instance.

class scanpipe.pipelines.find_vulnerabilities.FindVulnerabilities

Find vulnerabilities for packages and dependencies in the VulnerableCode database.

Vulnerability data is stored on each package and dependency instance.
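
Once the pipeline has run, the stored vulnerability data can be read back, for example from the Django shell. This is a rough sketch only: it assumes the affected_by_vulnerabilities field, the discoveredpackages related manager, and the vulnerability_id key behave as in current releases, and the project name is a placeholder.

    # Rough sketch (Django shell); field and key names are assumptions,
    # and "my-project" is a placeholder project name.
    from scanpipe.models import Project

    project = Project.objects.get(name="my-project")
    vulnerable = project.discoveredpackages.exclude(affected_by_vulnerabilities=[])
    for package in vulnerable:
        for entry in package.affected_by_vulnerabilities:
            print(package.purl, entry.get("vulnerability_id"))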

check_vulnerablecode_service_availability()

Check if the VulnerableCode service is configured and available.

lookup_packages_vulnerabilities()

Check for vulnerabilities for each of the project’s discovered packages.

lookup_dependencies_vulnerabilities()

Check for vulnerabilities for each of the project’s discovered dependencies.

Inspect ELF Binaries (addon)

class scanpipe.pipelines.inspect_elf_binaries.InspectELFBinaries

Inspect ELF binaries and collect DWARF paths.

collect_dwarf_source_path_references()

Collect DWARF paths from ELF files and set values on the extra_data field.

Inspect Packages

class scanpipe.pipelines.inspect_packages.InspectPackages

Inspect a codebase for packages and pre-resolved dependencies.

This pipeline inspects a codebase for application packages and their dependencies using package manifests and dependency lockfiles. It does not resolve dependencies; instead, it collects the pre-resolved dependencies recorded in lockfiles, as well as the direct dependencies (possibly unresolved) listed in the dependency sections of package manifests.

See documentation for the list of supported package manifests and dependency lockfiles: https://scancode-toolkit.readthedocs.io/en/stable/reference/available_package_parsers.html
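
As a usage illustration, the sketch below creates a project over the REST API and runs this pipeline on a downloaded archive. It assumes the /api/projects/ endpoint accepts the name, input_urls, pipeline, and execute_now fields as documented for the REST API; the instance URL, API key, and input URL are placeholders.

    # Rough sketch; endpoint fields are assumed, all URLs and keys are placeholders.
    import requests

    SCANCODEIO_URL = "https://scancode.example.com"
    headers = {"Authorization": "Token YOUR_API_KEY"}

    data = {
        "name": "inspect-packages-example",
        "input_urls": ["https://example.com/package-archive.tar.gz"],
        "pipeline": "inspect_packages",
        "execute_now": True,
    }
    response = requests.post(
        f"{SCANCODEIO_URL}/api/projects/", json=data, headers=headers
    )
    response.raise_for_status()
    print(response.json())  # details of the newly created project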

scan_for_application_packages()

Scan resources for package information to add DiscoveredPackage and DiscoveredDependency objects from detected package data.

resolve_dependencies()

Create packages and dependency relationships from lockfiles or manifests containing pre-resolved dependencies.

Load Inventory

class scanpipe.pipelines.load_inventory.LoadInventory

Load JSON/XLSX inventory files generated with ScanCode-toolkit or ScanCode.io.

Supported formats are ScanCode-toolkit JSON scan results, ScanCode.io JSON output, and ScanCode.io XLSX output.

An inventory is composed of packages, dependencies, resources, and relations.

get_inputs()

Locate all the supported input files from the project’s input/ directory.

build_inventory_from_scans()

Process JSON scan results files to populate packages, dependencies, and resources.

Load SBOM

class scanpipe.pipelines.load_sbom.LoadSBOM

Load package data from one or more SBOMs.

Supported SBOM formats:

  • SPDX documents

  • CycloneDX BOMs

Other formats:

  • AboutCode .ABOUT files for package curations
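
For illustration, the snippet below builds a minimal, made-up CycloneDX BOM as a Python dict and writes it to a JSON file; real SBOMs usually carry many more fields, and the .cdx.json file name is an assumption about how the document gets recognized. Add the resulting file to the project’s input/ directory before running the pipeline.

    # A minimal, made-up CycloneDX BOM; the component is an arbitrary example.
    import json

    bom = {
        "bomFormat": "CycloneDX",
        "specVersion": "1.5",
        "version": 1,
        "components": [
            {
                "type": "library",
                "name": "requests",
                "version": "2.31.0",
                "purl": "pkg:pypi/requests@2.31.0",
            }
        ],
    }

    # The .cdx.json suffix is an assumption used here so the file is picked up
    # as a CycloneDX document.
    with open("example.cdx.json", "w") as f:
        json.dump(bom, f, indent=2)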

get_sbom_inputs()

Locate all the SBOMs among the codebase resources.

get_packages_from_sboms()

Get packages data from SBOMs.

create_packages_from_sboms()

Create the packages declared in the SBOMs.

create_dependencies_from_sboms()

Create the dependency relationships declared in the SBOMs.

Resolve Dependencies

class scanpipe.pipelines.resolve_dependencies.ResolveDependencies

Resolve dependencies from package manifests and lockfiles.

This pipeline collects lockfiles and manifest files that contain dependency requirements, and resolves these to a concrete set of package versions.

Supports resolving packages for:

  • Python: using python-inspector, with requirements.txt and setup.py manifests as inputs
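
As a concrete illustration of the Python support, the snippet below writes a small requirements.txt with loose specifiers; the package names are arbitrary examples. Added to the project’s input/ directory, such a manifest is handed to python-inspector, which resolves the specifiers into a concrete, pinned set of packages and dependencies that the pipeline then stores.

    # Illustrative only: a small manifest this pipeline accepts as input.
    # The package specifiers are arbitrary examples.
    manifest_lines = ["requests>=2.31", "click"]

    # Add the resulting requirements.txt to the project's input/ directory.
    with open("requirements.txt", "w") as f:
        f.write("\n".join(manifest_lines) + "\n")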

get_manifest_inputs()

Locate package manifest files with a supported package resolver.

scan_for_application_packages()

Scan and assemble application packages from package manifests and lockfiles.

create_packages_and_dependencies()

Create the statically resolved packages and their dependencies in the database.

get_packages_from_manifest()

Resolve package data from lockfiles/requirements files with package requirements/dependencies.

create_resolved_packages()

Create the dynamically resolved packages and their dependencies in the database.

Map Deploy To Develop

Warning

This pipeline requires input files to be tagged with the following:

  • “from”: For files related to the source code (also known as “develop”).

  • “to”: For files related to the build/binaries (also known as “deploy”).

How you tag input files varies based on whether you are using the REST API, UI, or CLI. Refer to the How to tag input files? section for guidance.

class scanpipe.pipelines.deploy_to_develop.DeployToDevelop

Establish relationships between two code trees: deployment and development.

This pipeline requires a minimum of two archive files, each properly tagged with:

  • from for archives containing the development source code.

  • to for archives containing the deployment compiled code.

When using download URLs as inputs, the “from” and “to” tags can be provided by adding a “#from” or “#to” fragment at the end of the download URLs, as shown in the sketch after the list below.

When uploading local files:

  • User Interface: Use the “Edit flag” link in the “Inputs” panel of the Project details view.

  • REST API: Use the “upload_file_tag” field in addition to the “upload_file” field.

  • Command Line Interface: Tag uploaded files using the “filename:tag” syntax, for example, --input-file path/filename:tag.
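
A rough sketch of the URL-fragment form of tagging over the REST API follows; it reuses the /api/projects/ endpoint fields shown earlier (name, input_urls, pipeline, execute_now), which are assumed to match the documented REST API, and every URL, name, and key is a placeholder.

    # Rough sketch: "#from" / "#to" fragments tag each downloaded input archive.
    # Endpoint fields are assumptions; all URLs, names, and keys are placeholders.
    import requests

    SCANCODEIO_URL = "https://scancode.example.com"
    headers = {"Authorization": "Token YOUR_API_KEY"}

    data = {
        "name": "deploy-to-develop-example",
        "input_urls": [
            "https://example.com/my-app-1.0-sources.zip#from",  # development code
            "https://example.com/my-app-1.0.jar#to",  # deployment binaries
        ],
        "pipeline": "map_deploy_to_develop",  # assumed identifier, matching this section's title
        "execute_now": True,
    }
    response = requests.post(
        f"{SCANCODEIO_URL}/api/projects/", json=data, headers=headers
    )
    response.raise_for_status()

The same tagging can be achieved on the command line with the --input-file path/filename:tag syntax listed above.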

get_inputs()

Locate the from and to input files.

extract_inputs_to_codebase_directory()

Extract input files to the project’s codebase/ directory.

collect_and_create_codebase_resources()

Collect and create codebase resources.

fingerprint_codebase_directories()

Compute directory fingerprints for matching.

flag_whitespace_files()

Flag whitespace files with a size of 100 bytes or less as ignored.

map_about_files()

Map from/ .ABOUT files to their related to/ resources.

map_checksum()

Map using SHA1 checksum.

match_archives_to_purldb()

Match selected package archives by extension to PurlDB.

find_java_packages()

Find the Java package of the .java source files.

map_java_to_class()

Map a .class compiled file to its .java source.

map_jar_to_source()

Map .jar files to their related source directory.

map_javascript()

Map packed or minified JavaScript, TypeScript, CSS, and SCSS files to their sources.

map_elf()

Map ELF binaries to their sources.

map_go()

Map Go binaries to their sources.

match_directories_to_purldb()

Match selected directories in PurlDB.

match_resources_to_purldb()

Match selected files by extension in PurlDB.

map_javascript_post_purldb_match()

Map minified JavaScript files based on existing PurlDB matches.

map_javascript_path()

Map JavaScript files based on path.

map_javascript_colocation()

Map JavaScript files based on neighborhood file mapping.

map_thirdparty_npm_packages()

Map third-party npm packages using package.json metadata.

map_path()

Map using path similarities.

flag_mapped_resources_archives_and_ignored_directories()

Flag all codebase resources that were mapped during the pipeline.

perform_house_keeping_tasks()

Perform cleanup and housekeeping tasks.

On the deployed (to/) side:

  • Match files that have the no-java-source or an empty status against PurlDB; if no match is found, update their status to requires-review.

  • Update the status of uninteresting files.

  • Flag dangling legal files for review.

On the development (from/) side:

  • Update the status of files that are not deployed.

match_purldb_resources_post_process()

Choose the best package for PurlDB matched resources.

remove_packages_without_resources()

Remove packages without any resources.

scan_unmapped_to_files()

Scan unmapped/matched to/ files for copyrights, licenses, emails, and URLs, and update their status to requires-review.

scan_mapped_from_for_files()

Scan mapped from/ files for copyrights, licenses, emails, and URLs.

create_local_files_packages()

Create local-files packages for codebase resources not part of a package.

flag_deployed_from_resources_with_missing_license()

Update the status of deployed from/ files with a missing license.

Match to MatchCode (addon)

Warning

This pipeline requires access to a MatchCode.io service. Refer to MATCHCODE.IO to configure access to MatchCode.io in your ScanCode.io instance.

class scanpipe.pipelines.match_to_matchcode.MatchToMatchCode

Match the codebase resources of a project against MatchCode.io to identify packages.

This process involves:

  1. Generating a JSON scan of the project codebase

  2. Transmitting it to MatchCode.io and awaiting match results

  3. Creating discovered packages from the package data obtained

  4. Associating the codebase resources with those discovered packages

Currently, MatchCode.io can only match archives, directories, and files from Maven and npm packages.

This pipeline requires a MatchCode.io instance to be configured and available. There is currently no public instance of MatchCode.io. Reach out to nexB, Inc. for other arrangements.

check_matchcode_service_availability()

Check if the MatchCode.io service is configured and available.

send_project_json_to_matchcode()

Create a JSON scan of the project codebase and send it to MatchCode.io.

poll_matching_results()

Wait until the match results are ready by polling the match run status.

create_packages_from_match_results()

Create DiscoveredPackages from match results.

Populate PurlDB (addon)

Warning

This pipeline requires access to a PurlDB service. Refer to PURLDB to configure access to PurlDB in your ScanCode.io instance.

class scanpipe.pipelines.populate_purldb.PopulatePurlDB

Populate PurlDB with discovered project packages and their dependencies.

populate_purldb_with_discovered_packages()

Add DiscoveredPackage to PurlDB.

populate_purldb_with_discovered_dependencies()

Add DiscoveredDependency to PurlDB.

Scan Codebase

class scanpipe.pipelines.scan_codebase.ScanCodebase

Scan a codebase for application packages, licenses, and copyrights.

This pipeline does not further scan the files contained in a package for licenses and copyrights; it only considers a package’s declared license. It does not scan for system (Linux distro) packages.

copy_inputs_to_codebase_directory()

Copy input files to the project’s codebase/ directory. The code can also be copied there prior to running the Pipeline.

collect_and_create_codebase_resources()

Collect and create codebase resources.

scan_for_application_packages()

Scan unknown resources for package information.

scan_for_files()

Scan unknown resources for copyrights, licenses, emails, and URLs.

Scan For Virus

class scanpipe.pipelines.scan_for_virus.ScanForVirus

Run a ClamAV scan on the codebase directory to detect virus infection.

scan_for_virus()

Run a ClamAV scan to detect virus infection.

Scan Single Package

class scanpipe.pipelines.scan_single_package.ScanSinglePackage

Scan a single package archive (or package manifest file).

This pipeline scans a single package for package metadata, declared dependencies, licenses, license clarity score, and copyrights.

The output is a summary of the scan results in JSON format.

get_package_input()

Locate the package input in the project’s input/ directory.

collect_input_information()

Collect and store information about the project input.

extract_input_to_codebase_directory()

Copy or extract input to project codebase/ directory.

run_scan()

Scan extracted codebase/ content.

load_inventory_from_toolkit_scan()

Process JSON scan results to populate codebase resources and packages.

make_summary_from_scan_results()

Build a summary in JSON format from the generated scan results.