Built-in Pipelines

Pipelines in ScanCode.io are Python scripts that facilitate code analysis by executing a sequence of steps. The platform provides the following built-in pipelines:

Note

Some pipelines have optional steps which are enabled only when they are selected explicitly.

Tip

If you are unsure which pipeline suits your requirements best, check out the Which pipeline should I use? section for guidance.

Pipeline Base Class

class scanpipe.pipelines.Pipeline: Alias for the ProjectPipeline class.

Analyse Docker Image

class scanpipe.pipelines.analyze_docker.Docker

Analyze Docker images.

extract_images(): Extract images from input tarballs.

extract_layers(): Extract layers from input images.

find_images_os_and_distro(): Find the operating system and distro of input images.

collect_images_information(): Collect and store image information in a project.

collect_and_create_codebase_resources(): Collect and labels all image files as CodebaseResources.

collect_and_create_system_packages(): Collect installed system packages for each layer based on the distro.

flag_uninteresting_codebase_resources(): Flag files that don’t belong to any system package.

Analyze Root Filesystem or VM Image

class scanpipe.pipelines.analyze_root_filesystem.RootFS

Analyze a Linux root filesystem, also known as rootfs.

extract_input_files_to_codebase_directory(): Extract root filesystem input archives with extractcode.

find_root_filesystems(): Find root filesystems in the project’s codebase/.

collect_rootfs_information(): Collect and stores rootfs information on the project.

collect_and_create_codebase_resources(): Collect and label all image files as CodebaseResource.

collect_and_create_system_packages(): Collect installed system packages for each rootfs based on the distro. The collection of system packages is only available for known distros.

flag_uninteresting_codebase_resources(): Flag files—not worth tracking—that don’t belong to any system packages.

scan_for_application_packages(): Scan unknown resources for packages information.

match_not_analyzed_to_system_packages(): Match files with “not-yet-analyzed” status to files already belonging to system packages.

match_not_analyzed_to_application_packages(): Match files with “not-yet-analyzed” status to files already belonging to application packages.

scan_for_files(): Scan unknown resources for copyrights, licenses, emails, and urls.

collect_and_create_license_detections(): Collect and create unique license detections from resources and package data.

analyze_scanned_files(): Analyze single file scan results for completeness.

flag_not_analyzed_codebase_resources(): Check for any leftover files for sanity; there should be none.

Analyse Docker Windows Image

class scanpipe.pipelines.analyze_docker_windows.DockerWindows

Analyze Windows Docker images.

flag_known_software_packages(): Flag files from known software packages by checking common install paths.

flag_uninteresting_codebase_resources(): Flag files that are known/labelled as uninteresting.

flag_program_files_dirs_as_packages(): Report the immediate subdirectories of Program Files and Program Files (x86) as packages.

flag_data_files_with_no_clues(): Flag data files that have no clues on their origin as uninteresting.

Benchmark PURLs (addon)

To check an SBOM against a list of expected Package URLs (PURLs):

Create a new project and provide two inputs:
- The SBOM file you want to check.
- A list of expected PURLs in a *-purls.txt file with one PURL per line.
  
  Tip
  
  You may also flag any filename using the purls input tag.
Run the pipelines:
- Select and run the load_sbom pipeline to load the SBOM.
- Run the benchmark_purls pipeline to validate against the expected PURLs.
Download the results from the “output” section of the project.

The output file contains only the differences between the discovered PURLs and the expected PURLs:

Lines starting with - are missing from the project.
Lines starting with + are unexpected in the project.

Note

The load_sbom pipeline is provided as an example to benchmark external tools using SBOMs as inputs. You can also run benchmark_purls directly after any ScanCode.io pipeline to validate the discovered PURLs.

Tip

You can provide multiple expected PURLs files.

class scanpipe.pipelines.benchmark_purls.BenchmarkPurls

Validate discovered project packages against a reference list of expected PURLs.

The expected PURLs must be provided as a .txt file with one PURL per line. Input files are recognized if:

They are tagged with “purls”, or
Their filename ends with “purls.txt” (e.g., “expected_purls.txt”).

get_expected_purls(): Load the expected PURLs defined in the project inputs.

compare_purls(): Run the PURLs diff and write the results to a project output file.

Collect string with Xgettext (addon)

class scanpipe.pipelines.collect_strings_gettext.CollectStringsGettext

Collect source string literals with xgettext.

collect_and_store_resource_strings(): Collect source strings from codebase files using gettext and store them in the extra data field.

Collect symbols with Ctags (addon)

class scanpipe.pipelines.collect_symbols_ctags.CollectSymbolsCtags

Collect source symbols with Ctags.

collect_and_store_resource_symbols(): Collect symbols from codebase files using Ctags and store them in the extra data field.

Collect symbols, string and comments with Pygments (addon)

class scanpipe.pipelines.collect_symbols_pygments.CollectSymbolsPygments

Collect source symbols, string literals and comments with Pygments.

collect_and_store_pygments_symbols_and_strings(): Collect symbols, strings and comments from codebase files using pygments and store them in the extra data field.

Collect symbols and string with Tree-Sitter (addon)

class scanpipe.pipelines.collect_symbols_tree_sitter.CollectSymbolsTreeSitter

Collect source symbols and string literals with Tree-sitter.

collect_and_store_tree_sitter_symbols_and_strings(): Collect symbols and strings from codebase files using tree-sitter and store them in the extra data field.

Enrich With PurlDB (addon)

Warning

This pipeline requires access to a PurlDB service. Refer to PURLDB to configure access to PurlDB in your ScanCode.io instance.

class scanpipe.pipelines.enrich_with_purldb.EnrichWithPurlDB

Enrich the discovered packages with data available in the PurlDB.

enrich_discovered_packages_with_purldb(): Lookup discovered packages in PurlDB.

Find Vulnerabilities (addon)

Warning

This pipeline requires access to a VulnerableCode database. Refer to VULNERABLECODE to configure access to VulnerableCode in your ScanCode.io instance.

class scanpipe.pipelines.find_vulnerabilities.FindVulnerabilities

Find vulnerabilities for packages and dependencies in the VulnerableCode database.

Vulnerability data is stored on each package and dependency instance.

check_vulnerablecode_service_availability(): Check if the VulnerableCode service if configured and available.

lookup_packages_vulnerabilities(): Check for vulnerabilities for each of the project’s discovered package.

lookup_dependencies_vulnerabilities(): Check for vulnerabilities for each of the project’s discovered dependency.

Inspect ELF Binaries (addon)

class scanpipe.pipelines.inspect_elf_binaries.InspectELFBinaries

Inspect ELF binaries and collect DWARF paths.

collect_dwarf_source_path_references(): Collect DWARF paths from ELF files and set values on the extra_data field.

Inspect Packages

class scanpipe.pipelines.inspect_packages.InspectPackages

Inspect a codebase for packages and pre-resolved dependencies.

This pipeline inspects a codebase for application packages and their dependencies using package manifests and dependency lockfiles. It does not resolve dependencies, it does instead collect already pre-resolved dependencies from lockfiles, and direct dependencies (possibly not resolved) as found in package manifests’ dependency sections.

See documentation for the list of supported package manifests and dependency lockfiles: https://scancode-toolkit.readthedocs.io/en/stable/reference/available_package_parsers.html

scan_binaries_for_package()

Optional step: Compiled

Scan compiled binaries for package and dependency related data’ Currently supported compiled binaries: Go, Rust.

scan_for_application_packages(): Scan resources for package information to add DiscoveredPackage and DiscoveredDependency objects from detected package data.

resolve_dependencies()

Optional step: StaticResolver

Create packages and dependency relationships from lockfiles or manifests containing pre-resolved dependencies.

Load Inventory

class scanpipe.pipelines.load_inventory.LoadInventory

Load JSON/XLSX inventory files generated with ScanCode-toolkit or ScanCode.io.

Supported format are ScanCode-toolkit JSON scan results, ScanCode.io JSON output, and ScanCode.io XLSX output.

An inventory is composed of packages, dependencies, resources, and relations.

get_inputs(): Locate all the supported input files from the project’s input/ directory.

build_inventory_from_scans(): Process JSON scan results files to populate packages, dependencies, and resources.

Load SBOM

class scanpipe.pipelines.load_sbom.LoadSBOM

Load package data from one or more SBOMs.

Supported SBOMs: - SPDX document - CycloneDX BOM Other formats: - AboutCode .ABOUT files for package curations.

get_sbom_inputs(): Locate all the SBOMs among the codebase resources.

get_data_from_sboms(): Get data from SBOMs.

create_packages_from_sboms(): Create the packages declared in the SBOMs.

create_dependencies_from_sboms(): Create the dependency relationship declared in the SBOMs.

Resolve Dependencies

class scanpipe.pipelines.resolve_dependencies.ResolveDependencies

Resolve dependencies from package manifests and lockfiles.

This pipeline collects lockfiles and manifest files that contain dependency requirements, and resolves these to a concrete set of package versions.

Supports resolving packages for: - Python: using python-inspector, using requirements.txt and setup.py manifests as inputs

get_manifest_inputs(): Locate package manifest files with a supported package resolver.

scan_for_application_packages()

Optional step: StaticResolver

Scan and assemble application packages from package manifests and lockfiles.

create_packages_and_dependencies()

Optional step: StaticResolver

Create the statically resolved packages and their dependencies in the database.

get_packages_from_manifest()

Optional step: DynamicResolver

Resolve package data from lockfiles/requirement files with package requirements/dependencies.

create_resolved_packages()

Optional step: DynamicResolver

Create the dynamically resolved packages and their dependencies in the database.

Map Deploy To Develop

Warning

This pipeline requires input files to be tagged with the following:

“from”: For files related to the source code (also known as “develop”).
“to”: For files related to the build/binaries (also known as “deploy”).

Tagging your input files varies based on whether you are using the REST API, UI, or CLI. Refer to the How to tag input files? section for guidance.

class scanpipe.pipelines.deploy_to_develop.DeployToDevelop

Establish relationships between two code trees: deployment and development.

This pipeline requires a minimum of two archive files, each properly tagged with:

from for archives containing the development source code.
to for archives containing the deployment compiled code.

When using download URLs as inputs, the “from” and “to” tags can be provided by adding a “#from” or “#to” fragment at the end of the download URLs.

When uploading local files:

User Interface: Use the “Edit flag” link in the “Inputs” panel of the Project details view.
REST API: Utilize the “upload_file_tag” field in addition to the “upload_file”.
Command Line Interface: Tag uploaded files using the “filename:tag” syntax, for example, --input-file path/filename:tag.

get_inputs(): Locate the from and to input files.

extract_inputs_to_codebase_directory(): Extract input files to the project’s codebase/ directory.

collect_and_create_codebase_resources(): Collect and create codebase resources.

fingerprint_codebase_directories(): Compute directory fingerprints for matching

flag_whitespace_files(): Flag whitespace files with size less than or equal to 100 byte as ignored.

load_ecosystem_config(): Load ecosystem specific configurations for d2d steps for selected options.

map_ruby()

Optional step: Ruby

Load Ruby specific configurations for d2d steps.

map_about_files(): Map from/ .ABOUT files to their related to/ resources.

map_checksum(): Map using SHA1 checksum.

match_archives_to_purldb(): Match selected package archives by extension to PurlDB.

find_java_packages()

Optional step: Java

Find the java package of the .java source files.

map_java_to_class()

Optional step: Java

Map a .class compiled file to its .java source.

map_jar_to_java_source()

Optional step: Java

Map .jar files to their related source directory.

find_scala_packages()

Optional step: Scala

Find the java package of the .scala source files.

map_scala_to_class()

Optional step: Scala

Map a .class compiled file to its .scala source.

map_jar_to_scala_source()

Optional step: Scala

Map .jar files to their related source directory.

find_kotlin_packages()

Optional step: Kotlin

Find the java package of the kotlin source files.

map_kotlin_to_class()

Optional step: Kotlin

Map a .class compiled file to its kotlin source.

map_jar_to_kotlin_source()

Optional step: Kotlin

Map .jar files to their related source directory.

find_grammar_packages()

Optional step: Antlr

Find the java package of the .g/.g4 source files.

map_grammar_to_class()

Optional step: Antlr

Map a .class compiled file to its .g/.g4 source.

map_jar_to_grammar_source()

Optional step: Antlr

Map .jar files to their related source directory.

find_groovy_packages()

Optional step: Groovy

Find the package of the .groovy source files.

map_groovy_to_class()

Optional step: Groovy

Map a .class compiled file to its .groovy source.

map_jar_to_groovy_source()

Optional step: Groovy

Map .jar files to their related source directory.

find_aspectj_packages()

Optional step: AspectJ

Find the package of the .aj source files.

map_aspectj_to_class()

Optional step: AspectJ

Map a .class compiled file to its .aj source.

map_jar_to_aspectj_source()

Optional step: AspectJ

Map .jar files to their related source directory.

find_clojure_packages()

Optional step: Clojure

Find the package of the .clj source files.

map_clojure_to_class()

Optional step: Clojure

Map a .class compiled file to its .clj source.

map_jar_to_clojure_source()

Optional step: Clojure

Map .jar files to their related source directory.

find_xtend_packages()

Optional step: Xtend

Find the java package of the xtend source files.

map_xtend_to_class()

Optional step: Xtend

Map a .class compiled file to its xtend source.

map_javascript()

Optional step: JavaScript

Map a packed or minified JavaScript, TypeScript, CSS and SCSS to its source.

map_javascript_symbols()

Optional step: JavaScript

Map deployed JavaScript, TypeScript to its sources using symbols.

map_javascript_strings()

Optional step: JavaScript

Map deployed JavaScript, TypeScript to its sources using string literals.

get_symbols_from_binaries(): Extract symbols from Elf, Mach0 and windows binaries for mapping.

map_elf()

Optional step: Elf

Map ELF binaries to their sources using dwarf paths and symbols.

map_macho()

Optional step: MacOS

Map mach0 binaries to their sources using symbols.

map_winpe()

Optional step: Windows

Map winpe binaries to their sources using symbols.

map_go()

Optional step: Go

Map Go binaries to their sources using paths and symbols.

map_rust()

Optional step: Rust

Map Rust binaries to their sources using symbols.

map_python()

Optional step: Python

Map binaries from Python packages to their sources using dwarf paths and symbols.

match_directories_to_purldb(): Match selected directories in PurlDB.

match_resources_to_purldb(): Match selected files by extension in PurlDB.

map_javascript_post_purldb_match()

Optional step: JavaScript

Map minified javascript file based on existing PurlDB match.

map_javascript_path()

Optional step: JavaScript

Map javascript file based on path.

map_javascript_colocation()

Optional step: JavaScript

Map JavaScript files based on neighborhood file mapping.

map_thirdparty_npm_packages()

Optional step: JavaScript

Map thirdparty package using package.json metadata.

map_path(): Map using path similarities.

flag_mapped_resources_archives_and_ignored_directories(): Flag all codebase resources that were mapped during the pipeline.

perform_house_keeping_tasks()

On deployed side

Ignore specific files based on ecosystem based configurations.
PurlDB match files with no-java-source and empty status,
if no match is found update status to requires-review.
Update status for uninteresting files.
Flag the dangling legal files for review.

On devel side

Update status for not deployed files.

match_purldb_resources_post_process(): Choose the best package for PurlDB matched resources.

remove_packages_without_resources(): Remove packages without any resources.

scan_ignored_to_files(): Scan status=”ignored-from-config” to/ files for copyrights, licenses, emails, and urls. These files are ignored based on ecosystem specific configurations. These files are not used for the D2D purpose, but scanning them may provide useful information about the deployed codebase.

scan_unmapped_to_files(): Scan unmapped/matched to/ files for copyrights, licenses, emails, and urls and update the status to requires-review.

scan_mapped_from_for_files(): Scan mapped from/ files for copyrights, licenses, emails, and urls.

collect_and_create_license_detections(): Collect and create unique license detections from resources and package data.

create_local_files_packages(): Create local-files packages for codebase resources not part of a package.

flag_deployed_from_resources_with_missing_license(): Update the status for deployed from files with missing license.

Match to MatchCode (addon)

Warning

This pipeline requires access to a MatchCode.io service. Refer to MATCHCODE.IO to configure access to MatchCode.io in your ScanCode.io instance.

class scanpipe.pipelines.match_to_matchcode.MatchToMatchCode

Match the codebase resources of a project against MatchCode.io to identify packages.

This process involves:

Generating a JSON scan of the project codebase
Transmitting it to MatchCode.io and awaiting match results
Creating discovered packages from the package data obtained
Associating the codebase resources with those discovered packages

Currently, MatchCode.io can only match for archives, directories, and files from Maven and npm Packages.

This pipeline requires a MatchCode.io instance to be configured and available. There is currently no public instance of MatchCode.io. Reach out to nexB, Inc. for other arrangements.

check_matchcode_service_availability(): Check if the MatchCode.io service if configured and available.

send_project_json_to_matchcode(): Create a JSON scan of the project Codebase and send it to MatchCode.io.

poll_matching_results(): Wait until the match results are ready by polling the match run status.

create_packages_from_match_results(): Create DiscoveredPackages from match results.

Populate PurlDB (addon)

Warning

This pipeline requires access to a PurlDB service. Refer to PURLDB to configure access to PurlDB in your ScanCode.io instance.

class scanpipe.pipelines.populate_purldb.PopulatePurlDB

Populate PurlDB with discovered project packages and their dependencies.

populate_purldb_with_discovered_packages(): Add DiscoveredPackage to PurlDB.

populate_purldb_with_discovered_dependencies(): Add DiscoveredDependency to PurlDB.

Publish To FederatedCode (addon)

Warning

This pipeline requires access to a FederatedCode service. Refer to FEDERATEDCODE to configure access to FederatedCode in your ScanCode.io instance.

class scanpipe.pipelines.publish_to_federatedcode.PublishToFederatedCode

Publish package scan to FederatedCode.

This pipeline commits the project scan result in FederatedCode Git repository. It uses Project PURL to determine the Git repository and the exact directory path where the scan should be stored.

check_federatedcode_eligibility(): Check if the project fulfills the following criteria for pushing the project result to FederatedCode.

get_package_repository(): Get the Git repository URL and scan path for a given package.

clone_repository(): Clone repository to local_path.

add_scan_result(): Add package scan result to the local Git repository.

commit_and_push_changes(): Commit and push changes to remote repository.

delete_working_dir(): Remove temporary working dir.

Scan Codebase

class scanpipe.pipelines.scan_codebase.ScanCodebase

Scan a codebase for application packages, licenses, and copyrights.

This pipeline does not further scan the files contained in a package for license and copyrights and only considers the declared license of a package. It does not scan for system (Linux distro) packages.

copy_inputs_to_codebase_directory(): Copy input files to the project’s codebase/ directory. The code can also be copied there prior to running the Pipeline.

collect_and_create_codebase_resources(): Collect and create codebase resources.

scan_for_application_packages(): Scan unknown resources for packages information.

scan_for_files(): Scan unknown resources for copyrights, licenses, emails, and urls.

collect_and_create_license_detections(): Collect and create unique license detections from resources and package data.

Scan For Virus

class scanpipe.pipelines.scan_for_virus.ScanForVirus

Run a ClamAV scan on the codebase directory to detect virus infection.

scan_for_virus(): Run a ClamAV scan to detect virus infection.

Scan Single Package

class scanpipe.pipelines.scan_single_package.ScanSinglePackage

Scan a single package archive (or package manifest file).

This pipeline scans a single package for package metadata, declared dependencies, licenses, license clarity score and copyrights.

The output is a summary of the scan results in JSON format.

get_package_input(): Locate the package input in the project’s input/ directory.

collect_input_information(): Collect and store information about the project input.

extract_input_to_codebase_directory(): Copy or extract input to project codebase/ directory.

run_scan(): Scan extracted codebase/ content.

load_inventory_from_toolkit_scan(): Process a JSON Scan results to populate codebase resources and packages.

make_summary_from_scan_results(): Build a summary in JSON format from the generated scan results.

Fetch Scores (addon)

Warning

This pipeline is preconfigured to access the “OpenSSF Scorecard API” available at https://api.securityscorecards.dev/

class scanpipe.pipelines.fetch_scores.FetchScores

Fetch ScoreCode information for packages.

This pipeline retrieves ScoreCode data for each package in the project and stores it in the corresponding package instances.

ScoreCode data refers to metadata retrieved from the OpenSSF Scorecard tool, which evaluates open source packages based on security and quality checks. This data includes an overall score, individual check results (such as use of branch protection, fuzzing, dependency updates, etc.), the version of the scoring tool used, and the date of evaluation

check_scorecode_service_availability(): Check if the ScoreCode service is configured and available.

fetch_packages_scorecode_info(): Fetch ScoreCode information for each of the project’s discovered packages.

evaluate_compliance_alerts(): Evaluate scorecard compliance alerts for the project.