Built-in Pipelines

Pipelines in ScanCode.io are Python scripts that facilitate code analysis by executing a sequence of steps. The platform provides the following built-in pipelines:

Tip

If you are unsure which pipeline suits your requirements best, check out the Which pipeline should I use? section for guidance.

Pipeline Base Class

class scanpipe.pipelines.Pipeline

Main class for all pipelines including common step methods.

flag_empty_files()

Flag empty files.

flag_ignored_resources()

Flag ignored resources based on Project ignored_patterns setting.
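As a rough illustration of how a pipeline runs a sequence of steps, here is a minimal, self-contained sketch. It does not use the real scanpipe.pipelines.Pipeline class: MiniPipeline, Resource, and the status strings are hypothetical stand-ins, and glob-style matching with fnmatch is only an assumption about how ignored patterns might be applied.

```python
from dataclasses import dataclass
from fnmatch import fnmatch


@dataclass
class Resource:
    """Hypothetical stand-in for a codebase resource."""

    path: str
    size: int
    status: str = ""


class MiniPipeline:
    """Hypothetical step runner; not the real scanpipe Pipeline class."""

    def __init__(self, resources, ignored_patterns=()):
        self.resources = resources
        self.ignored_patterns = tuple(ignored_patterns)

    @classmethod
    def steps(cls):
        # Steps run in the order listed here.
        return (cls.flag_empty_files, cls.flag_ignored_resources)

    def flag_empty_files(self):
        # Flag zero-byte files that have no status yet.
        for resource in self.resources:
            if resource.size == 0 and not resource.status:
                resource.status = "ignored-empty-file"

    def flag_ignored_resources(self):
        # Flag unflagged resources whose path matches any ignored pattern.
        for resource in self.resources:
            if resource.status:
                continue
            if any(fnmatch(resource.path, p) for p in self.ignored_patterns):
                resource.status = "ignored-pattern"

    def execute(self):
        for step in self.steps():
            step(self)


resources = [
    Resource("src/app.py", 120),
    Resource("src/__init__.py", 0),
    Resource("docs/index.rst", 300),
]
pipeline = MiniPipeline(resources, ignored_patterns=["docs/*"])
pipeline.execute()
for resource in resources:
    print(resource.path, resource.status or "(not flagged)")
```

Keeping each step as a small method makes the execution order explicit and lets a subclass reuse the common steps in its own sequence.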

Deploy To Develop

class scanpipe.pipelines.deploy_to_develop.DeployToDevelop

Establish relationships between two code trees: deployment and development.

This pipeline expects two archive files with “from-” and “to-” filename prefixes as inputs:
  • “from-[FILENAME]” archive containing the development source code
  • “to-[FILENAME]” archive containing the deployment compiled code

get_inputs()

Locate the from and to input files.
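The prefix convention can be sketched as follows. This is an illustrative helper only: split_from_and_to is a hypothetical name and the error handling is an assumption, not the pipeline's actual implementation.

```python
from pathlib import Path


def split_from_and_to(filenames):
    """Group input filenames by their "from-" or "to-" prefix."""
    from_files, to_files = [], []
    for name in filenames:
        base = Path(name).name
        if base.startswith("from-"):
            from_files.append(name)
        elif base.startswith("to-"):
            to_files.append(name)
        # Files with any other prefix are not treated as inputs here.
    if not from_files or not to_files:
        raise ValueError('Expected at least one "from-" and one "to-" input.')
    return from_files, to_files


from_files, to_files = split_from_and_to(
    ["from-project-src.zip", "to-project-bin.jar", "notes.txt"]
)
print(from_files, to_files)
```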

extract_inputs_to_codebase_directory()

Extract input files to the project’s codebase/ directory.

extract_archives_in_place()

Extract recursively from* and to* archives in place with extractcode.

collect_and_create_codebase_resources()

Collect and create codebase resources.

fingerprint_codebase_directories()

Compute directory fingerprints for matching.

map_about_files()

Map from/ .ABOUT files to their related to/ resources.

map_checksum()

Map using SHA1 checksum.
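A checksum-based mapping could look roughly like the sketch below. It works on in-memory {path: bytes} dictionaries for simplicity; map_by_checksum and sha1_of are hypothetical helpers, and the real step operates on database-backed codebase resources rather than raw bytes.

```python
import hashlib
from collections import defaultdict


def sha1_of(content: bytes) -> str:
    """Return the hexadecimal SHA1 digest of some file content."""
    return hashlib.sha1(content).hexdigest()


def map_by_checksum(from_files, to_files):
    """Return (to_path, from_path) pairs whose contents share a SHA1 digest.

    Both arguments are hypothetical {path: bytes} mappings.
    """
    # Index the development side by digest so each lookup is O(1).
    from_index = defaultdict(list)
    for path, content in from_files.items():
        from_index[sha1_of(content)].append(path)

    relations = []
    for to_path, content in to_files.items():
        for from_path in from_index.get(sha1_of(content), []):
            relations.append((to_path, from_path))
    return relations


from_files = {"from/README.md": b"hello\n", "from/src/a.py": b"print('a')\n"}
to_files = {"to/README.md": b"hello\n", "to/a.pyc": b"\x00compiled"}
relations = map_by_checksum(from_files, to_files)
print(relations)
```

Identical digests only identify files deployed unchanged; compiled or minified files need the dedicated mapping steps that follow.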

match_archives_to_purldb()

Match selected package archives by extension to PurlDB.

find_java_packages()

Find the Java package of the .java source files.

map_java_to_class()

Map a .class compiled file to its .java source.
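The two Java steps above rely on the usual correspondence between a package declaration, a source path, and a compiled class path. The following sketch shows that correspondence under simplifying assumptions; find_java_package and expected_java_path are hypothetical helpers, and the regular expression ignores edge cases such as package statements hidden inside comments.

```python
import re

# Matches a Java package declaration such as "package org.example.app;".
PACKAGE_RE = re.compile(r"^\s*package\s+([\w.]+)\s*;", re.MULTILINE)


def find_java_package(source_text):
    """Return the declared package name, or None if there is none."""
    match = PACKAGE_RE.search(source_text)
    return match.group(1) if match else None


def expected_java_path(class_path):
    """Derive the source path a .class file would normally come from.

    Inner classes such as Outer$Inner.class are compiled from Outer.java.
    """
    directory, _, filename = class_path.rpartition("/")
    if not filename.endswith(".class"):
        raise ValueError(f"Not a .class file: {class_path}")
    top_level = filename[: -len(".class")].split("$")[0]
    java_name = f"{top_level}.java"
    return f"{directory}/{java_name}" if directory else java_name


source = "// demo\npackage org.example.app;\n\npublic class Main {}\n"
print(find_java_package(source))
print(expected_java_path("org/example/app/Main$Helper.class"))
```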

map_jar_to_source()

Map .jar files to their related source directory.

map_javascript()

Map packed or minified JavaScript, TypeScript, CSS, and SCSS files to their source.

match_directories_to_purldb()

Match selected directories in PurlDB.

match_resources_to_purldb()

Match selected files by extension in PurlDB.

map_javascript_post_purldb_match()

Map minified JavaScript files based on existing PurlDB matches.

map_javascript_path()

Map JavaScript files based on path.

map_javascript_colocation()

Map JavaScript files based on neighborhood file mapping.

map_thirdparty_npm_packages()

Map third-party packages using package.json metadata.

map_path()

Map using path similarities.
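One simple notion of path similarity is the number of trailing path segments two paths share. The sketch below uses that measure purely for illustration; it is an assumption, not necessarily the heuristic the pipeline implements, and both function names are hypothetical.

```python
def common_suffix_length(path_a, path_b):
    """Count the trailing path segments two paths have in common."""
    count = 0
    segments_a = reversed(path_a.split("/"))
    segments_b = reversed(path_b.split("/"))
    for seg_a, seg_b in zip(segments_a, segments_b):
        if seg_a != seg_b:
            break
        count += 1
    return count


def best_path_match(to_path, from_paths):
    """Return the from/ path sharing the longest path suffix, or None."""
    best, best_score = None, 0
    for candidate in from_paths:
        score = common_suffix_length(to_path, candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best


from_paths = [
    "from/project/src/util/strings.py",
    "from/project/tests/strings.py",
]
print(best_path_match("to/site-packages/util/strings.py", from_paths))
```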

flag_mapped_resources_archives_and_ignored_directories()

Flag all codebase resources that were mapped during the pipeline.

perform_house_keeping_tasks()

On deployed side
  • PurlDB match files with no-java-source and empty status; if no match is found, update status to requires-review.
  • Update status for uninteresting files.

On devel side
  • Update status for not deployed files.

scan_unmapped_to_files()

Scan unmapped/matched to/ files for copyrights, licenses, emails, and URLs and update the status to requires-review.

scan_mapped_from_for_files()

Scan mapped from/ files for copyrights, licenses, emails, and URLs.

create_local_files_packages()

Create local-files packages for codebase resources not part of a package.

flag_deployed_from_resources_with_missing_license()

Update the status for deployed from files with missing license.

Docker Image Analysis

class scanpipe.pipelines.docker.Docker

Analyze Docker images.

extract_images()

Extract images from input tarballs.

extract_layers()

Extract layers from input images.

find_images_os_and_distro()

Find the operating system and distro of input images.

collect_images_information()

Collect and store image information in a project.

collect_and_create_codebase_resources()

Collect and label all image files as CodebaseResources.

collect_and_create_system_packages()

Collect installed system packages for each layer based on the distro.

flag_uninteresting_codebase_resources()

Flag files that don’t belong to any system package.

Docker Windows Image Analysis

class scanpipe.pipelines.docker_windows.DockerWindows

Analyze Windows Docker images.

flag_known_software_packages()

Flag files from known software packages by checking common install paths.

flag_uninteresting_codebase_resources()

Flag files that are known/labelled as uninteresting.

flag_program_files_dirs_as_packages()

Report the immediate subdirectories of Program Files and Program Files (x86) as packages.
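The idea of treating immediate subdirectories as packages can be sketched with plain path handling. This example assumes forward-slash file paths and a hypothetical program_files_packages helper; it is not the pipeline's actual code.

```python
PROGRAM_FILES_DIRS = ("Program Files", "Program Files (x86)")


def program_files_packages(paths):
    """Return sorted (root, subdirectory) pairs found in the file paths."""
    found = set()
    for path in paths:
        segments = path.strip("/").split("/")
        # Require root/subdirectory/<something>, so that files sitting
        # directly in Program Files are not reported as packages.
        for index, segment in enumerate(segments[:-2]):
            if segment in PROGRAM_FILES_DIRS:
                found.add((segment, segments[index + 1]))
                break
    return sorted(found)


paths = [
    "Files/Program Files/Git/bin/git.exe",
    "Files/Program Files/Git/LICENSE.txt",
    "Files/Program Files (x86)/Steam/steam.exe",
    "Files/Program Files/readme.txt",
    "Files/Windows/System32/kernel32.dll",
]
print(program_files_packages(paths))
```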

flag_data_files_with_no_clues()

Flag data files that have no clues on their origin as uninteresting.

Find Vulnerabilities

class scanpipe.pipelines.find_vulnerabilities.FindVulnerabilities

Find vulnerabilities for packages and dependencies in the VulnerableCode database.

Vulnerability data is stored on each package and dependency instance.

check_vulnerablecode_service_availability()

Check if the VulnerableCode service is configured and available.

lookup_packages_vulnerabilities()

Check for vulnerabilities for each of the project’s discovered packages.

lookup_dependencies_vulnerabilities()

Check for vulnerabilities for each of the project’s discovered dependencies.

Inspect Manifest

class scanpipe.pipelines.inspect_manifest.InspectManifest

Inspect one or more manifest files and resolve their associated packages.

Supports:
  • BOM: SPDX document, CycloneDX BOM, AboutCode ABOUT file
  • Python: requirements.txt, setup.py, setup.cfg, Pipfile.lock
  • JavaScript: yarn.lock lockfile, npm package-lock.json lockfile
  • Java: Java JAR MANIFEST.MF, Gradle build script
  • Ruby: RubyGems gemspec manifest, RubyGems Bundler Gemfile.lock
  • Rust: Rust Cargo.lock dependencies lockfile, Rust Cargo.toml package manifest
  • PHP: PHP composer lockfile, PHP composer manifest
  • NuGet: nuspec package manifest
  • Dart: pubspec manifest, pubspec lockfile
  • OS: FreeBSD compact package manifest, Debian installed packages database

Full list available at https://scancode-toolkit.readthedocs.io/en/doc-update-licenses/reference/available_package_parsers.html

get_manifest_inputs()

Locate all the manifest files from the project’s input/ directory.

get_packages_from_manifest()

Get packages data from manifest files.
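To give a feel for what extracting package data from a manifest involves, the sketch below turns the pinned lines of a requirements.txt into PyPI package URLs. It is a deliberately simplified stand-in: the pipeline relies on ScanCode-toolkit's package parsers, whereas purls_from_requirements and its regular expression are hypothetical and skip extras, environment markers, and unpinned requirements.

```python
import re

# Matches "name==version" and stops the version at whitespace, ";" or "#".
PINNED_RE = re.compile(r"^([A-Za-z0-9][A-Za-z0-9._-]*)\s*==\s*([^\s;#]+)")


def purls_from_requirements(text):
    """Return PyPI package URLs for the pinned (name==version) lines only."""
    purls = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "-")):
            continue  # skip comments and pip options such as "-r other.txt"
        match = PINNED_RE.match(line)
        if match:
            name, version = match.groups()
            # The purl spec lowercases PyPI names and uses "-" instead of "_".
            name = name.lower().replace("_", "-")
            purls.append(f"pkg:pypi/{name}@{version}")
    return purls


requirements = """\
# runtime dependencies
Django==4.2.1
typing_extensions == 4.7.0  # pinned
requests>=2.0
-r dev.txt
"""
print(purls_from_requirements(requirements))
```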

create_resolved_packages()

Create the resolved packages and their dependencies in the database.

Load Inventory From Scan

class scanpipe.pipelines.load_inventory.LoadInventory

Load JSON/XLSX inventory files generated with ScanCode-toolkit or ScanCode.io.

Supported formats are ScanCode-toolkit JSON scan results, ScanCode.io JSON output, and ScanCode.io XLSX output.

An inventory is composed of packages, dependencies, resources, and relations.

get_inputs()

Locate all the supported input files from the project’s input/ directory.

build_inventory_from_scans()

Process JSON scan results files to populate packages, dependencies, and resources.
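The shape of such an inventory can be inspected with a few lines of standard-library code. In this sketch the top-level keys "packages", "dependencies", and "files" are assumptions about the JSON layout, and summarize_scan is a hypothetical helper that only counts entries rather than loading them into a project.

```python
import json


def summarize_scan(scan_text):
    """Count the entries under each assumed top-level inventory section."""
    data = json.loads(scan_text)
    sections = ("packages", "dependencies", "files")
    # A missing section is counted as empty instead of raising an error.
    return {name: len(data.get(name, [])) for name in sections}


scan_text = json.dumps(
    {
        "headers": [{"tool_name": "example-tool"}],
        "packages": [{"purl": "pkg:pypi/django@4.2.1"}],
        "files": [{"path": "setup.py"}, {"path": "README.rst"}],
    }
)
summary = summarize_scan(scan_text)
print(summary)
```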

Populate PurlDB

class scanpipe.pipelines.populate_purldb.PopulatePurlDB

Populate PurlDB with discovered project packages and their dependencies.

populate_purldb_with_discovered_packages()

Add DiscoveredPackage to PurlDB.

populate_purldb_with_discovered_dependencies()

Add DiscoveredDependency to PurlDB.

populate_purldb_with_detected_purls()

Add detected PURLs to PurlDB.

Root Filesystem Analysis

class scanpipe.pipelines.root_filesystems.RootFS

Analyze a Linux root filesystem, also known as rootfs.

extract_input_files_to_codebase_directory()

Extract root filesystem input archives with extractcode.

find_root_filesystems()

Find root filesystems in the project’s codebase/.

collect_rootfs_information()

Collect and store rootfs information on the project.

collect_and_create_codebase_resources()

Collect and label all image files as CodebaseResource.

collect_and_create_system_packages()

Collect installed system packages for each rootfs based on the distro. The collection of system packages is only available for known distros.
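Whether a distro is "known" has to be decided from the root filesystem itself. One common source of that information on Linux is the os-release file, so the sketch below parses its KEY=value format and checks the ID field against a set of names. The KNOWN_DISTROS set and both helpers are hypothetical, and the pipeline's actual detection logic may differ.

```python
import shlex

# Hypothetical list, for illustration only.
KNOWN_DISTROS = {"debian", "ubuntu", "alpine", "fedora"}


def parse_os_release(text):
    """Parse os-release style KEY=value lines into a dict."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, raw = line.partition("=")
        # shlex handles the shell-style quoting used by os-release files.
        parts = shlex.split(raw)
        values[key] = parts[0] if parts else ""
    return values


def is_known_distro(text):
    """Return True if the os-release ID is in the hypothetical known set."""
    return parse_os_release(text).get("ID", "").lower() in KNOWN_DISTROS


os_release = """\
PRETTY_NAME="Debian GNU/Linux 12 (bookworm)"
NAME="Debian GNU/Linux"
VERSION_ID="12"
ID=debian
"""
info = parse_os_release(os_release)
print(info["ID"], info["VERSION_ID"], is_known_distro(os_release))
```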

flag_uninteresting_codebase_resources()

Flag files that are not worth tracking and don’t belong to any system package.

scan_for_application_packages()

Scan unknown resources for package information.

match_not_analyzed_to_system_packages()

Match files with “not-yet-analyzed” status to files already belonging to system packages.

match_not_analyzed_to_application_packages()

Match files with “not-yet-analyzed” status to files already belonging to application packages.

scan_for_files()

Scan unknown resources for copyrights, licenses, emails, and URLs.

analyze_scanned_files()

Analyze single file scan results for completeness.

flag_not_analyzed_codebase_resources()

Check for any leftover files for sanity; there should be none.

Scan Codebase

class scanpipe.pipelines.scan_codebase.ScanCodebase

Scan a codebase with ScanCode-toolkit.

If the codebase consists of several packages and dependencies, it will try to resolve and scan those too.

Input files are copied to the project’s codebase/ directory and are extracted in place before running the scan. Alternatively, the code can be manually copied to the project codebase/ directory.

copy_inputs_to_codebase_directory()

Copy input files to the project’s codebase/ directory. The code can also be copied there prior to running the Pipeline.

extract_archives()

Extract archives with extractcode.

collect_and_create_codebase_resources()

Collect and create codebase resources.

scan_for_application_packages()

Scan unknown resources for package information.

scan_for_files()

Scan unknown resources for copyrights, licenses, emails, and URLs.

Scan Codebase Package

class scanpipe.pipelines.scan_codebase_packages.ScanCodebasePackages

Scan a codebase for PURLs without assembling full packages/dependencies.

This Pipeline is intended for gathering PURL information from a codebase without the overhead of full package assembly.

scan_for_application_packages()

Scan unknown resources for package information.

Scan Package

class scanpipe.pipelines.scan_package.ScanPackage

Scan a single package archive with ScanCode-toolkit.

The output is a summary of the scan results in JSON format.

get_package_archive_input()

Locate the input package archive in the project’s input/ directory.

collect_archive_information()

Collect and store information about the input archive in the project.

extract_archive_to_codebase_directory()

Extract package archive with extractcode.

run_scancode()

Scan extracted codebase/ content.

load_inventory_from_toolkit_scan()

Process JSON scan results to populate codebase resources and packages.

make_summary_from_scan_results()

Build a summary in JSON format from the generated scan results.