Pipes

Generic

scanpipe.pipes.make_codebase_resource(project, location, **extra_fields)

Creates a CodebaseResource instance in the database for the given project.

The provided location is the absolute path of this resource. It must be rooted in project.codebase_path as only the relative path within the project codebase/ directory is stored in the database.

Extra fields can be provided as keyword arguments to this function call:

>>> make_codebase_resource(
...     project=project,
...     location=resource.location,
...     rootfs_path=resource.path,
...     tag=layer_tag,
... )

In this example, rootfs_path is an optional path relative to a rootfs root within an Image/VM filesystem context, for example “/var/log/file.log”.

All paths use POSIX separators.

If a CodebaseResource already exists in the project with the same path, the error raised on save() is not stored in the database and the creation is skipped.
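The relationship between the absolute location and the stored relative path can be sketched with pathlib. This is only an illustration, not the library's code: the helper name relative_resource_path and the sample directory are hypothetical.

```python
from pathlib import PurePosixPath


def relative_resource_path(codebase_path, location):
    """Return the POSIX path of ``location`` relative to ``codebase_path``.

    Hypothetical helper: raises ValueError when ``location`` is not rooted
    in ``codebase_path``, mirroring the requirement described above.
    """
    return PurePosixPath(location).relative_to(codebase_path).as_posix()


codebase = "/projects/demo/codebase"  # assumed value of project.codebase_path
print(relative_resource_path(codebase, codebase + "/var/log/file.log"))
```

A location outside the codebase directory makes relative_to() raise ValueError, which is one way such a precondition could be enforced.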

scanpipe.pipes.update_or_create_package(project, package_data, codebase_resource=None)

Gets, updates or creates a DiscoveredPackage then returns it. Uses the project and package_data mapping to look up and create the DiscoveredPackage, using its Package URL and package_uid as a unique key.

scanpipe.pipes.update_or_create_dependencies(project, dependency_data, strip_datafile_path_root=False)

Gets, updates or creates a DiscoveredDependency then returns it. Uses the project and dependency_data mapping to look up and create the DiscoveredDependency, using its dependency_uid and for_package_uid as a unique key.

If strip_datafile_path_root is True, then DiscoveredDependency.create_from_data() will strip the root path segment from the datafile_path of dependency_data before looking up the corresponding CodebaseResource for datafile_path. This is used when dependency data is imported from a scancode-toolkit scan, where the root path segments are not stripped from datafile_path values.

scanpipe.pipes.analyze_scanned_files(project)

Sets the status for CodebaseResource to unknown or no license.

scanpipe.pipes.tag_not_analyzed_codebase_resources(project)

Flags any of the project’s CodebaseResource without a status as “not-analyzed”.

scanpipe.pipes.normalize_path(path)

Returns a normalized path from a path string.

scanpipe.pipes.strip_root(location)

Returns the provided location without the root directory.
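A minimal sketch of this behavior, assuming that "root directory" means the first path segment of a POSIX-style location. The function name strip_root_sketch is hypothetical.

```python
def strip_root_sketch(location):
    """Drop the first path segment of a POSIX-style location."""
    segments = str(location).strip("/").split("/")
    return "/".join(segments[1:])


print(strip_root_sketch("/root/dir/file.txt"))
```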

scanpipe.pipes.filename_now(sep='-')

Returns the current date and time in ISO format suitable for a filename.
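One plausible way to build such a string is sketched below. It assumes that "suitable for a filename" means replacing the colons of the ISO time with the separator. The now parameter is a hypothetical addition that makes the sketch testable, and sep must be a single character because datetime.isoformat() requires it.

```python
from datetime import datetime


def filename_now_sketch(sep="-", now=None):
    """Return an ISO-like timestamp without colons, usable in a filename."""
    now = now or datetime.now()
    # isoformat() gives e.g. "2023-01-02-03:04:05" when sep is "-".
    return now.isoformat(sep=sep, timespec="seconds").replace(":", sep)


print(filename_now_sketch(now=datetime(2023, 1, 2, 3, 4, 5)))
```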

scanpipe.pipes.count_group_by(queryset, field_name)

Returns a summary of all existing values for the provided field_name on the queryset, including the count of each entry, as a dictionary.
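The shape of the returned summary can be illustrated in plain Python. This sketch works on a list of dictionaries rather than a Django QuerySet, and the name count_group_by_sketch is hypothetical. The real function operates on a queryset.

```python
from collections import Counter


def count_group_by_sketch(rows, field_name):
    """Map each distinct value of ``field_name`` to its number of rows."""
    return dict(Counter(row[field_name] for row in rows))


rows = [
    {"status": "scanned"},
    {"status": "scanned"},
    {"status": "ignored"},
]
print(count_group_by_sketch(rows, "status"))
```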

scanpipe.pipes.get_bin_executable(filename)

Returns the location of the filename executable binary.

scanpipe.pipes.run_command(cmd, log_output=False)

Returns (exitcode, output) of executing the provided cmd in a shell. cmd can be provided as a string or as a list of arguments.

If log_output is True, the stdout and stderr of the process will be captured and streamed to the logger.
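The (exitcode, output) contract with log_output left at False can be sketched with the standard library. This is an assumption-laden illustration, not the library's code: it assumes a POSIX shell, and the name run_command_sketch is hypothetical.

```python
import shlex
import subprocess


def run_command_sketch(cmd):
    """Run ``cmd`` in a shell and return ``(exitcode, output)``."""
    if isinstance(cmd, (list, tuple)):
        # Quote each argument so the joined string is safe for the shell.
        cmd = shlex.join(cmd)
    return subprocess.getstatusoutput(cmd)


exitcode, output = run_command_sketch(["echo", "hello world"])
print(exitcode, output)
```

subprocess.getstatusoutput() merges stderr into the output and strips one trailing newline. The streaming behavior of log_output is not covered by this sketch.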

scanpipe.pipes.remove_prefix(text, prefix)

Removes the prefix from text.

Codebase

scanpipe.pipes.codebase.get_tree(resource, fields, codebase=None)

Returns a tree as a dictionary structure starting from the provided resource.

The following classes are supported for the input resource object:
  • scanpipe.models.CodebaseResource

  • commoncode.resource.Resource

The data included for each child is controlled with the fields argument. The codebase is only required in the context of a commoncode Resource input.
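The role of the fields argument can be sketched with a small stand-in resource class. The Node class, the get_tree_sketch name, and the exact shape of the resulting dictionary (a "children" key holding nested dictionaries) are assumptions made for this illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a resource with children."""

    name: str
    path: str
    children: list = field(default_factory=list)


def get_tree_sketch(node, fields):
    """Build a nested dict holding only the requested ``fields`` per node."""
    tree = {name: getattr(node, name) for name in fields}
    children = [get_tree_sketch(child, fields) for child in node.children]
    if children:
        tree["children"] = children
    return tree


root = Node("src", "src", [Node("main.py", "src/main.py")])
print(get_tree_sketch(root, fields=["name"]))
```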

class scanpipe.pipes.codebase.ProjectCodebase(project)

Represents the codebase of a project stored in the database. A Codebase is a tree of Resources.

__init__(project)

Compliance

scanpipe.pipes.compliance.tag_compliance_files(project)

Tags compliance files status for the provided project.

scanpipe.pipes.compliance.analyze_compliance_licenses(project)

Scans compliance licenses status for the provided project.

Docker

scanpipe.pipes.docker.get_tarballs_from_inputs(project)

Returns the tarballs from the project input/ work directory. Supported file extensions: .tar, .tar.gz, .tgz.
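Filtering a directory by these extensions can be sketched as follows. The find_tarballs name is hypothetical, and whether the library matches extensions case-sensitively is not stated here, so this sketch simply assumes an exact, case-sensitive suffix match.

```python
import tempfile
from pathlib import Path

TARBALL_EXTENSIONS = (".tar", ".tar.gz", ".tgz")


def find_tarballs(input_path):
    """Return the files directly under ``input_path`` that look like tarballs."""
    return sorted(
        path
        for path in Path(input_path).iterdir()
        if path.is_file() and path.name.endswith(TARBALL_EXTENSIONS)
    )


with tempfile.TemporaryDirectory() as input_dir:
    for name in ("image.tar", "layers.tgz", "notes.txt"):
        Path(input_dir, name).touch()
    found_names = [path.name for path in find_tarballs(input_dir)]

print(found_names)
```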

scanpipe.pipes.docker.extract_images_from_inputs(project)

Collects all the tarballs from the project input/ work directory, extracts each tarball to the tmp/ work directory and collects the images.

Returns the images and an errors list of error messages that may have happened during the extraction.

scanpipe.pipes.docker.extract_image_from_tarball(input_tarball, extract_target, verify=True)

Extracts images from an input_tarball to an extract_target directory Path object and collects the extracted images.

Returns the images and an errors list of error messages that may have happened during the extraction.

scanpipe.pipes.docker.extract_layers_from_images(project, images)

Extracts all layers from the provided images into the project codebase work directory.

Returns an errors list of error messages that may have occurred during the extraction.

scanpipe.pipes.docker.extract_layers_from_images_to_base_path(base_path, images)

Extracts all layers from the provided images into the base_path work directory.

Returns an errors list of error messages that may have occurred during the extraction.

scanpipe.pipes.docker.get_image_data(image, layer_path_segments=2)

Returns a mapping of image-related data given an image. Keeps only layer_path_segments trailing layer location segments (or keeps the locations unmodified if layer_path_segments is 0).
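The trimming of layer locations can be sketched on a plain path string. The helper name keep_trailing_segments and the sample path are hypothetical, and the sketch assumes POSIX separators.

```python
def keep_trailing_segments(location, layer_path_segments=2):
    """Keep only the last ``layer_path_segments`` segments of ``location``.

    A value of 0 leaves the location unmodified.
    """
    if not layer_path_segments:
        return location
    segments = location.strip("/").split("/")
    return "/".join(segments[-layer_path_segments:])


print(keep_trailing_segments("/tmp/extract/image/0690c8/layer.tar"))
```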

scanpipe.pipes.docker.get_layer_tag(image_id, layer_id, layer_index, id_length=6)

Returns a “tag” crafted from the provided image_id, layer_id, and layer_index. The purpose of this tag is to be short, clear and sortable.

For instance, given an image with an id:

785df58b6b3e120f59bce6cd10169a0c58b8837b24f382e27593e2eea011a0d8

and two layers from bottom to top as:

0690c89adf3e8c306d4ced085fc16d1d104dcfddd6dc637e141fa78be242a707
7a1d89d2653e8e4aa9011fd95034a4857109d6636f2ad32df470a196e5dd1585

we would get these two tags:

img-785df5-layer-01-0690c8
img-785df5-layer-02-7a1d89
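The tag format shown in this example can be reproduced with a short sketch. It assumes that the layer index is 1-based and zero-padded to two digits, as the sample tags suggest. The name layer_tag_sketch is hypothetical.

```python
def layer_tag_sketch(image_id, layer_id, layer_index, id_length=6):
    """Craft a short, sortable tag from truncated image and layer ids."""
    short_image_id = image_id[:id_length]
    short_layer_id = layer_id[:id_length]
    return f"img-{short_image_id}-layer-{layer_index:02}-{short_layer_id}"


image_id = "785df58b6b3e120f59bce6cd10169a0c58b8837b24f382e27593e2eea011a0d8"
layer_ids = [
    "0690c89adf3e8c306d4ced085fc16d1d104dcfddd6dc637e141fa78be242a707",
    "7a1d89d2653e8e4aa9011fd95034a4857109d6636f2ad32df470a196e5dd1585",
]
# Layers are numbered from bottom to top, starting at 1 (assumption).
tags = [
    layer_tag_sketch(image_id, layer_id, index)
    for index, layer_id in enumerate(layer_ids, start=1)
]
print(tags)
```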

scanpipe.pipes.docker.create_codebase_resources(project, image)

Creates the CodebaseResource for an image in a project.

scanpipe.pipes.docker.scan_image_for_system_packages(project, image, detect_licenses=True)

Given a project and an image, scans the image layer by layer for installed system packages and creates a DiscoveredPackage for each.

Then, for each installed DiscoveredPackage file, checks if it exists as a CodebaseResource. If it exists, relates that CodebaseResource to its DiscoveredPackage; otherwise, keeps that as a missing file.

scanpipe.pipes.docker.tag_whiteout_codebase_resources(project)

Marks overlayfs/AUFS whiteout special files CodebaseResource as “ignored-whiteout”. See https://github.com/opencontainers/image-spec/blob/master/layer.md#whiteouts for details.
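In the OCI image layer specification, whiteout files are identified by a ".wh." filename prefix, and an opaque whiteout is named ".wh..wh..opq". A minimal detection sketch, with the hypothetical helper name is_whiteout, could look like this. How the library itself selects the resources is not shown here.

```python
from pathlib import PurePosixPath

WHITEOUT_PREFIX = ".wh."
OPAQUE_WHITEOUT = ".wh..wh..opq"


def is_whiteout(path):
    """Return True if the file name of ``path`` marks a whiteout entry."""
    return PurePosixPath(path).name.startswith(WHITEOUT_PREFIX)


print(is_whiteout("usr/lib/.wh.removed.so"))
```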

Windows

scanpipe.pipes.windows.package_getter(root_dir, **kwargs)

Returns installed package objects.

scanpipe.pipes.windows.tag_uninteresting_windows_codebase_resources(project)

Tags known uninteresting files as uninteresting.

scanpipe.pipes.windows.tag_installed_package_files(project, root_dir_pattern, package, q_objects=None)

For all CodebaseResources from project whose rootfs_path starts with root_dir_pattern, add package to the discovered_packages of each CodebaseResource and set the status.

scanpipe.pipes.windows.tag_known_software(project)

Finds Windows software in project by checking CodebaseResources to see if their rootfs_path is under a known software root directory. If there are CodebaseResources that are under a known software root directory, a DiscoveredPackage is created for that software package and all files under that software package’s root directory are considered installed files for that package.

Currently, we are only checking for Python and openjdk in Windows Docker image layers.

If a version number cannot be determined for an installed software Package, then a version number of “nv” will be set.

scanpipe.pipes.windows.tag_program_files(project)

Reports all subdirectories of Program Files and Program Files (x86) as Packages.

If a Package is detected in this manner, then we will attempt to determine the version from the path. If a version cannot be determined, a version of nv will be set for the Package.

Fetch

scanpipe.pipes.fetch.fetch_http(uri, to=None)

Downloads a given uri in a temporary directory and returns the directory’s path.

exception scanpipe.pipes.fetch.FetchDockerImageError

scanpipe.pipes.fetch.get_docker_image_platform(docker_reference)

Returns a platform mapping of a docker reference. If there is more than one, returns the first one by default.

scanpipe.pipes.fetch.fetch_docker_image(docker_reference, to=None)

Fetches a Docker image from the provided docker_reference, a docker:// reference URL. Returns a download object.

Docker references are documented here: https://github.com/containers/skopeo/blob/0faf16017/docs/skopeo.1.md#image-names

scanpipe.pipes.fetch.fetch_urls(urls)

Fetches the provided urls list. The urls can also be provided as a string containing one URL per line. Returns the fetched URLs as download objects and a list of errors.
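Accepting either a list or a multi-line string can be handled with a small normalization step before fetching. The sketch below is hypothetical (including the name normalize_urls and the example URLs) and assumes that blank lines and surrounding whitespace are ignored.

```python
def normalize_urls(urls):
    """Return a list of URLs from a list or a one-URL-per-line string."""
    if isinstance(urls, str):
        urls = urls.splitlines()
    return [url.strip() for url in urls if url.strip()]


print(normalize_urls("https://example.com/a.tar\nhttps://example.com/b.tgz\n"))
```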

Input

scanpipe.pipes.input.copy_input(input_location, dest_path)

Copies the input_location to the dest_path.

scanpipe.pipes.input.copy_inputs(input_locations, dest_path)

Copies the provided input_locations to the dest_path.

scanpipe.pipes.input.move_inputs(inputs, dest_path)

Moves the provided inputs to the dest_path.

Output

scanpipe.pipes.output.get_queryset(project, model_name)

Common source for getting consistent QuerySets across all supported outputs (json, xlsx, csv, …)

scanpipe.pipes.output.queryset_to_csv_file(queryset, fieldnames, output_file)

Outputs csv content generated from the provided queryset objects to the output_file. The fields to be included as columns and their order are controlled by the fieldnames list.

scanpipe.pipes.output.queryset_to_csv_stream(queryset, fieldnames, output_stream)

Outputs csv content generated from the provided queryset objects to the output_stream. The fields to be included as columns and their order are controlled by the fieldnames list.
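How a fieldnames list controls the columns and their order can be sketched with csv.DictWriter. This illustration writes plain dictionaries instead of queryset objects, and the name rows_to_csv_stream is hypothetical.

```python
import csv
import io


def rows_to_csv_stream(rows, fieldnames, output_stream):
    """Write ``rows`` as CSV, keeping only ``fieldnames`` in that order."""
    writer = csv.DictWriter(
        output_stream, fieldnames=fieldnames, extrasaction="ignore"
    )
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


stream = io.StringIO()
rows = [{"path": "src/a.py", "size": 10, "status": "scanned"}]
rows_to_csv_stream(rows, ["status", "path"], stream)
print(stream.getvalue())
```

Keys that are not listed in fieldnames, such as size above, are dropped because of extrasaction="ignore".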

scanpipe.pipes.output.to_csv(project)

Generates output for the provided project in csv format. Since the csv format does not support multiple tabs, one file is created per object type. The output files are created in the project output/ directory. Returns a list of paths of the generated output files.

scanpipe.pipes.output.to_json(project)

Generates output for the provided project in JSON format. The output file is created in the project output/ directory. Returns the path of the generated output file.

scanpipe.pipes.output.queryset_to_xlsx_worksheet(queryset, workbook, exclude_fields=())

Adds a new worksheet to the workbook xlsxwriter.Workbook using the queryset. The queryset “model_name” is used as a name for the “worksheet”. Excludes fields listed in the exclude_fields sequence of field names.

Adds an extra trailing “xlsx_errors” column with conversion error messages, if any. Returns the number of conversion errors.

scanpipe.pipes.output.to_xlsx(project)

Generates output for the provided project in XLSX format. The output file is created in the project “output/” directory. Returns the path of the generated output file.

Note that each XLSX worksheet contains an extra “xlsx_errors” column with possible error messages for a row when the data converted to XLSX exceeds the limits of what can be stored in a cell.

RootFS

exception scanpipe.pipes.rootfs.DistroNotFound

exception scanpipe.pipes.rootfs.DistroNotSupported

class scanpipe.pipes.rootfs.RootFs(location, distro=None)

A root filesystem.

classmethod from_project_codebase(project)

Returns RootFs objects collected from the project’s “codebase” directory. Each directory in codebase/ is considered as the root of a root filesystem.

get_resources(with_dir=False)

Returns a Resource for each file in this rootfs.

get_installed_packages(packages_getter)

Returns tuples of (package_url, package) for installed packages found in this rootfs layer using the packages_getter function or callable.

The packages_getter() function should:

  • Accept a first argument string that is the root directory of the filesystem of this rootfs

  • Return tuples of (package_url, package) where package_url is a package_url string that uniquely identifies a package, and package is an object that represents a package (typically a scancode-toolkit packagedcode.models.Package class or some nested mapping with the same structure).

The packages_getter function would typically query the system packages database, such as an RPM database or similar, to collect the list of installed system packages.
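A callable following this contract could look like the sketch below. The hard-coded package data and the function name example_packages_getter are hypothetical. A real getter would read the package database found under root_dir.

```python
def example_packages_getter(root_dir, **kwargs):
    """Yield (package_url, package) tuples for packages under ``root_dir``.

    Hypothetical getter: the data is hard-coded instead of being read
    from an installed packages database.
    """
    installed = [
        {"type": "deb", "namespace": "debian", "name": "bash", "version": "5.1-2"},
        {"type": "deb", "namespace": "debian", "name": "curl", "version": "7.74.0-1"},
    ]
    for package in installed:
        package_url = "pkg:{type}/{namespace}/{name}@{version}".format(**package)
        yield package_url, package


for package_url, package in example_packages_getter("/path/to/rootfs"):
    print(package_url, package["name"])
```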

__init__(location, distro=None) → None

Method generated by attrs for class RootFs.

scanpipe.pipes.rootfs.get_resources(location, with_dir=False)

Returns the Resource objects found in the location root directory of a rootfs.

scanpipe.pipes.rootfs.create_codebase_resources(project, rootfs)

Creates the CodebaseResource for a rootfs in project.

scanpipe.pipes.rootfs.has_hash_diff(install_file, codebase_resource)

Returns True if one of the hashes available on both install_file and codebase_resource, by hash type, is different. For example, Alpine uses SHA1 while Debian uses MD5; we prefer the strongest hash that’s present.
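The comparison by hash type can be sketched as follows. The list of hash types, their strongest-first order, and the name has_hash_diff_sketch are assumptions. SimpleNamespace objects stand in for the install file and the codebase resource.

```python
from types import SimpleNamespace

# Assumed hash attributes, checked from strongest to weakest.
HASH_TYPES = ("sha512", "sha256", "sha1", "md5")


def has_hash_diff_sketch(install_file, codebase_resource):
    """Return True if a hash present on both objects has different values."""
    for hash_type in HASH_TYPES:
        expected = getattr(install_file, hash_type, None)
        actual = getattr(codebase_resource, hash_type, None)
        # Only compare a hash type when both sides provide a value.
        if expected and actual and expected != actual:
            return True
    return False


install_file = SimpleNamespace(sha1="aaa", md5=None)
resource = SimpleNamespace(sha1="bbb", md5="ccc")
print(has_hash_diff_sketch(install_file, resource))
```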

scanpipe.pipes.rootfs.package_getter(root_dir, **kwargs)

Returns installed package objects.

scanpipe.pipes.rootfs.scan_rootfs_for_system_packages(project, rootfs, detect_licenses=True)

Given a project Project and a rootfs RootFs, scans the rootfs for installed system packages, and creates a DiscoveredPackage for each.

Then, for each installed DiscoveredPackage file, checks if it exists as a CodebaseResource. If it exists, relates that CodebaseResource to its DiscoveredPackage; otherwise, keeps that as a missing file.

scanpipe.pipes.rootfs.get_resource_with_md5(project, status)

Returns a queryset of CodebaseResource from a project that has a status, a non-empty size, and md5.

scanpipe.pipes.rootfs.match_not_analyzed(project, reference_status='system-package', not_analyzed_status='not-analyzed')

Given a project Project:

  1. Build an MD5 index of files assigned to a package that has a status of reference_status.

  2. Attempt to match resources with status not_analyzed_status to that index.

  3. Relate each matched CodebaseResource to the matching DiscoveredPackage and set its status.
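These three steps can be sketched with plain dictionaries in place of database objects. The names match_by_md5 and matched_status are hypothetical, and so is the status assigned to matched resources, since the description above does not specify it.

```python
def match_by_md5(reference_resources, not_analyzed_resources, matched_status):
    """Relate not-analyzed resources to packages through an MD5 index."""
    # Step 1: index the package of each reference resource by its MD5.
    md5_index = {
        resource["md5"]: resource["package"]
        for resource in reference_resources
        if resource.get("md5")
    }
    matched = []
    for resource in not_analyzed_resources:
        # Step 2: look up the resource MD5 in the index.
        package = md5_index.get(resource.get("md5"))
        if package:
            # Step 3: relate the resource to the package and set its status.
            resource["package"] = package
            resource["status"] = matched_status
            matched.append(resource)
    return matched


reference = [{"path": "usr/bin/tool", "md5": "abc", "package": "pkg:deb/debian/tool@1.0"}]
candidates = [
    {"path": "opt/copy-of-tool", "md5": "abc", "status": "not-analyzed"},
    {"path": "opt/other", "md5": "zzz", "status": "not-analyzed"},
]
matched = match_by_md5(reference, candidates, matched_status="system-package")
print([resource["path"] for resource in matched])
```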

scanpipe.pipes.rootfs.tag_empty_codebase_resources(project)

Tags empty files as ignored.

scanpipe.pipes.rootfs.tag_uninteresting_codebase_resources(project)

Checks any file that doesn’t belong to any system package and determines, using a few heuristics, whether it is:

  • A temp file

  • Generated

  • A log file of sorts (such as var)

scanpipe.pipes.rootfs.tag_ignorable_codebase_resources(project)

Using the glob patterns from commoncode.ignore of ignorable files/directories, tag codebase resources from project if their paths match an ignorable pattern.
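Matching paths against glob patterns can be sketched with the standard fnmatch module. The patterns below are hypothetical examples and are not the actual content of commoncode.ignore. The name is_ignorable is also an assumption.

```python
from fnmatch import fnmatchcase

# Hypothetical patterns, for illustration only.
IGNORABLE_PATTERNS = ("*.pyc", "*/.git/*", "*~")


def is_ignorable(path, patterns=IGNORABLE_PATTERNS):
    """Return True if ``path`` matches at least one glob pattern."""
    return any(fnmatchcase(path, pattern) for pattern in patterns)


paths = ["src/module.py", "src/module.pyc", "repo/.git/config"]
print([path for path in paths if is_ignorable(path)])
```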

scanpipe.pipes.rootfs.tag_data_files_with_no_clues(project)

Tags CodebaseResources that have a file type of data and no detected clues to be uninteresting.

scanpipe.pipes.rootfs.tag_media_files_as_uninteresting(project)

Tags CodebaseResources that are media files to be uninteresting.

ScanCode

scanpipe.pipes.scancode.logger = <Logger scanpipe.pipes (INFO)>

Utilities to deal with ScanCode toolkit features and objects.

scanpipe.pipes.scancode.get_max_workers(keep_available)

Returns the SCANCODEIO_PROCESSES value if defined in the settings, or a default value based on the number of available CPUs, minus the provided keep_available value.

On operating systems where the multiprocessing start method is not “fork” but, for example, “spawn” (such as macOS), multiprocessing and threading are disabled by default by returning a max_workers value of -1.
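The decision logic described here can be sketched as a pure function. The parameters configured_processes and start_method are hypothetical stand-ins for the SCANCODEIO_PROCESSES setting and the detected start method, and the lower bound of 1 worker is an assumption.

```python
import multiprocessing
import os


def get_max_workers_sketch(keep_available, configured_processes=None, start_method=None):
    """Return the number of workers to use, or -1 to disable multiprocessing."""
    if configured_processes is not None:
        # An explicit setting always wins.
        return configured_processes
    start_method = start_method or multiprocessing.get_start_method()
    if start_method != "fork":
        return -1
    cpu_count = os.cpu_count() or 1
    # Assumed lower bound: always keep at least one worker.
    return max(cpu_count - keep_available, 1)


print(get_max_workers_sketch(keep_available=1, start_method="spawn"))
```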

scanpipe.pipes.scancode.extract_archive(location, target)

Extracts a single archive or compressed file at location to the target directory.

Returns a list of extraction errors.

Wrapper of the extractcode.api.extract_archive function.

scanpipe.pipes.scancode.extract_archives(location, recurse=False)

Extracts all archives at location and returns errors.

Archives and compressed files are extracted in a new directory named “<file_name>-extract” created in the same directory as each extracted archive.

If recurse is True, extract nested archives-in-archives recursively.

Returns a list of extraction errors.

Wrapper of the extractcode.api.extract_archives function.

scanpipe.pipes.scancode.get_resource_info(location)

Returns a mapping suitable for the creation of a new CodebaseResource.

scanpipe.pipes.scancode.scan_file(location, with_threading=True)

Runs a license, copyright, email, and url scan on a provided location, using the scancode-toolkit direct API.

Returns a dictionary of scan results and a list of errors.

scanpipe.pipes.scancode.scan_for_package_data(location, with_threading=True)

Runs a package scan on provided location using the scancode-toolkit direct API.

Returns a dict of scan results and a list of errors.

scanpipe.pipes.scancode.save_scan_file_results(codebase_resource, scan_results, scan_errors)

Saves the resource scan file results in the database. Creates project errors if any occurred during the scan.

scanpipe.pipes.scancode.save_scan_package_results(codebase_resource, scan_results, scan_errors)

Saves the resource scan package results in the database. Creates project errors if any occurred during the scan.

scanpipe.pipes.scancode.scan_for_files(project)

Runs a license, copyright, email, and url scan on files without a status for a project.

Multiprocessing is enabled by default on this pipe; the number of processes can be controlled through the SCANCODEIO_PROCESSES setting.

scanpipe.pipes.scancode.scan_for_application_packages(project)

Runs a package scan on files without a status for a project, then creates DiscoveredPackage and DiscoveredDependency instances from the detected package data.

Multiprocessing is enabled by default on this pipe; the number of processes can be controlled through the SCANCODEIO_PROCESSES setting.

scanpipe.pipes.scancode.add_resource_to_package(package_uid, resource, project)

Relates a DiscoveredPackage to resource from project using package_uid.

Adds a ProjectError when the DiscoveredPackage could not be fetched using the provided package_uid.

scanpipe.pipes.scancode.assemble_packages(project)

Creates instances of DiscoveredPackage and DiscoveredDependency for project from the parsed package data present in the CodebaseResources of project.

scanpipe.pipes.scancode.run_scancode(location, output_file, options, raise_on_error=False)

Scans the location content and writes the results into an output_file. The scancode executable will run using the provided options. If raise_on_error is enabled, a ScancodeError will be raised if the exitcode is greater than 0.

scanpipe.pipes.scancode.get_virtual_codebase(project, input_location)

Returns a ScanCode virtual codebase built from the JSON scan file located at the input_location.

scanpipe.pipes.scancode.create_codebase_resources(project, scanned_codebase)

Saves the resources of a ScanCode scanned_codebase scancode.resource.Codebase object to the database as a CodebaseResource of the project. This function can be used to extend an existing project Codebase with new CodebaseResource objects, as the existing objects (based on the path) will be skipped.

scanpipe.pipes.scancode.create_discovered_packages(project, scanned_codebase)

Saves the packages of a ScanCode scanned_codebase scancode.resource.Codebase object to the database as a DiscoveredPackage of project.

scanpipe.pipes.scancode.create_discovered_dependencies(project, scanned_codebase, strip_datafile_path_root=False)

Saves the dependencies of a ScanCode scanned_codebase scancode.resource.Codebase object to the database as a DiscoveredDependency of project.

If strip_datafile_path_root is True, then DiscoveredDependency.create_from_data() will strip the root path segment from the datafile_path of dependency_data before looking up the corresponding CodebaseResource for datafile_path. This is used when dependency data is imported from a scancode-toolkit scan, where the root path segments are not stripped from datafile_path values.

scanpipe.pipes.scancode.set_codebase_resource_for_package(codebase_resource, discovered_package)

Assigns the discovered_package to the codebase_resource and sets its status to “application-package”.

scanpipe.pipes.scancode.make_results_summary(project, scan_results_location)

Extracts selected sections of the Scan results, such as the summary license_clarity_score, and license_matches related data. The key_files are also collected and injected in the summary output.

scanpipe.pipes.scancode.create_inventory_from_scan(project, input_location)

Creates CodebaseResource and DiscoveredPackage instances loaded from the scan results located at input_location.