Pipes

Generic

scanpipe.pipes.make_codebase_resource(project, location, save=True, **extra_fields)

Create a CodebaseResource instance in the database for the given project.

The provided location is the absolute path of this resource. It must be rooted in project.codebase_path as only the relative path within the project codebase/ directory is stored in the database.

Extra fields can be provided as keyword arguments to this function call:

make_codebase_resource(
    project=project,
    location=resource.location,
    rootfs_path=resource.path,
    tag=layer_tag,
)

In this example, rootfs_path is an optional path relative to a rootfs root within an Image/VM filesystem context, for example “/var/log/file.log”.

All paths use the POSIX separators.

If a CodebaseResource already exists in the project with the same path, the error raised on save() is not stored in the database and the creation is skipped.

scanpipe.pipes.update_or_create_resource(project, resource_data)

Get, update or create a CodebaseResource then return it.

scanpipe.pipes.update_or_create_package(project, package_data, codebase_resources=None)

Get, update or create a DiscoveredPackage then return it. Use the project and package_data mapping to look up and create the DiscoveredPackage using its Package URL and package_uid as a unique key. The package can be associated with codebase_resources by providing a list or queryset of resources.

scanpipe.pipes.update_or_create_dependency(project, dependency_data, for_package=None, strip_datafile_path_root=False)

Get, update or create a DiscoveredDependency then return it. Use the project and dependency_data mapping to look up and create the DiscoveredDependency using its dependency_uid and for_package_uid as a unique key.

If strip_datafile_path_root is True, then DiscoveredDependency.create_from_data() will strip the root path segment from the datafile_path of dependency_data before looking up the corresponding CodebaseResource for datafile_path. This is used in the case where Dependency data is imported from a scancode-toolkit scan, where the root path segments are not stripped for datafile_path.

scanpipe.pipes.get_or_create_relation(project, relation_data)

Get or create a CodebaseRelation then return it. Support for update is not useful as there are no fields on the model that could be updated.

scanpipe.pipes.normalize_path(path)

Return a normalized path from a path string.

scanpipe.pipes.strip_root(location)

Return the provided location without the root directory.
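As a rough illustration of "without the root directory", the sketch below drops the first segment of a POSIX-style path. The name strip_root_sketch is hypothetical and the handling of a leading "/" is an assumption of this sketch, not a statement about the actual scanpipe implementation.

```python
from pathlib import PurePosixPath


def strip_root_sketch(location):
    """Drop the first path segment of a POSIX-style location (hypothetical helper)."""
    # Assumption: a leading "/" is ignored so absolute and relative inputs behave alike.
    parts = [part for part in PurePosixPath(location).parts if part != "/"]
    return "/".join(parts[1:])


relative = strip_root_sketch("codebase/src/main.py")
```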

scanpipe.pipes.filename_now(sep='-')

Return the current date and time in ISO format, suitable for use in a filename.
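One way to picture a filename-friendly timestamp is shown below. This is a minimal sketch with a hypothetical name (filename_now_sketch); the exact output format of the real function may differ.

```python
from datetime import datetime


def filename_now_sketch(sep="-"):
    """Join date and time fields with sep so that no ":" appears (hypothetical helper)."""
    # Colons are avoided because they are not valid in file names on some platforms.
    fields = datetime.now().strftime("%Y %m %d %H %M %S").split()
    return sep.join(fields)


stamp = filename_now_sketch()
```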

scanpipe.pipes.count_group_by(queryset, field_name)

Return a summary of all existing values for the provided field_name on the queryset, including the count of each entry, as a dictionary.
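The shape of the returned dictionary can be illustrated with plain Python data. The sketch below works on a list of dicts instead of a Django queryset, and count_group_by_sketch is a hypothetical stand-in, not the scanpipe function.

```python
from collections import Counter


def count_group_by_sketch(rows, field_name):
    """Count how often each value of field_name occurs in a list of dicts."""
    return dict(Counter(row.get(field_name) for row in rows))


# Made-up rows standing in for queryset results.
rows = [{"status": "scanned"}, {"status": "scanned"}, {"status": "ignored"}]
summary = count_group_by_sketch(rows, "status")
```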

scanpipe.pipes.get_bin_executable(filename)

Return the location of the filename executable binary.

scanpipe.pipes.run_command(cmd, log_output=False)

Return (exitcode, output) of executing the provided cmd in a shell. cmd can be provided as a string or as a list of arguments.

If log_output is True, the stdout and stderr of the process will be captured and streamed to the logger.

scanpipe.pipes.remove_prefix(text, prefix)

Remove the prefix from text. Note that the built-in str.removeprefix was added in Python 3.9, but this function is kept for Python 3.8 support. https://docs.python.org/3.9/library/stdtypes.html#str.removeprefix
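A Python 3.8 compatible equivalent of str.removeprefix can be sketched as follows. The function name remove_prefix_sketch is hypothetical; the behavior mirrors the documented semantics of the built-in method.

```python
def remove_prefix_sketch(text, prefix):
    """Return text without a leading prefix, or text unchanged if it is absent."""
    if prefix and text.startswith(prefix):
        return text[len(prefix):]
    return text
```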

scanpipe.pipes.get_progress_percentage(current_index, total_count)

Return the percentage of progress given the current index and total count of objects.

scanpipe.pipes.log_progress(log_func, current_index, total_count, last_percent, increment_percent, start_time=None)

Log progress updates every increment_percent percentage points, given the current index and total count of objects. Return the latest percent logged.
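The throttling idea, logging only when progress has advanced by increment_percent points, might look like the sketch below. It is an assumption-laden illustration: the message format, the integer rounding, and the omission of start_time are choices of this sketch, and log_progress_sketch is a hypothetical name.

```python
def log_progress_sketch(log_func, current_index, total_count, last_percent, increment_percent):
    """Log when progress advanced by at least increment_percent points."""
    # Integer arithmetic avoids floating point surprises at the thresholds.
    percent = current_index * 100 // total_count
    if percent >= last_percent + increment_percent:
        log_func(f"Progress: {percent}% ({current_index}/{total_count})")
        return percent
    return last_percent


messages = []
last_percent = 0
for index in range(1, 101):
    # The returned value is fed back in so that the next call knows what was logged.
    last_percent = log_progress_sketch(messages.append, index, 100, last_percent, 25)
```

The caller keeps the returned percent and passes it back as last_percent on the next iteration, which is what keeps the log output sparse.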

scanpipe.pipes.get_text_str_diff_ratio(str_a, str_b)

Return a similarity ratio as a float between 0 and 1 by comparing the text content of the str_a and str_b.

Return None if either of the two strings is empty.
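A similarity ratio of this kind can be computed with the standard library difflib module. The sketch below assumes a line-based comparison and uses a hypothetical name (text_diff_ratio_sketch); the real function may compare the text differently.

```python
import difflib


def text_diff_ratio_sketch(str_a, str_b):
    """Return a line-based similarity ratio in [0, 1], or None for empty input."""
    if not str_a or not str_b:
        return None
    matcher = difflib.SequenceMatcher(a=str_a.splitlines(), b=str_b.splitlines())
    return matcher.ratio()


ratio = text_diff_ratio_sketch("a\nb\nc\nd", "a\nb\nc\nx")
```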

scanpipe.pipes.get_resource_diff_ratio(resource_a, resource_b)

Return a similarity ratio as a float between 0 and 1 by comparing the text content of the CodebaseResource resource_a and resource_b.

Return None if either of the two resources is not readable as text.

Codebase

scanpipe.pipes.codebase.get_resource_fields(resource, fields)

Return a mapping of field names, taken from fields, to their values on resource.

scanpipe.pipes.codebase.get_resource_tree(resource, fields, codebase=None, seen_resources={})

Return a tree as a dictionary structure starting from the provided resource.

The following classes are supported for the input resource object:
  • scanpipe.models.CodebaseResource

  • commoncode.resource.Resource

The data included for each child is controlled with the fields argument.

The codebase is only required in the context of a commoncode Resource input.

seen_resources is used when get_resource_tree() is used in the context of get_codebase_tree(). We keep track of child Resources we visit in seen_resources, so we don’t visit them again in get_codebase_tree().
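The recursion and the role of the visited set can be sketched with plain dictionaries. Everything here is hypothetical: the input shape (a mapping with "path" and "children" keys), the output shape, and the name get_resource_tree_sketch are assumptions made for illustration only.

```python
def get_resource_tree_sketch(resource, fields, seen=None):
    """Build a nested dict from a resource mapping that has a "children" list."""
    seen = set() if seen is None else seen
    seen.add(resource["path"])
    # Only the requested fields are copied into each node.
    tree = {field: resource.get(field) for field in fields}
    children = [
        get_resource_tree_sketch(child, fields, seen)
        for child in resource.get("children", [])
        if child["path"] not in seen
    ]
    if children:
        tree["children"] = children
    return tree


# Made-up codebase used only to exercise the sketch.
root = {
    "path": "root",
    "name": "root",
    "children": [
        {"path": "root/a.py", "name": "a.py"},
        {"path": "root/lib", "name": "lib", "children": [{"path": "root/lib/b.py", "name": "b.py"}]},
    ],
}
tree = get_resource_tree_sketch(root, ["name"])
```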

scanpipe.pipes.codebase.get_codebase_tree(codebase, fields)

Return a tree as a dictionary structure starting from the root resources of the provided codebase.

The following classes are supported for the input codebase object:
  • scanpipe.pipes.codebase.ProjectCodebase

  • commoncode.resource.Codebase

  • commoncode.resource.VirtualCodebase

The data included for each child is controlled with the fields argument.

class scanpipe.pipes.codebase.ProjectCodebase(project)

Represents the codebase of a project stored in the database. A Codebase is a tree of Resources.

__init__(project)

Compliance

scanpipe.pipes.compliance.tag_compliance_files(project)

Tag compliance files status for the provided project.

scanpipe.pipes.compliance.analyze_compliance_licenses(project)

Scan compliance licenses status for the provided project.

CycloneDX

scanpipe.pipes.cyclonedx.get_bom(cyclonedx_document)

Return CycloneDX BOM object.

scanpipe.pipes.cyclonedx.get_components(bom)

Return list of components from CycloneDX BOM.

scanpipe.pipes.cyclonedx.bom_attributes_to_dict(cyclonedx_attributes)

Return list of dict from a list of CycloneDX attributes.

scanpipe.pipes.cyclonedx.recursive_component_collector(root_component_list, collected)

Return list of components including the nested components.
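Flattening nested components is a small recursive walk. The sketch below operates on plain dicts shaped like CycloneDX JSON, where nested components live under a "components" key; the real function works on BOM objects, and collect_components_sketch is a hypothetical name.

```python
def collect_components_sketch(root_component_list, collected):
    """Append every component and its nested components to collected."""
    for component in root_component_list or []:
        collected.append(component)
        # Assumption: nested children are stored under the "components" key.
        collect_components_sketch(component.get("components"), collected)
    return collected


# Made-up components used only to exercise the sketch.
components = [
    {"name": "app", "components": [{"name": "lib", "components": [{"name": "sublib"}]}]},
    {"name": "tool"},
]
names = [c["name"] for c in collect_components_sketch(components, [])]
```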

scanpipe.pipes.cyclonedx.resolve_license(license)

Return license expression/id/name from license item.

scanpipe.pipes.cyclonedx.get_declared_licenses(licenses)

Return resolved license from list of LicenseChoice.

scanpipe.pipes.cyclonedx.get_checksums(component)

Return dict of all the checksums from a component.

scanpipe.pipes.cyclonedx.get_external_references(component)

Return dict of reference urls from list of component.externalReferences.

scanpipe.pipes.cyclonedx.get_properties_data(component)

Return the properties as dict, extracted from component.properties.

scanpipe.pipes.cyclonedx.validate_document(document, schema=PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/scancodeio/envs/latest/lib/python3.9/site-packages/scanpipe/pipes/schemas/bom-1.4.schema.json'))

Check the validity of this CycloneDX document.

scanpipe.pipes.cyclonedx.is_cyclonedx_bom(input_location)

Return True if the file at input_location is a CycloneDX BOM.

Deploy to develop

scanpipe.pipes.d2d.get_inputs(project)

Locate the from and to archives in project inputs directory.

scanpipe.pipes.d2d.get_resource_codebase_root(project, resource_path)

Return “to” or “from” depending on the resource location in the codebase.

scanpipe.pipes.d2d.yield_resources_from_codebase(project)

Yield CodebaseResource instances, including their info data, ready to be inserted in the database using save() or bulk_create().

scanpipe.pipes.d2d.collect_and_create_codebase_resources(project, batch_size=5000)

Collect and create codebase resources including the “to/” and “from/” context using the resource tag field.

The default batch_size can be overridden, although the benefits of a value greater than 5000 objects are usually not significant.

scanpipe.pipes.d2d.get_extracted_path(resource)

Return the -extract/ extracted path of provided resource.

scanpipe.pipes.d2d.get_extracted_subpath(path)

Return the path segments located after the last -extract/ segment.
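For nested archives, "after the last -extract/ segment" can be pictured with a simple string split. This is a minimal sketch with a hypothetical name; returning the path unchanged when no -extract/ segment is present is an assumption of the sketch.

```python
def get_extracted_subpath_sketch(path):
    """Return what follows the last "-extract/" segment, or path unchanged."""
    return path.split("-extract/")[-1]


subpath = get_extracted_subpath_sketch("to/app.jar-extract/lib/util.jar-extract/org/Foo.class")
```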

scanpipe.pipes.d2d.get_best_path_matches(to_resource, matches)

Return the best matches for the provided to_resource.

scanpipe.pipes.d2d.map_checksum(project, checksum_field, logger=None)

Map using checksum.

scanpipe.pipes.d2d.map_java_to_class(project, logger=None)

Map to/ compiled Java .class(es) to from/ .java source using Java fully qualified paths and indexing from/ .java files.

scanpipe.pipes.d2d.get_indexable_qualified_java_paths_from_values(resource_values)

Yield tuples of (resource id, fully-qualified Java path) for indexable classes from a list of resource_data tuples of “from/” side of the project codebase.

These resource_data input tuples are in the form:

(resource.id, resource.name, resource.extra_data)

And the output tuples look like this example:

(123, "org/apache/commons/LoggerImpl.java")

scanpipe.pipes.d2d.get_indexable_qualified_java_paths(from_resources_dot_java)

Yield tuples of (resource id, fully-qualified Java class name) for indexable classes from the “from/” side of the project codebase using the “java_package” Resource.extra_data.
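The mapping from (id, name, extra_data) tuples to (id, fully qualified path) tuples can be sketched as below. The assumption that extra_data holds the package under a "java_package" key follows the description above; skipping resources without that key and the name indexable_java_paths_sketch are choices of this sketch.

```python
def indexable_java_paths_sketch(resource_values):
    """Yield (resource id, fully qualified path) for resources with a known package."""
    for resource_id, name, extra_data in resource_values:
        java_package = (extra_data or {}).get("java_package")
        if not java_package:
            # Assumption: resources without a detected package are not indexable.
            continue
        yield resource_id, f"{java_package.replace('.', '/')}/{name}"


# Made-up input tuples in the (id, name, extra_data) form.
values = [
    (123, "LoggerImpl.java", {"java_package": "org.apache.commons"}),
    (124, "Scratch.java", {}),
]
indexed = list(indexable_java_paths_sketch(values))
```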

scanpipe.pipes.d2d.find_java_packages(project, logger=None)

Collect the Java packages of Java source files for a project.

Multiprocessing is enabled by default on this pipe; the number of processes can be controlled through the SCANCODEIO_PROCESSES setting.

Note: we use the same API as the ScanCode scans by design.

scanpipe.pipes.d2d.scan_for_java_package(location, with_threading=True)

Run a Java package scan on provided location.

Return a dict of scan results and a list of errors.

scanpipe.pipes.d2d.save_java_package_scan_results(codebase_resource, scan_results, scan_errors)

Save the resource Java package scan results in the database as Resource.extra_data. Create project errors if any occurred during the scan.

scanpipe.pipes.d2d.map_jar_to_source(project, logger=None)

Map .jar files to their related source directory.

scanpipe.pipes.d2d.flag_to_meta_inf_files(project)

Flag all META-INF/* files of the to/ directory as ignored.

scanpipe.pipes.d2d.map_path(project, logger=None)

Map using path suffix similarities.

scanpipe.pipes.d2d.create_package_from_purldb_data(project, resource, package_data)

Create a DiscoveredPackage instance from PurlDB package_data.

scanpipe.pipes.d2d.match_purldb_package(project, resource)

Match an archive type resource in the PurlDB.

scanpipe.pipes.d2d.match_purldb_resource(project, resource)

Match a single file resource in the PurlDB.

scanpipe.pipes.d2d.match_purldb(project, extensions, matcher_func, logger=None)

Match against PurlDB selecting codebase resources using provided package_extensions for archive type files, and resource_extensions for single resource files.

scanpipe.pipes.d2d.map_javascript(project, logger=None)

Map a packed or minified JavaScript, TypeScript, CSS and SCSS to its source.

Docker

scanpipe.pipes.docker.get_tarballs_from_inputs(project)

Return the tarballs from the project input/ work directory. Supported file extensions: .tar, .tar.gz, .tgz.

scanpipe.pipes.docker.extract_images_from_inputs(project)

Collect all the tarballs from the project input/ work directory, extract each tarball to the tmp/ work directory and collect the images.

Return the images and an errors list of error messages that may have happened during the extraction.

scanpipe.pipes.docker.extract_image_from_tarball(input_tarball, extract_target, verify=True)

Extract images from an input_tarball to an extract_target directory Path object and collect the extracted images.

Return the images and an errors list of error messages that may have happened during the extraction.

scanpipe.pipes.docker.extract_layers_from_images(project, images)

Extract all layers from the provided images into the project codebase work directory.

Return an errors list of error messages that may occur during the extraction.

scanpipe.pipes.docker.extract_layers_from_images_to_base_path(base_path, images)

Extract all layers from the provided images into the base_path work directory.

Return an errors list of error messages that may occur during the extraction.

scanpipe.pipes.docker.get_image_data(image, layer_path_segments=2)

Return a mapping of image-related data given an image. Keep only layer_path_segments trailing layer location segments (or keep the locations unmodified if layer_path_segments is 0).

scanpipe.pipes.docker.get_layer_tag(image_id, layer_id, layer_index, id_length=6)

Return a “tag” crafted from the provided image_id, layer_id, and layer_index. The purpose of this tag is to be short, clear and sortable.

For instance, given an image with an id:

785df58b6b3e120f59bce6cd10169a0c58b8837b24f382e27593e2eea011a0d8

and two layers from bottom to top as:

0690c89adf3e8c306d4ced085fc16d1d104dcfddd6dc637e141fa78be242a707
7a1d89d2653e8e4aa9011fd95034a4857109d6636f2ad32df470a196e5dd1585

we would get these two tags:

img-785df5-layer-01-0690c8
img-785df5-layer-02-7a1d89
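The tag format shown above can be reproduced with a short f-string. This is a sketch under two assumptions: the ids are simply truncated to id_length characters, and layer_index is counted from 1 as in the example. The name get_layer_tag_sketch is hypothetical.

```python
def get_layer_tag_sketch(image_id, layer_id, layer_index, id_length=6):
    """Build a short, sortable tag from an image id, a layer id and a layer index."""
    short_image_id = image_id[:id_length]
    short_layer_id = layer_id[:id_length]
    # Zero-padding the index keeps the tags sortable as plain strings.
    return f"img-{short_image_id}-layer-{layer_index:02}-{short_layer_id}"


tag = get_layer_tag_sketch(
    image_id="785df58b6b3e120f59bce6cd10169a0c58b8837b24f382e27593e2eea011a0d8",
    layer_id="0690c89adf3e8c306d4ced085fc16d1d104dcfddd6dc637e141fa78be242a707",
    layer_index=1,
)
```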

scanpipe.pipes.docker.create_codebase_resources(project, image)

Create the CodebaseResource for an image in a project.

scanpipe.pipes.docker.scan_image_for_system_packages(project, image)

Given a project and an image, scan the image layer by layer for installed system packages and create a DiscoveredPackage for each.

Then for each installed DiscoveredPackage file, check if it exists as a CodebaseResource. If it exists, relate that CodebaseResource to its DiscoveredPackage; otherwise, keep it as a missing file.

scanpipe.pipes.docker.tag_whiteout_codebase_resources(project)

Tag overlayfs/AUFS whiteout special files CodebaseResource as “ignored-whiteout”. See https://github.com/opencontainers/image-spec/blob/master/layer.md#whiteouts for details.
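In the OCI image layer format, whiteout files are recognized by a ".wh." file name prefix. The sketch below checks a path for that prefix; is_whiteout_sketch is a hypothetical helper and says nothing about how scanpipe queries the database for such resources.

```python
from pathlib import PurePosixPath

WHITEOUT_PREFIX = ".wh."


def is_whiteout_sketch(path):
    """Return True if the file name of path carries the whiteout prefix."""
    return PurePosixPath(path).name.startswith(WHITEOUT_PREFIX)
```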

class scanpipe.pipes.docker.Layer(layer_tag, created_by, layer_id, image_id, created, size, author, comment)
author

Alias for field number 6

comment

Alias for field number 7

created

Alias for field number 4

created_by

Alias for field number 1

image_id

Alias for field number 3

layer_id

Alias for field number 2

layer_tag

Alias for field number 0

size

Alias for field number 5
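The "Alias for field number N" entries above are the standard documentation of a named tuple: each field can be read by name or by position. The sketch below rebuilds an equivalent tuple under a hypothetical name (LayerSketch) with made-up values.

```python
from collections import namedtuple

# Field order matches the class signature documented above.
LayerSketch = namedtuple(
    "LayerSketch",
    "layer_tag created_by layer_id image_id created size author comment",
)

# All values below are invented for illustration.
layer = LayerSketch(
    layer_tag="img-785df5-layer-01-0690c8",
    created_by="example build step",
    layer_id="0690c8",
    image_id="785df5",
    created="2024-01-01T00:00:00Z",
    size=1024,
    author="example author",
    comment="",
)
```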

scanpipe.pipes.docker.get_layers_data(project)

Get list of structured layers data from project extra_data field.

Fetch

scanpipe.pipes.fetch.fetch_http(uri, to=None)

Download a given uri in a temporary directory and return the directory’s path.

exception scanpipe.pipes.fetch.FetchDockerImageError

scanpipe.pipes.fetch.get_docker_image_platform(docker_reference)

Return a platform mapping of a Docker reference. If there is more than one platform, return the first one by default.

scanpipe.pipes.fetch.fetch_docker_image(docker_reference, to=None)

Fetch a Docker image from the provided docker_reference, a docker:// reference URL. Return a download object.

Docker references are documented here: https://github.com/containers/skopeo/blob/0faf16017/docs/skopeo.1.md#image-names

scanpipe.pipes.fetch.fetch_urls(urls)

Fetch the provided urls list. The urls can also be provided as a string containing one URL per line. Return the fetched URLs as download objects and a list of errors.
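Accepting either a list or a multi-line string usually starts with a normalization step. The sketch below shows only that step, with no network access; normalize_urls_sketch is a hypothetical helper, and dropping blank lines is an assumption of the sketch.

```python
def normalize_urls_sketch(urls):
    """Return a clean list of URLs from a list or a one-URL-per-line string."""
    if isinstance(urls, str):
        urls = urls.splitlines()
    # Assumption: surrounding whitespace and empty lines are ignored.
    return [url.strip() for url in urls if url.strip()]


as_list = normalize_urls_sketch("https://example.com/a.zip\n\n https://example.com/b.zip \n")
```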

Input

scanpipe.pipes.input.copy_input(input_location, dest_path)

Copy the input_location to the dest_path.

scanpipe.pipes.input.copy_inputs(input_locations, dest_path)

Copy the provided input_locations to the dest_path.

scanpipe.pipes.input.move_inputs(inputs, dest_path)

Move the provided inputs to the dest_path.

scanpipe.pipes.input.get_tool_name_from_scan_headers(scan_data)

Return the tool_name of the first header in the provided scan_data.
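ScanCode JSON output carries a "headers" list whose entries include a "tool_name". Reading the first one can be sketched as follows; the fallback to an empty string and the name get_tool_name_sketch are assumptions of this sketch.

```python
def get_tool_name_sketch(scan_data):
    """Return the tool_name of the first header, or "" if there is none."""
    headers = scan_data.get("headers") or []
    if headers:
        return headers[0].get("tool_name", "")
    return ""


tool_name = get_tool_name_sketch({"headers": [{"tool_name": "scancode-toolkit"}]})
```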

scanpipe.pipes.input.load_inventory_from_toolkit_scan(project, input_location)

Create packages, dependencies, and resources loaded from the ScanCode-toolkit scan results located at input_location.

scanpipe.pipes.input.load_inventory_from_scanpipe(project, scan_data)

Create packages, dependencies, resources, and relations loaded from a ScanCode.io JSON output provided as scan_data.

scanpipe.pipes.input.get_worksheet_data(worksheet)

Return the data from provided worksheet as a list of dict.
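Turning worksheet rows into a list of dicts typically means treating the first row as the header. The sketch below works on plain tuples instead of a real worksheet object; rows_to_dicts_sketch is a hypothetical name and the header-row convention is an assumption.

```python
def rows_to_dicts_sketch(rows):
    """Map each data row to a dict keyed by the values of the first (header) row."""
    rows = iter(rows)
    header = next(rows, None)
    if header is None:
        return []
    return [dict(zip(header, row)) for row in rows]


# Made-up rows standing in for worksheet cell values.
data = rows_to_dicts_sketch([("path", "status"), ("a.py", "scanned"), ("b.py", "ignored")])
```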

scanpipe.pipes.input.clean_xlsx_field_value(model_class, field_name, value)

Clean the value for compatibility with the database model_class.

scanpipe.pipes.input.clean_xlsx_data_to_model_data(model_class, xlsx_data)

Clean the xlsx_data for compatibility with the database model_class.

scanpipe.pipes.input.load_inventory_from_xlsx(project, input_location)

Create packages, dependencies, resources, and relations loaded from XLSX file located at input_location.

JVM

Support for JVM-specific file formats such as .class and .java files.

scanpipe.pipes.jvm.get_java_package(location, java_extensions=('.java',), **kwargs)

Return a Java package as a mapping with a single “java_package” key, or None from the .java source code file at location.

Only look at files with an extension in the java_extensions tuple.

Note: this is the same API as a ScanCode Toolkit API scanner function by design.

scanpipe.pipes.jvm.find_java_package(lines)

Return a mapping of {'java_package': <value>} or None from an iterable of text lines.

For example:

>>> lines = ["   package    foo.back ;  # dsasdasdasdasdasda.asdasdasd"]
>>> assert find_java_package(lines) == {"java_package": "foo.back"}

scanpipe.pipes.jvm.get_normalized_java_path(path)

Return a normalized .java file path for the .class file path string path. Account for inner classes in that their .java file name is the name of their outer class.

For example:

>>> get_normalized_java_path("foo/org/common/Bar$inner.class")
'foo/org/common/Bar.java'
>>> get_normalized_java_path("foo/org/common/Bar.class")
'foo/org/common/Bar.java'

scanpipe.pipes.jvm.get_fully_qualified_java_path(java_package, filename)

Return a fully qualified java path of a .java filename in a java_package string. Note that we use “/” as path separators.

For example:

>>> get_fully_qualified_java_path("org.common", "Bar.java")
'org/common/Bar.java'
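The two path helpers documented above can be approximated with a few lines of string handling. These are sketches with hypothetical names that reproduce the doctest results shown; the real implementations may handle more edge cases.

```python
from pathlib import PurePosixPath


def normalized_java_path_sketch(path):
    """Map a .class path to the .java path of its outer class."""
    class_path = PurePosixPath(path)
    # Inner classes compile to Outer$Inner.class, so keep only the outer name.
    outer_name = class_path.stem.split("$")[0]
    return str(class_path.with_name(f"{outer_name}.java"))


def fully_qualified_java_path_sketch(java_package, filename):
    """Join a dotted Java package and a file name using "/" separators."""
    return f"{java_package.replace('.', '/')}/{filename}"
```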

Output

scanpipe.pipes.output.safe_filename(filename)

Convert the provided filename to a safe filename.
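What counts as "safe" is not spelled out here. One common approach, shown as a sketch only, replaces every run of characters outside letters, digits, dots and hyphens with an underscore. Both the character set and the name safe_filename_sketch are assumptions.

```python
import re


def safe_filename_sketch(filename):
    """Replace runs of characters outside A-Z, a-z, 0-9, "." and "-" with "_"."""
    return re.sub(r"[^A-Za-z0-9.-]+", "_", filename)


safe = safe_filename_sketch("My Project: v1.0 (final).json")
```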

scanpipe.pipes.output.get_queryset(project, model_name)

Return a consistent QuerySet for all supported outputs (json, xlsx, csv, …).

scanpipe.pipes.output.queryset_to_csv_file(queryset, fieldnames, output_file)

Output csv content generated from the provided queryset objects to the output_file. The fields to be included as columns and their order are controlled by the fieldnames list.

scanpipe.pipes.output.queryset_to_csv_stream(queryset, fieldnames, output_stream)

Output csv content generated from the provided queryset objects to the output_stream. The fields to be included as columns and their order are controlled by the fieldnames list.
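The role of fieldnames, selecting and ordering the columns, can be shown with the standard csv module. The sketch below writes plain dicts rather than queryset objects; rows_to_csv_stream_sketch is a hypothetical name, and ignoring keys not listed in fieldnames is a choice of the sketch.

```python
import csv
import io


def rows_to_csv_stream_sketch(rows, fieldnames, output_stream):
    """Write a header and one line per row, keeping only the listed fields."""
    writer = csv.DictWriter(output_stream, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


stream = io.StringIO()
rows_to_csv_stream_sketch(
    [{"path": "a.py", "status": "scanned", "size": 10}],  # made-up row
    ["path", "status"],
    stream,
)
```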

scanpipe.pipes.output.to_csv(project)

Generate output for the provided project in csv format. Since the csv format does not support multiple tabs, one file is created per object type. The output files are created in the project output/ directory. Return a list of paths of the generated output files.

scanpipe.pipes.output.to_json(project)

Generate output for the provided project in JSON format. The output file is created in the project output/ directory. Return the path of the generated output file.

scanpipe.pipes.output.queryset_to_xlsx_worksheet(queryset, workbook, exclude_fields=())

Add a new worksheet to the workbook xlsxwriter.Workbook using the queryset. The queryset “model_name” is used as a name for the “worksheet”. Exclude fields listed in the exclude_fields sequence of field names.

Add an extra trailing “xlsx_errors” column with conversion error messages if any. Return a number of conversion errors.

scanpipe.pipes.output.to_xlsx(project)

Generate output for the provided project in XLSX format. The output file is created in the project “output/” directory. Return the path of the generated output file.

Note that each XLSX worksheet contains an extra “xlsx_errors” column with possible error messages for a row when the data converted to XLSX exceeds the limits of what can be stored in a cell.

scanpipe.pipes.output.to_spdx(project)

Generate output for the provided project in SPDX document format. The output file is created in the project “output/” directory. Return the path of the generated output file.

scanpipe.pipes.output.get_cyclonedx_bom(project)

Return a CycloneDX Bom object filled with provided project data. See https://cyclonedx.org/use-cases/#dependency-graph

scanpipe.pipes.output.to_cyclonedx(project)

Generate output for the provided project in CycloneDX BOM format. The output file is created in the project “output/” directory. Return the path of the generated output file.

scanpipe.pipes.output.render_template(template_location, context)

Render a Django template at template_location using the context dict.

scanpipe.pipes.output.get_attribution_template(project)

Return a custom attribution template if provided or the default one.

scanpipe.pipes.output.make_unknown_license_object(license_symbol)

Return a License object suitable for the provided license_symbol, that is representing a license key unknown by the current toolkit licensed index.

scanpipe.pipes.output.get_package_expression_symbols(parsed_expression)

Return the list of license_symbols contained in the parsed_expression. Since unknown license keys are missing a License set in the wrapped attribute, a special “unknown” License object is injected.

scanpipe.pipes.output.to_attribution(project)

Generate attribution for the provided project. The output file is created in the project “output/” directory. Return the path of the generated output file. Custom template can be provided in the codebase/.scancode/templates/attribution.html location.

PurlDB

scanpipe.pipes.purldb.is_configured()

Return True if the required PurlDB settings have been set.

scanpipe.pipes.purldb.is_available()

Return True if the configured PurlDB server is available.

scanpipe.pipes.purldb.request_get(url, payload=None, timeout=None)

Wrap the HTTP request calls on the API.

scanpipe.pipes.purldb.match_package(sha1, timeout=None, api_url=None)

Match a SHA1 in the PurlDB for package-type file.

scanpipe.pipes.purldb.match_resource(sha1_list, timeout=None, api_url=None)

Match a list of SHA1s in the PurlDB for a single resource file.

Resolve

scanpipe.pipes.resolve.resolve_packages(input_location)

Resolve the packages from a manifest file.

scanpipe.pipes.resolve.resolve_pypi_packages(input_location)

Resolve the PyPI packages from the input_location requirements file.

scanpipe.pipes.resolve.resolve_about_packages(input_location)

Resolve the packages from the input_location .ABOUT file.

scanpipe.pipes.resolve.convert_spdx_expression(license_expression_spdx)

Return a ScanCode license expression from an SPDX license_expression_spdx string.

scanpipe.pipes.resolve.resolve_spdx_packages(input_location)

Resolve the packages from the input_location SPDX document file.

scanpipe.pipes.resolve.cyclonedx_component_to_package_data(component_data)

Return package_data from CycloneDX component.

scanpipe.pipes.resolve.resolve_cyclonedx_packages(input_location)

Resolve the packages from the input_location CycloneDX document file.

scanpipe.pipes.resolve.get_default_package_type(input_location)

Return the package type associated with the provided input_location. This type is used to get the related handler that knows how to process the input.

scanpipe.pipes.resolve.set_license_expression(package_data)

Set the license expression from a detected license dict/str in provided package_data.

RootFS

exception scanpipe.pipes.rootfs.DistroNotFound
exception scanpipe.pipes.rootfs.DistroNotSupported
class scanpipe.pipes.rootfs.RootFs(location, distro=None)

A root filesystem.

classmethod from_project_codebase(project)

Return RootFs objects collected from the project’s “codebase” directory. Each directory in the input/ is considered as the root of a root filesystem.

get_resources(with_dir=False)

Return a Resource for each file in this rootfs.

get_installed_packages(packages_getter)

Return tuples of (package_url, package) for installed packages found in this rootfs layer using the packages_getter function or callable.

The packages_getter() function should:

  • Accept a first argument string that is the root directory of filesystem of this rootfs

  • Return tuples of (package_url, package), where package_url is a string that uniquely identifies a package and package is an object that represents a package (typically a scancode-toolkit packagedcode.models.Package class or some nested mapping with the same structure).

The packages_getter function would typically query the system packages database, such as an RPM database or similar, to collect the list of installed system packages.
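A callable that satisfies the contract described above might look like the following sketch. The package data is hard-coded and invented; a real getter would read the package database under root_dir. The name packages_getter_sketch and the dict-based package objects are assumptions.

```python
def packages_getter_sketch(root_dir, **kwargs):
    """Yield (package_url, package) tuples for a rootfs located at root_dir."""
    # Invented data: a real getter would query the package database in root_dir.
    installed = [
        {"type": "deb", "namespace": "debian", "name": "bash", "version": "5.1-2"},
    ]
    for package in installed:
        purl = (
            f"pkg:{package['type']}/{package['namespace']}/"
            f"{package['name']}@{package['version']}"
        )
        yield purl, package


found = list(packages_getter_sketch("/tmp/rootfs"))
```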

__init__(location, distro=None) -> None

Method generated by attrs for class RootFs.

scanpipe.pipes.rootfs.get_resources(location, with_dir=False)

Return the Resource found in the location in root directory of a rootfs.

scanpipe.pipes.rootfs.create_codebase_resources(project, rootfs)

Create the CodebaseResource for a rootfs in project.

scanpipe.pipes.rootfs.has_hash_diff(install_file, codebase_resource)

Return True if one of available hashes on both install_file and codebase_resource, by hash type, is different. For example: Alpine uses SHA1 while Debian uses MD5, we prefer the strongest hash that’s present.
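The "strongest hash present on both sides" rule can be sketched with plain dicts. The ordering of hash types, the dict-based inputs, and the name has_hash_diff_sketch are assumptions of this sketch; returning False when no hash type is shared is also a choice made here.

```python
# Assumption: hash types ordered from strongest to weakest.
HASH_TYPES = ("sha512", "sha256", "sha1", "md5")


def has_hash_diff_sketch(install_file, codebase_resource):
    """Compare the strongest hash type available on both sides."""
    for hash_type in HASH_TYPES:
        expected = install_file.get(hash_type)
        actual = codebase_resource.get(hash_type)
        if expected and actual:
            return expected != actual
    # No hash type in common: nothing to compare.
    return False
```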

scanpipe.pipes.rootfs.package_getter(root_dir, **kwargs)

Return installed package objects.

scanpipe.pipes.rootfs.scan_rootfs_for_system_packages(project, rootfs)

Given a project Project and a rootfs RootFs, scan the rootfs for installed system packages, and create a DiscoveredPackage for each.

Then for each installed DiscoveredPackage file, check if it exists as a CodebaseResource. If it exists, relate that CodebaseResource to its DiscoveredPackage; otherwise, keep it as a missing file.

scanpipe.pipes.rootfs.get_resource_with_md5(project, status)

Return a queryset of CodebaseResource from a project that has a status, a non-empty size, and md5.

scanpipe.pipes.rootfs.match_not_analyzed(project, reference_status='system-package', not_analyzed_status='not-analyzed')

Given a project Project:

  1. Build an MD5 index of files assigned to a package that has a status of reference_status

  2. Attempt to match resources with status not_analyzed_status to that index

  3. Relate each matched CodebaseResource to the matching DiscoveredPackage and set its status.
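The index-then-match steps described above can be sketched on plain dicts instead of database models. The resource shape, the status assigned to matched resources (here reference_status), and the name match_not_analyzed_sketch are all assumptions made for illustration.

```python
def match_not_analyzed_sketch(
    resources, reference_status="system-package", not_analyzed_status="not-analyzed"
):
    """Relate not-analyzed resources to packages through a shared MD5."""
    # Build an MD5 index of resources already assigned to a package.
    md5_index = {
        resource["md5"]: resource["package"]
        for resource in resources
        if resource["status"] == reference_status
        and resource.get("md5")
        and resource.get("package")
    }
    # Match the not-analyzed resources against the index and relate them.
    for resource in resources:
        if resource["status"] != not_analyzed_status:
            continue
        package = md5_index.get(resource.get("md5"))
        if package:
            resource["package"] = package
            # Assumption: matched resources take the reference status.
            resource["status"] = reference_status
    return resources


# Made-up resources used only to exercise the sketch.
resources = match_not_analyzed_sketch([
    {"path": "usr/bin/tool", "md5": "abc", "status": "system-package", "package": "pkg:deb/tool"},
    {"path": "opt/copy/tool", "md5": "abc", "status": "not-analyzed"},
    {"path": "opt/other", "md5": "zzz", "status": "not-analyzed"},
])
```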

scanpipe.pipes.rootfs.tag_uninteresting_codebase_resources(project)

Check any file that doesn’t belong to any system package and determine, using a few heuristics, if it’s:

  • A temp file

  • Generated

  • Log file of sorts (such as var)

scanpipe.pipes.rootfs.tag_ignorable_codebase_resources(project)

Tag codebase resource using the glob patterns from commoncode.ignore of ignorable files/directories, if their paths match an ignorable pattern.

scanpipe.pipes.rootfs.tag_data_files_with_no_clues(project)

Tag CodebaseResources that have a file type of data and no detected clues as uninteresting.

scanpipe.pipes.rootfs.tag_media_files_as_uninteresting(project)

Tag CodebaseResources that are media files as uninteresting.

ScanCode

scanpipe.pipes.scancode.logger = <Logger scanpipe.pipes (INFO)>

Utilities to deal with ScanCode toolkit features and objects.

scanpipe.pipes.scancode.get_max_workers(keep_available)

Return the SCANCODEIO_PROCESSES value if defined in the settings, or return a default value based on the number of available CPUs, minus the provided keep_available value.

On operating systems where the multiprocessing start method is not “fork” but, for example, “spawn” (such as macOS), multiprocessing and threading are disabled by default by returning -1 as max_workers.
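The decision order described above (explicit setting first, then the start method check, then a CPU-based default) can be sketched as follows. The extra parameters configured_processes and start_method are hypothetical stand-ins for the Django setting and the platform, added so the sketch can run on its own; the floor of one worker is also an assumption.

```python
import multiprocessing
import os


def get_max_workers_sketch(keep_available, configured_processes=None, start_method=None):
    """Pick a worker count from a setting, the start method, or the CPU count."""
    # configured_processes stands in for the SCANCODEIO_PROCESSES setting.
    if configured_processes is not None:
        return configured_processes
    start_method = start_method or multiprocessing.get_start_method()
    if start_method != "fork":
        # -1 signals that multiprocessing and threading are disabled.
        return -1
    max_workers = (os.cpu_count() or 1) - keep_available
    # Assumption: never go below one worker.
    return max(max_workers, 1)
```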

scanpipe.pipes.scancode.extract_archive(location, target)

Extract a single archive or compressed file at location to the target directory.

Return a list of extraction errors.

Wrapper of the extractcode.api.extract_archive function.

scanpipe.pipes.scancode.extract_archives(location, recurse=False)

Extract all archives at location and return errors.

Archives and compressed files are extracted in a new directory named “<file_name>-extract” created in the same directory as each extracted archive.

If recurse is True, extract nested archives-in-archives recursively.

Return a list of extraction errors.

Wrapper of the extractcode.api.extract_archives function.

scanpipe.pipes.scancode.get_resource_info(location)

Return a mapping suitable for the creation of a new CodebaseResource.

scanpipe.pipes.scancode.scan_file(location, with_threading=True)

Run a license, copyright, email, and url scan on a provided location, using the scancode-toolkit direct API.

Return a dictionary of scan results and a list of errors.

scanpipe.pipes.scancode.scan_for_package_data(location, with_threading=True)

Run a package scan on provided location using the scancode-toolkit direct API.

Return a dict of scan results and a list of errors.

scanpipe.pipes.scancode.save_scan_file_results(codebase_resource, scan_results, scan_errors)

Save the resource scan file results in the database. Create project errors if any occurred during the scan.

scanpipe.pipes.scancode.save_scan_package_results(codebase_resource, scan_results, scan_errors)

Save the resource scan package results in the database. Create project errors if any occurred during the scan.

scanpipe.pipes.scancode.scan_for_files(project, resource_qs=None)

Run a license, copyright, email, and url scan on files without a status for a project.

Multiprocessing is enabled by default on this pipe; the number of processes can be controlled through the SCANCODEIO_PROCESSES setting.

scanpipe.pipes.scancode.scan_for_application_packages(project)

Run a package scan on files without a status for a project, then create DiscoveredPackage and DiscoveredDependency instances from the detected package data.

Multiprocessing is enabled by default on this pipe; the number of processes can be controlled through the SCANCODEIO_PROCESSES setting.

scanpipe.pipes.scancode.add_resource_to_package(package_uid, resource, project)

Relate a DiscoveredPackage to resource from project using package_uid.

Add a ProjectError when the DiscoveredPackage could not be fetched using the provided package_uid.

scanpipe.pipes.scancode.assemble_packages(project)

Create instances of DiscoveredPackage and DiscoveredDependency for project from the parsed package data present in the CodebaseResources of project.

scanpipe.pipes.scancode.run_scancode(location, output_file, options, raise_on_error=False)

Scan the location content and write the results into an output_file. The scancode executable will run using the provided options. If raise_on_error is enabled, a ScancodeError will be raised if the exitcode is greater than 0.

scanpipe.pipes.scancode.get_virtual_codebase(project, input_location)

Return a ScanCode virtual codebase built from the JSON scan file located at the input_location.

scanpipe.pipes.scancode.create_codebase_resources(project, scanned_codebase)

Save the resources of a ScanCode scanned_codebase scancode.resource.Codebase object to the database as a CodebaseResource of the project. This function can be used to extend an existing project Codebase with new CodebaseResource objects, as the existing objects (based on the path) will be skipped.

scanpipe.pipes.scancode.create_discovered_packages(project, scanned_codebase)

Save the packages of a ScanCode scanned_codebase scancode.resource.Codebase object to the database as a DiscoveredPackage of project.

scanpipe.pipes.scancode.create_discovered_dependencies(project, scanned_codebase, strip_datafile_path_root=False)

Save the dependencies of a ScanCode scanned_codebase scancode.resource.Codebase object to the database as a DiscoveredDependency of project.

If strip_datafile_path_root is True, then DiscoveredDependency.create_from_data() will strip the root path segment from the datafile_path of dependency_data before looking up the corresponding CodebaseResource for datafile_path. This is used in the case where Dependency data is imported from a scancode-toolkit scan, where the root path segments are not stripped for datafile_path.
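
As an illustration of what stripping the root path segment means, here is a minimal sketch using pathlib. The helper strip_root_segment is hypothetical and only mirrors the idea described above; it assumes relative paths with POSIX separators and is not the code used by DiscoveredDependency.create_from_data().

```python
from pathlib import PurePosixPath


def strip_root_segment(datafile_path):
    """Return datafile_path without its first path segment.

    Hypothetical helper; paths are assumed to be relative and to use
    POSIX separators.
    """
    parts = PurePosixPath(datafile_path).parts
    # A single-segment path has no root segment to strip.
    if len(parts) < 2:
        return datafile_path
    return str(PurePosixPath(*parts[1:]))


stripped = strip_root_segment("codebase/project/setup.py")
```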

scanpipe.pipes.scancode.set_codebase_resource_for_package(codebase_resource, discovered_package)

Assign the discovered_package to the codebase_resource and set its status to “application-package”.

scanpipe.pipes.scancode.make_results_summary(project, scan_results_location)

Extract selected sections of the scan results, such as the summary, license_clarity_score, and license_matches related data. The key_files are also collected and injected in the summary output.

SPDX

scanpipe.pipes.spdx.SPDX_SCHEMA_URL = 'https://raw.githubusercontent.com/spdx/spdx-spec/v2.3/schemas/spdx-schema.json'

Generate SPDX Documents. Spec documentation: https://spdx.github.io/spdx-spec/v2.3/

Usage:

import pathlib
from scanpipe.pipes import spdx

creation_info = spdx.CreationInfo(
    person_name="John Doe",
    person_email="john@starship.space",
    organization_name="Starship",
    tool="SPDXCode-1.0",
)

package1 = spdx.Package(
    spdx_id="SPDXRef-package1",
    name="lxml",
    version="3.3.5",
    license_concluded="LicenseRef-1",
    checksums=[
        spdx.Checksum(
            algorithm="SHA1", value="10c72b88de4c5f3095ebe20b4d8afbedb32b8f"
        ),
        spdx.Checksum(algorithm="MD5", value="56770c1a2df6e0dc51c491f0a5b9d865"),
    ],
    external_refs=[
        spdx.ExternalRef(
            category="PACKAGE-MANAGER",
            type="purl",
            locator="pkg:pypi/lxml@3.3.5",
        ),
    ]
)

document = spdx.Document(
    name="Document name",
    namespace="https://[CreatorWebsite]/[pathToSpdx]/[DocumentName]-[UUID]",
    creation_info=creation_info,
    packages=[package1],
    extracted_licenses=[
        spdx.ExtractedLicensingInfo(
            license_id="LicenseRef-1",
            extracted_text="License Text",
            name="License 1",
            see_alsos=["https://license1.text"],
        ),
    ],
    comment="This document was created using SPDXCode-1.0",
)

# Display document content:
print(document.as_json())

# Validate document
schema = pathlib.Path(spdx.SPDX_JSON_SCHEMA_LOCATION).read_text()
document.validate(schema)

# Write document to a file:
with open("document_name.spdx.json", "w") as f:
    f.write(document.as_json())

class scanpipe.pipes.spdx.CreationInfo(person_name: str = '', organization_name: str = '', tool: str = '', person_email: str = '', organization_email: str = '', license_list_version: str = '3.18', comment: str = '', created: str = <factory>)

One instance is required for each SPDX file produced. It provides the necessary information for forward and backward compatibility for processing tools.

created: str = <factory>

Identify when the SPDX document was originally created. The date must be a combined date and time in UTC, formatted according to the ISO 8601 standard: YYYY-MM-DDThh:mm:ssZ

as_dict()

Return the data as a serializable dict.

get_creators_spdx()

Return the creators list from related field values.

static get_creators_dict(creators_data)

Return the creators dict from SPDX data.

__init__(person_name: str = '', organization_name: str = '', tool: str = '', person_email: str = '', organization_email: str = '', license_list_version: str = '3.18', comment: str = '', created: str = <factory>) -> None

class scanpipe.pipes.spdx.Checksum(algorithm: str, value: str)

The checksum provides a mechanism that can be used to verify that the contents of a File or Package have not changed.

as_dict()

Return the data as a serializable dict.
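
To show where algorithm and value pairs could come from, the following sketch computes SHA1 and MD5 digests with hashlib. The helper compute_checksums is hypothetical. The keys "algorithm" and "checksumValue" are the field names used by the SPDX 2.3 JSON schema; whether as_dict() emits exactly this shape is an assumption.

```python
import hashlib


def compute_checksums(content):
    """Return SHA1 and MD5 checksum entries for the content bytes."""
    checksums = []
    for algorithm in ("sha1", "md5"):
        digest = hashlib.new(algorithm, content).hexdigest()
        # Hypothetical output shape, keyed like the SPDX 2.3 JSON schema.
        checksums.append({"algorithm": algorithm.upper(), "checksumValue": digest})
    return checksums


checksums = compute_checksums(b"")
```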

__init__(algorithm: str, value: str) -> None

class scanpipe.pipes.spdx.ExternalRef(category: str, type: str, locator: str, comment: str = '')

An External Reference allows a Package to reference an external source of additional information, metadata, enumerations, asset identifiers, or downloadable content believed to be relevant to the Package.

as_dict()

Return the data as a serializable dict.

__init__(category: str, type: str, locator: str, comment: str = '') -> None

class scanpipe.pipes.spdx.ExtractedLicensingInfo(license_id: str, extracted_text: str, name: str = '', comment: str = '', see_alsos: ~typing.List[str] = <factory>)

An ExtractedLicensingInfo represents a license or licensing notice that was found in a package, file or snippet. Any license text that is recognized as a license may be represented as a License rather than an ExtractedLicensingInfo.

as_dict()

Return the data as a serializable dict.

__init__(license_id: str, extracted_text: str, name: str = '', comment: str = '', see_alsos: ~typing.List[str] = <factory>) -> None

class scanpipe.pipes.spdx.Package(spdx_id: str, name: str, download_location: str = 'NOASSERTION', license_declared: str = 'NOASSERTION', license_concluded: str = 'NOASSERTION', copyright_text: str = 'NOASSERTION', files_analyzed: bool = False, version: str = '', supplier: str = '', originator: str = '', homepage: str = '', filename: str = '', description: str = '', summary: str = '', source_info: str = '', release_date: str = '', built_date: str = '', valid_until_date: str = '', primary_package_purpose: str = '', comment: str = '', license_comments: str = '', checksums: ~typing.List[~scanpipe.pipes.spdx.Checksum] = <factory>, external_refs: ~typing.List[~scanpipe.pipes.spdx.ExternalRef] = <factory>, attribution_texts: ~typing.List[str] = <factory>)

Packages referenced in the SPDX document.

as_dict()

Return the data as a serializable dict.

static date_to_iso(date_str)

Convert a provided date_str to the SPDX format: YYYY-MM-DDThh:mm:ssZ.
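
The target format can be produced with the standard datetime module, as in the sketch below. The helper to_spdx_datetime is hypothetical and is not the implementation of date_to_iso; it assumes the input is an ISO 8601 date or datetime string and that values without a timezone are already in UTC.

```python
from datetime import datetime, timezone


def to_spdx_datetime(date_str):
    """Convert an ISO 8601 date or datetime string to YYYY-MM-DDThh:mm:ssZ.

    Hypothetical helper; naive values are assumed to already be in UTC.
    """
    parsed = datetime.fromisoformat(date_str)
    if parsed.tzinfo is not None:
        # Normalize timezone-aware values to UTC before formatting.
        parsed = parsed.astimezone(timezone.utc)
    return parsed.strftime("%Y-%m-%dT%H:%M:%SZ")


release_date = to_spdx_datetime("2014-04-21")
```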

__init__(spdx_id: str, name: str, download_location: str = 'NOASSERTION', license_declared: str = 'NOASSERTION', license_concluded: str = 'NOASSERTION', copyright_text: str = 'NOASSERTION', files_analyzed: bool = False, version: str = '', supplier: str = '', originator: str = '', homepage: str = '', filename: str = '', description: str = '', summary: str = '', source_info: str = '', release_date: str = '', built_date: str = '', valid_until_date: str = '', primary_package_purpose: str = '', comment: str = '', license_comments: str = '', checksums: ~typing.List[~scanpipe.pipes.spdx.Checksum] = <factory>, external_refs: ~typing.List[~scanpipe.pipes.spdx.ExternalRef] = <factory>, attribution_texts: ~typing.List[str] = <factory>) -> None

class scanpipe.pipes.spdx.File(spdx_id: str, name: str, checksums: ~typing.List[~scanpipe.pipes.spdx.Checksum] = <factory>, license_concluded: str = 'NOASSERTION', copyright_text: str = 'NOASSERTION', license_in_files: ~typing.List[str] = <factory>, contributors: ~typing.List[str] = <factory>, notice_text: str = '', types: ~typing.List[str] = <factory>, attribution_texts: ~typing.List[str] = <factory>, comment: str = '', license_comments: str = '')

Files referenced in the SPDX document.

as_dict()

Return the data as a serializable dict.

__init__(spdx_id: str, name: str, checksums: ~typing.List[~scanpipe.pipes.spdx.Checksum] = <factory>, license_concluded: str = 'NOASSERTION', copyright_text: str = 'NOASSERTION', license_in_files: ~typing.List[str] = <factory>, contributors: ~typing.List[str] = <factory>, notice_text: str = '', types: ~typing.List[str] = <factory>, attribution_texts: ~typing.List[str] = <factory>, comment: str = '', license_comments: str = '') -> None

class scanpipe.pipes.spdx.Relationship(spdx_id: str, related_spdx_id: str, relationship: str, comment: str = '')

Represent the relationship between two SPDX elements. For example, you can represent a relationship between two different Files, between a Package and a File, between two Packages, or between one SPDXDocument and another SPDXDocument.

as_dict()

Return the SPDX relationship as a serializable dict.

__init__(spdx_id: str, related_spdx_id: str, relationship: str, comment: str = '') -> None

class scanpipe.pipes.spdx.Document(name: str, namespace: str, creation_info: ~scanpipe.pipes.spdx.CreationInfo, packages: ~typing.List[~scanpipe.pipes.spdx.Package], spdx_id: str = 'SPDXRef-DOCUMENT', version: str = '2.3', data_license: str = 'CC0-1.0', comment: str = '', files: ~typing.List[~scanpipe.pipes.spdx.File] = <factory>, extracted_licenses: ~typing.List[~scanpipe.pipes.spdx.ExtractedLicensingInfo] = <factory>, relationships: ~typing.List[~scanpipe.pipes.spdx.Relationship] = <factory>)

Collection of section instances each of which contains information about software organized using the SPDX format.

as_dict()

Return the SPDX document as a serializable dict.

as_json(indent=2)

Return the SPDX document as serialized JSON.

static safe_document_name(name)

Convert provided name to a safe SPDX document name.

validate(schema)

Check the validity of this SPDX document.

__init__(name: str, namespace: str, creation_info: ~scanpipe.pipes.spdx.CreationInfo, packages: ~typing.List[~scanpipe.pipes.spdx.Package], spdx_id: str = 'SPDXRef-DOCUMENT', version: str = '2.3', data_license: str = 'CC0-1.0', comment: str = '', files: ~typing.List[~scanpipe.pipes.spdx.File] = <factory>, extracted_licenses: ~typing.List[~scanpipe.pipes.spdx.ExtractedLicensingInfo] = <factory>, relationships: ~typing.List[~scanpipe.pipes.spdx.Relationship] = <factory>) -> None

scanpipe.pipes.spdx.validate_document(document, schema=PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/scancodeio/envs/latest/lib/python3.9/site-packages/scanpipe/pipes/schemas/spdx-schema-2.3.json'))

SPDX document validation. Requires the jsonschema library.

scanpipe.pipes.spdx.is_spdx_document(input_location)

Return True if the file at input_location is an SPDX document.
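
One plausible way to perform such a check is sketched below. The helper looks_like_spdx_json is hypothetical and the heuristic is an assumption rather than the logic of is_spdx_document: it relies on "spdxVersion" and "SPDXID" being required top-level fields in the SPDX 2.3 JSON schema.

```python
import json
import tempfile
from pathlib import Path


def looks_like_spdx_json(input_location):
    """Return True if the JSON file at input_location has SPDX top-level fields.

    Hypothetical heuristic: the SPDX 2.3 JSON schema requires both
    "spdxVersion" and "SPDXID" on a document.
    """
    try:
        data = json.loads(Path(input_location).read_text())
    except (OSError, ValueError):
        # Missing, unreadable, or non-JSON files are not SPDX documents.
        return False
    return isinstance(data, dict) and "spdxVersion" in data and "SPDXID" in data


with tempfile.TemporaryDirectory() as tmp_dir:
    spdx_file = Path(tmp_dir) / "document.spdx.json"
    spdx_file.write_text(
        json.dumps({"spdxVersion": "SPDX-2.3", "SPDXID": "SPDXRef-DOCUMENT"})
    )
    other_file = Path(tmp_dir) / "notes.txt"
    other_file.write_text("not json")
    results = (looks_like_spdx_json(spdx_file), looks_like_spdx_json(other_file))
```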

Flag

scanpipe.pipes.flag.flag_empty_codebase_resources(project)

Flag empty files as ignored.

scanpipe.pipes.flag.flag_ignored_directories(project)

Flag directories as ignored.

scanpipe.pipes.flag.flag_ignored_filenames(project, filenames)

Flag the codebase resources matching the provided list of filenames with the ignored status.

scanpipe.pipes.flag.flag_ignored_extensions(project, extensions)

Flag the codebase resources matching the provided list of extensions with the ignored status.

scanpipe.pipes.flag.flag_ignored_paths(project, paths)

Flag the codebase resources matching the provided list of paths with the ignored status.
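
The three criteria above (filenames, extensions, paths) can be illustrated on plain path strings. The helper is_ignored is hypothetical and the matching rules are assumptions: filenames and extensions are compared exactly, and paths are treated as glob-style patterns. The actual functions work on the project's CodebaseResource records and may match differently.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def is_ignored(path, filenames=(), extensions=(), paths=()):
    """Return True if path matches any of the ignore criteria.

    Hypothetical helper: filenames and extensions are compared exactly,
    paths are treated as glob-style patterns.
    """
    resource = PurePosixPath(path)
    if resource.name in filenames:
        return True
    if resource.suffix in extensions:
        return True
    return any(fnmatchcase(path, pattern) for pattern in paths)


ignored = is_ignored("docs/build/index.html", paths=["docs/*"])
```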

scanpipe.pipes.flag.analyze_scanned_files(project)

Set the status for CodebaseResource to unknown or no license.

scanpipe.pipes.flag.tag_not_analyzed_codebase_resources(project)

Flag codebase resource as not-analyzed.

scanpipe.pipes.flag.flag_mapped_resources(project)

Flag all codebase resources that were mapped during the d2d pipeline.

VulnerableCode

scanpipe.pipes.vulnerablecode.is_configured()

Return True if the required VulnerableCode settings have been set.

scanpipe.pipes.vulnerablecode.is_available()

Return True if the configured VulnerableCode server is available.

scanpipe.pipes.vulnerablecode.get_base_purl(purl)

Return the purl without qualifiers and subpath.
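
In the Package URL syntax, qualifiers follow a "?" and the subpath follows a "#", so a base purl can be obtained by cutting the string at those markers. The helper base_purl below is a hypothetical sketch of that idea, not the implementation of get_base_purl.

```python
def base_purl(purl):
    """Return purl without its qualifiers ("?...") and subpath ("#...")."""
    without_subpath = purl.split("#", 1)[0]
    return without_subpath.split("?", 1)[0]


purls = [
    "pkg:pypi/lxml@3.3.5",
    "pkg:deb/debian/curl@7.50.3-1?arch=i386&distro=jessie",
    "pkg:golang/google.golang.org/genproto#googleapis/api/annotations",
]
base_purls = [base_purl(purl) for purl in purls]
```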

scanpipe.pipes.vulnerablecode.get_purls(packages, base=False)

Return the PURLs for the given list of packages. Do not include qualifiers or subpath when base is True.

scanpipe.pipes.vulnerablecode.request_get(url, payload=None, timeout=None)

Wrap the HTTP request calls on the API.

scanpipe.pipes.vulnerablecode.get_vulnerabilities_by_purl(purl, timeout=None, api_url=None)

Get the list of vulnerabilities for the provided package purl.

scanpipe.pipes.vulnerablecode.get_vulnerabilities_by_cpe(cpe, timeout=None, api_url=None)

Get the list of vulnerabilities for the provided package or component cpe.

scanpipe.pipes.vulnerablecode.bulk_search_by_purl(purls, timeout=None, api_url=None)

Bulk search of vulnerabilities using the provided list of purls.

scanpipe.pipes.vulnerablecode.bulk_search_by_cpes(cpes, timeout=None, api_url=None)

Bulk search of vulnerabilities using the provided list of cpes.

Windows

scanpipe.pipes.windows.package_getter(root_dir, **kwargs)

Return installed package objects.

scanpipe.pipes.windows.tag_uninteresting_windows_codebase_resources(project)

Tag known uninteresting files as uninteresting.

scanpipe.pipes.windows.tag_installed_package_files(project, root_dir_pattern, package, q_objects=None)

For all CodebaseResources from project whose rootfs_path starts with root_dir_pattern, add package to the discovered_packages of each CodebaseResource and set the status.

scanpipe.pipes.windows.tag_known_software(project)

Find Windows software in project by checking CodebaseResources to see if their rootfs_path is under a known software root directory. If there are CodebaseResources that are under a known software root directory, a DiscoveredPackage is created for that software package and all files under that software package’s root directory are considered installed files for that package.

Currently, we are only checking for Python and openjdk in Windows Docker image layers.

If a version number cannot be determined for an installed software Package, then a version number of “nv” will be set.

scanpipe.pipes.windows.tag_program_files(project)

Report all subdirectories of Program Files and Program Files (x86) as Packages.

If a Package is detected in this manner, then we will attempt to determine the version from the path. If a version cannot be determined, a version of nv will be set for the Package.
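
The fallback to nv can be sketched with a small regular expression. Both the helper version_from_path and the pattern are hypothetical: the pattern only recognizes two or more dot-separated groups of digits, which is an assumption and may differ from how the pipe extracts versions.

```python
import re

# Hypothetical pattern: two or more dot-separated groups of digits.
VERSION_PATTERN = re.compile(r"\d+(?:\.\d+)+")


def version_from_path(path, default="nv"):
    """Return the first version-like string found in path, else default."""
    match = VERSION_PATTERN.search(path)
    return match.group() if match else default


jdk_version = version_from_path("Program Files/openjdk-11.0.2")
git_version = version_from_path("Program Files/Git")
```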