Data Models

This section describes the structure of the ScanCode.io data model and details all fields included in the output files.

Project

class scanpipe.models.Project

The Project encapsulates all analysis processing. Multiple analysis pipelines can be run on the same project.

Parameters
  • uuid (UUIDField) – Primary key: UUID

  • extra_data (JSONField) – Extra data. Optional mapping of extra data key/values.

  • created_date (DateTimeField) – Created date. Creation date for this project.

  • name (CharField) – Name. Name for this project.

  • work_directory (CharField) – Work directory. Project work directory location.

  • input_sources (JSONField) – Input sources

  • is_archived (BooleanField) – Is archived. Archived projects cannot be modified anymore and are not displayed by default in project lists. Multiple levels of data cleanup may have happened during the archive operation.

add_downloads(downloads)

Moves the given downloads to the current project’s input/ directory and adds the input_source for each entry.

add_error(error, model, details=None)

Creates a “ProjectError” record from the provided error Exception for this project. The model attribute can be provided as a string or as a Model class.

add_input_source(filename, source, save=False)

Adds given filename and source to the current project’s input_sources field.

add_pipeline(pipeline_name, execute_now=False)

Creates a new Run instance with the provided pipeline on the current project.

If execute_now is True, the pipeline task is created. on_commit() is used to postpone the task creation until after the transaction is successfully committed. If there are no active transactions, the callback is executed immediately.

add_uploads(uploads)

Writes the given uploads to the current project’s input/ directory and adds the input_source for each entry.

add_webhook_subscription(target_url)

Creates a new WebhookSubscription instance with the provided target_url for the current project.

archive(remove_input=False, remove_codebase=False, remove_output=False)

Sets the project is_archived field to True.

The remove_input, remove_codebase, and remove_output options can be provided during the archive operation to delete the related work directories.

The project cannot be archived if one of its related runs is queued or already running.

clear_tmp_directory()

Deletes the whole content of the tmp/ directory. This is called at the end of each pipeline Run, as the tmp/ directory does not store any content needed for further processing in a following pipeline Run.

copy_input_from(input_location)

Copies the file at input_location to the current project’s input/ directory.

delete(*args, **kwargs)

Deletes the work_directory along with all project-related data in the database.

get_latest_failed_run()

Returns the latest failed Run instance of the current project.

get_latest_output(filename)

Returns the latest output file with the “filename” prefix, for example “scancode-<timestamp>.json”.
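As a rough illustration of the prefix lookup, the sketch below picks the most recent file by sorting file names, which only works because the assumed timestamp layout sorts chronologically. The standalone function, the timestamp layout, and the sample file names are assumptions made for this example and are not taken from the ScanCode.io implementation.

```python
import tempfile
from pathlib import Path


def get_latest_output(output_path, filename):
    """Return the last file, in sorted order, named "<filename>-*", or None."""
    candidates = sorted(Path(output_path).glob(f"{filename}-*"))
    return candidates[-1] if candidates else None


with tempfile.TemporaryDirectory() as tmp:
    output_path = Path(tmp)
    # Hypothetical output files; the timestamp layout is an assumption.
    for name in (
        "scancode-2021-01-01-00-00-00.json",
        "scancode-2021-06-01-00-00-00.json",
        "summary-2021-09-01-00-00-00.json",
    ):
        (output_path / name).write_text("{}")

    latest = get_latest_output(output_path, "scancode")
    missing = get_latest_output(output_path, "inventory")
```

Here latest points at the June file, and missing is None because no file starts with the "inventory" prefix.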

get_next_run()

Returns the next non-executed Run instance assigned to the current project.

get_output_file_path(name, extension)

Returns a crafted file path in the project output/ directory using the given name and extension. The current date and time strings are added to the filename.

This method ensures the proper setup of the work_directory in case of a manual wipe and re-creates the missing pieces of the directory structure.
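The way such a path might be crafted can be sketched with the standard library. The helper name craft_output_file_path, the now parameter, and the exact timestamp layout below are assumptions for illustration; the real method may format the date and time differently.

```python
from datetime import datetime
from pathlib import Path


def craft_output_file_path(output_path, name, extension, now=None):
    """Build <output_path>/<name>-<timestamp>.<extension>."""
    now = now or datetime.now()
    # Assumption: the timestamp layout is chosen for this example only.
    timestamp = now.strftime("%Y-%m-%d-%H-%M-%S")
    return Path(output_path) / f"{name}-{timestamp}.{extension}"


# A fixed date keeps the example reproducible.
path = craft_output_file_path(
    "output", "scancode", "json", now=datetime(2021, 3, 4, 5, 6, 7)
)
```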

static get_root_content(directory)

Returns a list of all files and directories of a given directory. Only the first level children will be listed.

inputs(pattern='**/*')

Returns all file and directory paths of the input/ directory matching a given pattern. The default **/* pattern means “this directory and all subdirectories, recursively”. Use the * pattern to only list the root content.
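The difference between the two patterns can be reproduced with pathlib alone. The sketch below runs on a temporary directory and is not the ScanCode.io implementation; the helper name list_inputs and the sample file names are made up for the example.

```python
import tempfile
from pathlib import Path


def list_inputs(input_path, pattern="**/*"):
    """Return the sorted relative paths under input_path matching pattern."""
    input_path = Path(input_path)
    return sorted(
        path.relative_to(input_path).as_posix()
        for path in input_path.glob(pattern)
    )


with tempfile.TemporaryDirectory() as tmp:
    input_path = Path(tmp)
    (input_path / "archive.tar.gz").write_text("data")
    (input_path / "sub").mkdir()
    (input_path / "sub" / "notes.txt").write_text("data")

    recursive = list_inputs(input_path)       # default "**/*" pattern
    root_only = list_inputs(input_path, "*")  # first level only
```

With the default pattern the nested sub/notes.txt entry is included, while the * pattern stops at archive.tar.gz and sub.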

move_input_from(input_location)

Moves the file at input_location to the current project’s input/ directory.

reset(keep_input=True)

Resets the project by deleting all related database objects and all work directories. The input directory is kept when the keep_input option is True.

save(*args, **kwargs)

Saves this project instance. The workspace directories are set up during project creation.

setup_work_directory()

Creates all of the work_directory structure and skips if already existing.
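A minimal sketch of an idempotent directory setup, assuming the four subdirectories named in WORK_DIRECTORIES live directly under the work directory. The standalone function below only mirrors the described behavior and is not the model method itself.

```python
import tempfile
from pathlib import Path

WORK_DIRECTORIES = ["input", "output", "codebase", "tmp"]


def setup_work_directory(work_path):
    """Create the work directory subfolders, skipping those that exist."""
    for name in WORK_DIRECTORIES:
        (Path(work_path) / name).mkdir(parents=True, exist_ok=True)


with tempfile.TemporaryDirectory() as tmp:
    setup_work_directory(tmp)
    setup_work_directory(tmp)  # Second call finds everything in place.
    created = sorted(path.name for path in Path(tmp).iterdir())
```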

walk_codebase_path()

Returns all file and directory paths of the codebase/ directory recursively.

write_input_file(file_object)

Writes the provided file_object to the project’s input/ directory.

WORK_DIRECTORIES = ['input', 'output', 'codebase', 'tmp']
can_add_input

Returns True until one pipeline run has started to execute on the project.

property codebase_path

Returns the codebase directory as a Path instance.

codebaseresources

Type: Reverse ForeignKey from CodebaseResource

All codebaseresources of this project (related name of project)

created_date

Type: DateTimeField

Created date. Creation date for this project.

discoveredpackages

Type: Reverse ForeignKey from DiscoveredPackage

All discoveredpackages of this project (related name of project)

error_count

Returns the number of errors related to this project.

extra_data

Type: JSONField

Extra data. Optional mapping of extra data key/values.

file_count

Returns the number of file resources related to this project.

file_in_package_count

Returns the number of file resources in a package related to this project.

file_not_in_package_count

Returns the number of file resources not in a package related to this project.

property input_files

Returns a list of files’ relative paths in the input/ directory recursively.

property input_path

Returns the input directory as a Path instance.

property input_root

Returns a list of all files and directories of the input/ directory. Only the first level children will be listed.

input_sources

Type: JSONField

Input sources

property input_sources_list
property inputs_with_source

Returns a list of inputs including the source, type, sha256, and size data. Returns the missing_inputs defined in the input_sources field but not available in the input/ directory. Only first level children will be listed.

is_archived

Type: BooleanField

Is archived. Archived projects cannot be modified anymore and are not displayed by default in project lists. Multiple levels of data cleanup may have happened during the archive operation.

name

Type: CharField

Name. Name for this project.

property output_path

Returns the output directory as a Path instance.

property output_root

Returns a list of all files and directories of the output/ directory. Only first level children will be listed.

package_count

Returns the number of packages related to this project.

projecterrors

Type: Reverse ForeignKey from ProjectError

All projecterrors of this project (related name of project)

resource_count

Returns the number of resources related to this project.

runs

Type: Reverse ForeignKey from Run

All runs of this project (related name of project)

property tmp_path

Returns the tmp directory as a Path instance.

uuid

Type: UUIDField

Primary key: UUID

webhooksubscriptions

Type: Reverse ForeignKey from WebhookSubscription

All webhooksubscriptions of this project (related name of project)

work_directory

Type: CharField

Work directory. Project work directory location.

property work_path

Returns the work_directory as a Path instance.

CodebaseResource

class scanpipe.models.CodebaseResource

A project’s Codebase Resources are records of its code files and directories. Each record is identified by its path under the project workspace.

Parameters
  • id (AutoField) – Primary key: ID

  • path (CharField) – Path. The full path value of a resource (file or directory) in the archive it is from.

  • size (BigIntegerField) – Size. Size in bytes.

  • sha1 (CharField) – Sha1. SHA1 checksum hex-encoded, as in sha1sum.

  • md5 (CharField) – Md5. MD5 checksum hex-encoded, as in md5sum.

  • sha256 (CharField) – Sha256. SHA256 checksum hex-encoded, as in sha256sum.

  • sha512 (CharField) – Sha512. SHA512 checksum hex-encoded, as in sha512sum.

  • extra_data (JSONField) – Extra data. Optional mapping of extra data key/values.

  • copyrights (JSONField) – Copyrights. List of detected copyright statements (and related detection details).

  • holders (JSONField) – Holders. List of detected copyright holders (and related detection details).

  • authors (JSONField) – Authors. List of detected authors (and related detection details).

  • licenses (JSONField) – Licenses. List of license detection details.

  • license_expressions (JSONField) – License expressions. List of detected license expressions.

  • emails (JSONField) – Emails. List of detected emails (and related detection details).

  • urls (JSONField) – Urls. List of detected URLs (and related detection details).

  • rootfs_path (CharField) – Rootfs path. Path relative to some root filesystem root directory. Useful when working on disk images, Docker images, and VM images. E.g.: “/usr/bin/bash” for a path of “tarball-extract/rootfs/usr/bin/bash”.

  • status (CharField) – Status. Analysis status for this resource.

  • tag (CharField) – Tag

  • type (CharField) – Type. Type of this resource as one of: file, directory, symlink

  • name (CharField) – Name. File or directory name of this resource.

  • extension (CharField) – Extension. File extension for this resource (directories do not have an extension).

  • programming_language (CharField) – Programming language. Programming language of this resource if this is a code file.

  • mime_type (CharField) – Mime type. MIME type (aka. media type) for this resource. See https://en.wikipedia.org/wiki/Media_type

  • file_type (CharField) – File type. Descriptive file type for this resource.

  • is_binary (BooleanField) – Is binary

  • is_text (BooleanField) – Is text

  • is_archive (BooleanField) – Is archive

  • is_key_file (BooleanField) – Is key file

  • is_media (BooleanField) – Is media

  • compliance_alert (CharField) – Compliance alert. Indicates how the detected licenses in a codebase resource comply with provided policies.

Relationship fields:

Parameters

project (ForeignKey to Project) – Project (related name: codebaseresources)

Reverse relationships:

Parameters

discovered_packages (Reverse ManyToManyField from DiscoveredPackage) – All discovered packages of this codebase resource (related name of codebase_resources)

class Compliance(value)

List of compliance alert values.

ERROR = 'error'
MISSING = 'missing'
OK = 'ok'
WARNING = 'warning'
class Type(value)

List of CodebaseResource types.

DIRECTORY = 'directory'
FILE = 'file'
children(codebase=None)

Returns a QuerySet of direct children CodebaseResource objects using a database query on the current CodebaseResource path.

Paths are returned in lower-cased sorted path order to reflect the behavior of the commoncode.resource.Resource.children() https://github.com/nexB/commoncode/blob/main/src/commoncode/resource.py

codebase is not used in this context but required for compatibility with the commoncode.resource.VirtualCodebase class API.
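The path-based selection of direct children can be illustrated on a plain list of paths. This is a sketch of the idea only: the real method issues a database query, and the function signature and sample paths here are assumptions.

```python
def children(paths, parent):
    """Return the direct children of parent among paths, lower-case sorted."""
    prefix = parent.rstrip("/") + "/"
    direct = []
    for path in paths:
        if not path.startswith(prefix):
            continue
        rest = path[len(prefix):]
        # A direct child has exactly one more path segment than its parent.
        if rest and "/" not in rest:
            direct.append(path)
    return sorted(direct, key=str.lower)


paths = ["src", "src/Zed.py", "src/app", "src/app/main.py", "src/b.txt", "docs"]
result = children(paths, "src")
```

The nested src/app/main.py entry is skipped, and src/Zed.py sorts last because the comparison ignores case.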

compute_compliance_alert()

Computes and returns the compliance_alert value from the licenses policies.

create_and_add_package(package_data)

Creates a DiscoveredPackage instance using the package_data and assigns it to the current CodebaseResource instance.

Errors that may happen during the DiscoveredPackage creation are captured at this level, rather than in the DiscoveredPackage.create_from_data level, so resource data can be injected in the ProjectError record.

descendants()

Returns a QuerySet of descendant CodebaseResource objects using a database query on the current CodebaseResource path. The current CodebaseResource is not included.

get_compliance_alert_display(*, field=<django.db.models.CharField: compliance_alert>)

Shows the label of the compliance_alert. See get_FOO_display() for more information.

get_raw_url()

Returns the URL to access the RAW content of the resource.

get_type_display(*, field=<django.db.models.CharField: type>)

Shows the label of the type. See get_FOO_display() for more information.

inject_licenses_policy(policies_index)

Injects license policies from the policies_index into the licenses field.

save(*args, **kwargs)

Saves the current resource instance. Injects policies—if the feature is enabled—when the licenses field value is changed.

walk(topdown=True)

Returns all descendant Resources of the current Resource; does not include self.

Traverses the tree top-down, depth-first if topdown is True; otherwise traverses the tree bottom-up.
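The two traversal orders can be sketched over a flat list of paths. The helper names and the list-based lookup are assumptions for illustration; the model works on database records rather than on strings.

```python
def direct_children(paths, parent):
    """Return the paths exactly one segment below parent, lower-case sorted."""
    prefix = parent + "/"
    return sorted(
        (p for p in paths if p.startswith(prefix) and "/" not in p[len(prefix):]),
        key=str.lower,
    )


def walk(paths, root, topdown=True):
    """Yield every descendant of root depth-first, excluding root itself."""
    for child in direct_children(paths, root):
        if topdown:
            yield child  # Parent is emitted before its own descendants.
        yield from walk(paths, child, topdown)
        if not topdown:
            yield child  # Parent is emitted after its own descendants.


paths = ["src", "src/app", "src/app/main.py", "src/b.txt"]
top_down = list(walk(paths, "src"))
bottom_up = list(walk(paths, "src", topdown=False))
```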

authors

Type: JSONField

Authors. List of detected authors (and related detection details).

compliance_alert

Type: CharField

Compliance alert. Indicates how the detected licenses in a codebase resource comply with provided policies.

Choices:

  • ok

  • warning

  • error

  • missing

copyrights

Type: JSONField

Copyrights. List of detected copyright statements (and related detection details).

discovered_packages

Type: Reverse ManyToManyField from DiscoveredPackage

All discovered packages of this codebase resource (related name of codebase_resources)

emails

Type: JSONField

Emails. List of detected emails (and related detection details).

extension

Type: CharField

Extension. File extension for this resource (directories do not have an extension).

extra_data

Type: JSONField

Extra data. Optional mapping of extra data key/values.

property file_content

Returns the content of the current Resource file using TextCode utilities for optimal compatibility.

file_type

Type: CharField

File type. Descriptive file type for this resource.

property for_packages

Returns the list of all discovered packages associated with this resource.

holders

Type: JSONField

Holders. List of detected copyright holders (and related detection details).

id

Type: AutoField

Primary key: ID

is_archive

Type: BooleanField

Is archive

is_binary

Type: BooleanField

Is binary

property is_dir

Returns True, if the resource is a directory.

property is_file

Returns True, if the resource is a file.

is_key_file

Type: BooleanField

Is key file

is_media

Type: BooleanField

Is media

property is_symlink

Returns True, if the resource is a symlink.

is_text

Type: BooleanField

Is text

license_expressions

Type: JSONField

License expressions. List of detected license expressions.

licenses

Type: JSONField

Licenses. List of license detection details.

property location

Returns the location of the resource as a string.

property location_path

Returns the location of the resource as a Path instance.

md5

Type: CharField

Md5. MD5 checksum hex-encoded, as in md5sum.

mime_type

Type: CharField

Mime type. MIME type (aka. media type) for this resource. See https://en.wikipedia.org/wiki/Media_type

name

Type: CharField

Name. File or directory name of this resource.

path

Type: CharField

Path. The full path value of a resource (file or directory) in the archive it is from.

programming_language

Type: CharField

Programming language. Programming language of this resource if this is a code file.

project

Type: ForeignKey to Project

Project (related name: codebaseresources)

project_id

Internal field, use project instead.

rootfs_path

Type: CharField

Rootfs path. Path relative to some root filesystem root directory. Useful when working on disk images, Docker images, and VM images. E.g.: “/usr/bin/bash” for a path of “tarball-extract/rootfs/usr/bin/bash”.
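The example in the field description can be reproduced with a small string helper. The function name and the assumption that the root filesystem always sits under a directory named rootfs are illustrative only; pipelines may derive this value differently.

```python
def rootfs_path_from(path, marker="/rootfs/"):
    """Return the path relative to the root filesystem, or "" if none."""
    _, found, remainder = path.partition(marker)
    # Assumption: everything after the first "/rootfs/" segment is the
    # path inside the root filesystem.
    return f"/{remainder}" if found else ""


example = rootfs_path_from("tarball-extract/rootfs/usr/bin/bash")
no_rootfs = rootfs_path_from("src/main.py")
```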

sha1

Type: CharField

Sha1. SHA1 checksum hex-encoded, as in sha1sum.

sha256

Type: CharField

Sha256. SHA256 checksum hex-encoded, as in sha256sum.

sha512

Type: CharField

Sha512. SHA512 checksum hex-encoded, as in sha512sum.
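The hex-encoded values stored in the md5, sha1, sha256, and sha512 fields have the same form as the output of the corresponding command-line tools. A minimal sketch using hashlib follows; the helper name compute_checksums is an assumption and not part of the documented API.

```python
import hashlib


def compute_checksums(content):
    """Return hex-encoded MD5, SHA1, SHA256, and SHA512 digests of bytes."""
    return {
        name: hashlib.new(name, content).hexdigest()
        for name in ("md5", "sha1", "sha256", "sha512")
    }


# Digests of empty content, handy as well-known reference values.
checksums = compute_checksums(b"")
```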

size

Type: BigIntegerField

Size. Size in bytes.

status

Type: CharField

Status. Analysis status for this resource.

tag

Type: CharField

Tag

type

Type: CharField

Type. Type of this resource as one of: file, directory, symlink

Choices:

  • file

  • directory

  • symlink

property unique_license_expressions

Returns the sorted set of unique license_expressions.
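As a small illustration, deduplicating and sorting a list of expressions can be written as follows. The sample expressions are made up, and the standalone function only mirrors what the property description states.

```python
def unique_license_expressions(license_expressions):
    """Return the unique expressions in sorted order."""
    return sorted(set(license_expressions))


expressions = ["mit", "apache-2.0", "mit", "gpl-2.0 OR mit"]
unique = unique_license_expressions(expressions)
```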

urls

Type: JSONField

Urls. List of detected URLs (and related detection details).

DiscoveredPackage

class scanpipe.models.DiscoveredPackage

A project’s Discovered Packages are records of the system and application packages discovered in the code under analysis. Each record is identified by its Package URL. Package URL is a fundamental effort to create informative identifiers for software packages, such as Debian, RPM, npm, Maven, or PyPI packages. See https://github.com/package-url for more details.

Parameters
  • id (AutoField) – Primary key: ID

  • type (CharField) – Type. A short code to identify the type of this package. For example: gem for a Rubygem, docker for a container, pypi for a Python Wheel or Egg, maven for a Maven Jar, deb for a Debian package, etc.

  • namespace (CharField) – Namespace. Package name prefix, such as Maven groupid, Docker image owner, GitHub user or organization, etc.

  • name (CharField) – Name. Name of the package.

  • version (CharField) – Version. Version of the package.

  • qualifiers (CharField) – Qualifiers. Extra qualifying data for a package such as the name of an OS, architecture, distro, etc.

  • subpath (CharField) – Subpath. Extra subpath within a package, relative to the package root.

  • uuid (UUIDField) – UUID

  • last_modified_date (DateTimeField) – Last modified date. Timestamp set when a Package is created or modified

  • filename (CharField) – Filename. File name of a Resource sometimes part of the URI proper and sometimes only available through an HTTP header.

  • primary_language (CharField) – Primary language. Primary programming language

  • description (TextField) – Description. Description for this package. By convention the first line should be a summary when available.

  • release_date (DateField) – Release date. The date that the package file was created, or when it was posted to its original download source.

  • homepage_url (CharField) – Homepage url. URL to the homepage for this package.

  • download_url (CharField) – Download url. A direct download URL.

  • size (BigIntegerField) – Size. Size in bytes.

  • sha1 (CharField) – Download SHA1. SHA1 checksum hex-encoded, as in sha1sum.

  • md5 (CharField) – Download MD5. MD5 checksum hex-encoded, as in md5sum.

  • bug_tracking_url (CharField) – Bug tracking url. URL to the issue or bug tracker for this package

  • code_view_url (CharField) – Code view url. A URL where the code can be browsed online

  • vcs_url (CharField) – Vcs url. A URL to the VCS repository in the SPDX form of: “git”, “svn”, “hg”, “bzr”, “cvs”, https://github.com/nexb/scancode-toolkit.git@405aaa4b3 See SPDX specification “Package Download Location” at https://spdx.org/spdx-specification-21-web-version#h.49x2ik5

  • copyright (TextField) – Copyright. Copyright statements for this package. Typically one per line.

  • license_expression (TextField) – License expression. The normalized license expression for this package as derived from its declared license.

  • declared_license (TextField) – Declared license. The declared license mention or tag or text as found in a package manifest.

  • notice_text (TextField) – Notice text. A notice text for this package.

  • manifest_path (CharField) – Manifest path. A relative path to the manifest file if any, such as a Maven .pom or a npm package.json.

  • contains_source_code (BooleanField) – Contains source code

  • extra_data (JSONField) – Extra data. Optional mapping of extra data key/values.

  • missing_resources (JSONField) – Missing resources

  • modified_resources (JSONField) – Modified resources

  • dependencies (JSONField) – Dependencies. A list of dependencies for this package.

  • package_uid (CharField) – Package uid. Unique identifier for this package.

  • keywords (JSONField) – Keywords

  • source_packages (JSONField) – Source packages

classmethod create_from_data(project, package_data)

Creates and returns a DiscoveredPackage for a project from the package_data. If one of the values of the required fields is not available, a “ProjectError” is created instead of a new DiscoveredPackage instance.

classmethod extract_purl_data(package_data)
classmethod purl_fields()
update_from_data(package_data, override=False)

Updates this discovered package instance with the provided package_data. The save() is called only if at least one field was modified.
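The "save only when something changed" behavior can be sketched with a plain Python object. Everything below is hypothetical: the PackageRecord class, the returned list of updated field names, and in particular the reading of override as "replace existing values" are assumptions, since the description above does not spell out those details.

```python
class PackageRecord:
    """Hypothetical stand-in for a model instance, for illustration only."""

    def __init__(self, **fields):
        self.__dict__.update(fields)
        self.save_calls = 0

    def save(self):
        self.save_calls += 1

    def update_from_data(self, package_data, override=False):
        """Update fields from package_data and return the modified names."""
        updated = []
        for name, new_value in package_data.items():
            current = getattr(self, name, None)
            if not new_value or new_value == current:
                continue
            # Assumption: existing values are kept unless override is True.
            if current and not override:
                continue
            setattr(self, name, new_value)
            updated.append(name)
        if updated:
            self.save()  # Called once, and only when a field changed.
        return updated


package = PackageRecord(name="curl", version="")
first = package.update_from_data({"version": "7.50.3", "name": "libcurl"})
second = package.update_from_data({"version": "7.50.3"})
third = package.update_from_data({"name": "libcurl"}, override=True)
```

In this sketch the first call fills the empty version, the second call changes nothing and skips save(), and the third call replaces the name because override is set.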

bug_tracking_url

Type: CharField

Bug tracking url. URL to the issue or bug tracker for this package

code_view_url

Type: CharField

Code view url. A URL where the code can be browsed online

codebase_resources

Type: ManyToManyField to CodebaseResource

Codebase resources (related name: discovered_packages)

contains_source_code

Type: BooleanField

Contains source code

copyright

Type: TextField

Copyright. Copyright statements for this package. Typically one per line.

declared_license

Type: TextField

Declared license. The declared license mention or tag or text as found in a package manifest.

dependencies

Type: JSONField

Dependencies. A list of dependencies for this package.

description

Type: TextField

Description. Description for this package. By convention the first line should be a summary when available.

download_url

Type: CharField

Download url. A direct download URL.

extra_data

Type: JSONField

Extra data. Optional mapping of extra data key/values.

filename

Type: CharField

Filename. File name of a Resource sometimes part of the URI proper and sometimes only available through an HTTP header.

homepage_url

Type: CharField

Homepage url. URL to the homepage for this package.

id

Type: AutoField

Primary key: ID

keywords

Type: JSONField

Keywords

last_modified_date

Type: DateTimeField

Last modified date. Timestamp set when a Package is created or modified

license_expression

Type: TextField

License expression. The normalized license expression for this package as derived from its declared license.

manifest_path

Type: CharField

Manifest path. A relative path to the manifest file if any, such as a Maven .pom or a npm package.json.

md5

Type: CharField

Download MD5. MD5 checksum hex-encoded, as in md5sum.

missing_resources

Type: JSONField

Missing resources

modified_resources

Type: JSONField

Modified resources

name

Type: CharField

Name. Name of the package.

namespace

Type: CharField

Namespace. Package name prefix, such as Maven groupid, Docker image owner, GitHub user or organization, etc.

notice_text

Type: TextField

Notice text. A notice text for this package.

package_uid

Type: CharField

Package uid. Unique identifier for this package.

primary_language

Type: CharField

Primary language. Primary programming language

project

Type: ForeignKey to Project

Project (related name: discoveredpackages)

project_id

Internal field, use project instead.

property purl

Returns the Package URL.
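A Package URL combines the type, namespace, name, version, qualifiers, and subpath fields into a single string of the form pkg:type/namespace/name@version?qualifiers#subpath. The helper below is a simplified sketch: the function name is an assumption, and it skips the percent-encoding and normalization rules of the Package URL specification, so it is not a substitute for a dedicated purl library.

```python
def build_purl(type, name, namespace="", version="", qualifiers="", subpath=""):
    """Assemble a Package URL string from already-normalized components."""
    purl = f"pkg:{type}/"
    if namespace:
        purl += f"{namespace}/"
    purl += name
    if version:
        purl += f"@{version}"
    if qualifiers:
        purl += f"?{qualifiers}"
    if subpath:
        purl += f"#{subpath}"
    return purl


maven_purl = build_purl(
    "maven", "commons-io", namespace="commons-io", version="2.8.0"
)
deb_purl = build_purl(
    "deb", "curl", namespace="debian", version="7.50.3-1", qualifiers="arch=i386"
)
```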

qualifiers

Type: CharField

Qualifiers. Extra qualifying data for a package such as the name of an OS, architecture, distro, etc.

release_date

Type: DateField

Release date. The date that the package file was created, or when it was posted to its original download source.

sha1

Type: CharField

Download SHA1. SHA1 checksum hex-encoded, as in sha1sum.

size

Type: BigIntegerField

Size. Size in bytes.

source_packages

Type: JSONField

Source packages

subpath

Type: CharField

Subpath. Extra subpath within a package, relative to the package root.

type

Type: CharField

Type. A short code to identify the type of this package. For example: gem for a Rubygem, docker for a container, pypi for a Python Wheel or Egg, maven for a Maven Jar, deb for a Debian package, etc.

uuid

Type: UUIDField

UUID

vcs_url

Type: CharField

Vcs url. A URL to the VCS repository in the SPDX form of: “git”, “svn”, “hg”, “bzr”, “cvs”, https://github.com/nexb/scancode-toolkit.git@405aaa4b3 See SPDX specification “Package Download Location” at https://spdx.org/spdx-specification-21-web-version#h.49x2ik5

version

Type: CharField

Version. Version of the package.

ProjectError

class scanpipe.models.ProjectError

Stores errors and exceptions raised during a pipeline run.

Parameters
  • uuid (UUIDField) – Primary key: UUID

  • created_date (DateTimeField) – Created date

  • model (CharField) – Model. Name of the model class.

  • details (JSONField) – Details. Data that caused the error.

  • message (TextField) – Message. Error message.

  • traceback (TextField) – Traceback. Exception traceback.

Relationship fields:

Parameters

project (ForeignKey to Project) – Project (related name: projecterrors)

created_date

Type: DateTimeField

Created date

details

Type: JSONField

Details. Data that caused the error.

message

Type: TextField

Message. Error message.

model

Type: CharField

Model. Name of the model class.

project

Type: ForeignKey to Project

Project (related name: projecterrors)

project_id

Internal field, use project instead.

traceback

Type: TextField

Traceback. Exception traceback.

uuid

Type: UUIDField

Primary key: UUID

Run

class scanpipe.models.Run

The database representation of a pipeline execution.

Relationship fields:

Parameters

project (ForeignKey to Project) – Project (related name: runs)

append_to_log(message, save=False)

Appends the message string to the log field of this Run instance.

execute_task_async()

Enqueues the pipeline execution task for an asynchronous execution.

make_pipeline_instance()

Returns a pipeline instance using this Run pipeline_class.

profile(print_results=False)

Returns computed execution times for each step in the current Run.

If print_results is True, the results are printed to stdout.

send_project_subscriptions()

Triggers related project webhook subscriptions.

set_scancodeio_version()

Sets the current ScanCode.io version on the Run.scancodeio_version field.

sync_with_job()

Synchronise this Run instance with its related RQ Job.

This is required when a Run gets out of sync with its Job. This can happen when the worker or one of its processes is killed: the Run status is not properly updated and may stay in a Queued or Running state forever.

When the Run is out of sync with its related Job, the Run status is updated accordingly. When the Run was in the queue, it is enqueued again.

created_date

Type: DateTimeField

Created date

description

Type: TextField

Description

log

Type: TextField

Log

property pipeline_class

Returns this Run pipeline_class.

pipeline_name

Type: CharField

Pipeline name. Identify a registered Pipeline class.

project

Type: ForeignKey to Project

Project (related name: runs)

project_id

Internal field, use project instead.

scancodeio_version

Type: CharField

Scancodeio version

task_end_date

Type: DateTimeField

Task end date

task_exitcode

Type: IntegerField

Task exitcode

task_id

Type: UUIDField

Task id

task_output

Type: TextField

Task output

task_start_date

Type: DateTimeField

Task start date

uuid

Type: UUIDField

Primary key: UUID