ScanPipe Concepts

Project

A project encapsulates the analysis of software code:

  • It has a Project workspace, which is a directory that contains the software code files under analysis.

  • It uses one or more Pipelines, i.e. scripts that automate the code analysis process.

  • It tracks Codebase Resources, i.e. its code files and directories.

  • It tracks Discovered Packages, i.e. the system and application packages discovered in the codebase, along with their origin and license.

In the database, a project is identified by its unique name.

Note

Multiple analysis pipelines can be run on a single project.

Project workspace

A project workspace is the root directory where a project’s files are stored.

The following directories exist under the workspace directory:

  • input/ contains all uploaded files used as the input of a project, such as a codebase archive.

  • codebase/ contains files and directories - i.e. resources - tracked as CodebaseResource records in the database.

  • output/ contains any output files created by the pipelines, including reports, scan results, etc.

  • tmp/ is a scratch pad for temporary files generated during pipeline runs.
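The layout above can be sketched with the standard library alone. This is only an illustration of the directory structure: the helper names create_workspace and list_workspace are hypothetical and are not part of ScanCode.io, which manages the workspace itself.

```python
import tempfile
from pathlib import Path

# Directory names taken from the list above.
WORKSPACE_DIRS = ("input", "codebase", "output", "tmp")


def create_workspace(root):
    """Create the four workspace sub-directories under `root`.

    Hypothetical helper for illustration only.
    """
    root = Path(root)
    for name in WORKSPACE_DIRS:
        (root / name).mkdir(parents=True, exist_ok=True)
    return root


def list_workspace(root):
    """Return the sorted names of the sub-directories found under `root`."""
    return sorted(p.name for p in Path(root).iterdir() if p.is_dir())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        workspace = create_workspace(Path(tmp) / "my-project")
        print(list_workspace(workspace))
```

Running the module creates a throwaway workspace in a temporary directory and prints the names of the sub-directories it contains.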

Pipelines

A pipeline is a Python script that contains a series of steps, which are executed sequentially to perform a code analysis.

It usually starts with the uploaded input files, which might need to be extracted first. Then, it generates CodebaseResource records in the database accordingly.

Those resources can then be analyzed, scanned, and matched as needed. Analysis results and reports are eventually posted at the end of a pipeline run.

All Built-in Pipelines are located in the scanpipe.pipelines module. Each pipeline consists of a Python script and includes one subclass of the Pipeline class. Each step is a method of that subclass. The execution order of the steps is declared through the steps class attribute.
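To make the idea of sequential steps concrete, here is a self-contained toy sketch. It does not use or reproduce the real scanpipe Pipeline class: the base class, the execute method, the log attribute, and the choice of declaring steps as method names are all assumptions made for illustration.

```python
class Pipeline:
    """Toy stand-in for a pipeline base class; NOT the scanpipe implementation."""

    # Subclasses list their step method names in execution order (assumed form).
    steps = ()

    def __init__(self, project):
        self.project = project
        self.log = []

    def execute(self):
        """Run each declared step sequentially and record its name."""
        for step_name in self.steps:
            getattr(self, step_name)()
            self.log.append(step_name)
        return self.log


class ExamplePipeline(Pipeline):
    """Hypothetical pipeline with three steps operating on a plain dict."""

    steps = ("extract_inputs", "collect_resources", "report")

    def extract_inputs(self):
        self.project["extracted"] = True

    def collect_resources(self):
        self.project["resources"] = ["a.py", "b.py"]

    def report(self):
        count = len(self.project["resources"])
        self.project["report"] = f"{count} resources"


if __name__ == "__main__":
    project = {}
    print(ExamplePipeline(project).execute())
    print(project["report"])
```

Each step only sees the shared project object, so later steps can rely on what earlier steps stored there; that is why the declared order matters.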

Tip

Refer to Custom Pipelines for details about adding custom pipelines to ScanCode.io.

Note

You can assign one or more pipelines to a project as a sequence.

Pipes

Pipelines are built from a group of reusable operations, called Pipes, that are chained together and executed in order. Pipes are the building blocks of a pipeline.

For example, the following steps are included in the RootFS pipeline, and they use pipes to accomplish predefined tasks:

from scanpipe.pipelines import Pipeline
from scanpipe.pipes import flag
from scanpipe.pipes import rootfs
from scanpipe.pipes import scancode

class RootFS(Pipeline):
    [...]

    def flag_empty_files(self):
        """
        Flags empty files.
        """
        flag.flag_empty_files(self.project)

    def scan_for_application_packages(self):
        """
        Scans unknown resources for packages information.
        """
        scancode.scan_for_application_packages(self.project)

Note

All built-in pipes are located in the scanpipe.pipes module. Pipes are grouped by type in modules, e.g. codebase, input, output, scancode.

Refer to our Pipes section for information about available pipes and their usage.

Codebase Resources

A project's Codebase Resources are records of its code files and directories. CodebaseResource is a database model and each record is identified by its path under the project workspace.

The following are some of the CodebaseResource attributes:

  • A status, which is used to track the analysis status for this resource.

  • A type, such as a file, a directory, or a symlink.

  • Various attributes to track detected copyrights, license expressions, copyright holders, and related packages.
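As a rough illustration of records identified by path and carrying a type and a status, the sketch below walks a directory tree with the standard library. ResourceRecord, resource_type, and collect_resources are hypothetical names; the real CodebaseResource is a database model with many more attributes.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class ResourceRecord:
    """Hypothetical, simplified stand-in for a CodebaseResource record."""

    path: str         # path relative to the codebase root
    type: str         # "file", "directory" or "symlink"
    status: str = ""  # analysis status; empty until some step sets it


def resource_type(location):
    """Classify a filesystem entry as a symlink, a directory, or a file."""
    # Check for symlinks first because is_dir() follows symlinks.
    if location.is_symlink():
        return "symlink"
    if location.is_dir():
        return "directory"
    return "file"


def collect_resources(codebase_root):
    """Return one record per entry found under `codebase_root`, sorted by path."""
    root = Path(codebase_root)
    return [
        ResourceRecord(path=p.relative_to(root).as_posix(), type=resource_type(p))
        for p in sorted(root.rglob("*"))
    ]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "src").mkdir()
        (root / "src" / "main.py").write_text("print('hello')\n")
        for record in collect_resources(root):
            print(record)
```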

Note

Please note that ScanCode-toolkit uses the same attributes and attribute names for files.

Discovered Packages

A project's Discovered Packages are records of the system and application packages discovered in the code under analysis. DiscoveredPackage is a database model and each record is identified by its Package URL. Package URL is a fundamental effort to create informative identifiers for software packages, such as Debian, RPM, npm, Maven, or PyPI packages. See https://github.com/package-url for more details.
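A Package URL has the general form pkg:type/namespace/name@version, optionally followed by qualifiers and a subpath. The sketch below splits a simple Package URL into those parts. It is a deliberately simplified illustration: parse_purl is a hypothetical helper, it drops qualifiers and subpaths, and it does not decode percent-encoded characters, so a dedicated Package URL library is a better fit for real use.

```python
def parse_purl(purl):
    """Split a simple Package URL into its main components.

    Simplified sketch: qualifiers ("?...") and subpath ("#...") are dropped,
    and percent-encoded characters are not decoded.
    """
    scheme, sep, remainder = purl.partition(":")
    if not sep or scheme != "pkg":
        raise ValueError(f"Not a Package URL: {purl!r}")

    # Discard the optional subpath and qualifiers, then stray slashes.
    remainder = remainder.split("#", 1)[0].split("?", 1)[0].strip("/")

    package_type, sep, path = remainder.partition("/")
    if not sep or not path:
        raise ValueError(f"Missing package name in: {purl!r}")

    # The version, when present, follows the last "@".
    version = None
    if "@" in path:
        path, _, version = path.rpartition("@")

    # Whatever precedes the last "/" is the optional namespace.
    namespace, _, name = path.rpartition("/")
    return {
        "type": package_type.lower(),
        "namespace": namespace or None,
        "name": name,
        "version": version,
    }


if __name__ == "__main__":
    print(parse_purl("pkg:pypi/django@1.11.1"))
    print(parse_purl("pkg:deb/debian/curl@7.50.3-1?arch=i386"))
```

The type, name, and version keys returned here correspond to the Package URL attributes listed for DiscoveredPackage.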

The following are some of the DiscoveredPackage attributes:

  • A type, name, version (all Package URL attributes)

  • A homepage_url, download_url, and other URLs

  • Checksums, such as SHA1 and MD5

  • Copyright, license_expression, and declared_license

Note

Please note that ScanCode-toolkit uses the same attributes and attribute names for packages.