Differences between revisions 11 and 14 (spanning 3 versions)
Revision 11 as of 2017-01-31 13:59:00
Size: 5992
Comment: Add section Other copyright files and license-related tools
Revision 14 as of 2017-02-01 16:11:58
Size: 7298
Comment: Add example use of licensecheck. Improve description and example use of licensecheck2dep5.
Line 1: Line 1:
Reviewing upstream packages to write debian/copyright files is tedious but important manual work. It is done during initial packaging and after every new upstream release. ## Note on editing: Please use semantic newlines, to ease readability (also of emailed diffs).

Reviewing upstream packages to write debian/copyright files is tedious but important manual work.
It is done during initial packaging and after every new upstream release.
Line 9: Line 12:
 * `licensecheck` from DebianPackage:licensecheck (and older versions of DebianPackage:devscripts) can scan source code and report found copyright holders and known licenses. Its approach is to detect licenses with a dataset (medium:~200 regexes) of regex patterns and key phrases (parts) and to reassemble these in detected licenses based on rules. In that sense this is somewhat similar to the combined approaches of Fossology/nomos and Ninka (see below for these tools). It also detects copyright statements. It output results in plain text (with customizable delimiter) or a Debian copyright file format. This is a command line tool written in Perl. == license-check ==
Line 11: Line 14:
`licensecheck` from DebianPackage:licensecheck (and older versions of DebianPackage:devscripts) can scan source code
and report found copyright holders and known licenses.
Its approach is to detect licenses with a dataset (medium:~200 regexes) of regex patterns and key phrases (parts)
and to reassemble these in detected licenses based on rules.
In that sense this is somewhat similar to the combined approaches of Fossology/nomos and Ninka (see below for these tools).
It also detects copyright statements.
It output results in plain text (with customizable delimiter) or a Debian copyright file format.
This is a command line tool written in Perl.
Line 12: Line 23:
 * `scan-copyrights` from DebianPackage:libconfig-model-dpkg-perl can update an existing copyright file from rescanning the source. It can also create one from scratch. It uses `licensecheck`. {{{
licensecheck --check '.*' --recursive --deb-machine --lines 0 *
}}}
Line 14: Line 27:
 * Config::Model can update Debian copyright files using the `cme` command (from DebianPackage:cme or DebianPackage:libconfig-model-dpkg-perl less than 2.063): == scan-copyrights ==

`scan-copyrights` from DebianPackage:libconfig-model-dpkg-perl can update an existing copyright file from rescanning the source.
It can also create one from scratch.
It uses `licensecheck`.

== cme ==

Config::Model can update Debian copyright files using the `cme` command
(from DebianPackage:cme or DebianPackage:libconfig-model-dpkg-perl less than 2.063):
Line 20: Line 42:
 * A script from DebianPackage:cdbs can generate a copyright file using `licensecheck`: == licensecheck2dep5 ==

A script from DebianPackage:cdbs can create a copyright file by tidying output from `licensecheck`:
Line 23: Line 47:
licensecheck --copyright -r `find * -type f` | \
 
/usr/lib/cdbs/licensecheck2dep5 > debian/copyright.auto
licensecheck --check '.*' --recursive --copyright --deb-fmt --lines 0 * | /usr/lib/cdbs/licensecheck2dep5
Line 27: Line 50:
 * `license-reconcile` compares the existing copyright with the source code and reports discrepancies. == license-reconsile ==
Line 29: Line 52:
 * `debmake -k` also compares the existing copyright with the source code and reports discrepancies. `license-reconcile` compares the existing copyright with the source code and reports discrepancies.
Line 31: Line 54:
 * `debmake -cc` generates a new copyright file from the source code. == debmake ==
Line 33: Line 56:
 * `decopy` generates debian/copyright files. `debmake -k` also compares the existing copyright with the source code and reports discrepancies.
Line 35: Line 58:
 * `licensee` from DebianPackage:ruby-licensee checks LICENSE files and returns known license names. This is the [[https://github.com/benbalter/licensee| tool used by Github]] to provide a summary license indication on a repository main page. Its approach is to search for typical LICENSE file names or some package manifest (NPM, Bower, Gemfile, etc) and perform an exact or approximate license text matching against the set of common licenses texts as published at [[https://choosealicense.com]] (small: ~20). It output results in YAML format. This is a command line tool written in Ruby. `debmake -cc` generates a new copyright file from the source code.
Line 37: Line 60:
 * [[https://www.fossology.org/|FOSSology]] is a open source license compliance software system and toolkit that [[https://debconf16.debconf.org/talks/100/|can]] (in version 3.1) generate DEP5 copyright files. Its approach is to detect licenses with a either large (large:~6000 regexes) dataset of regex patterns (nomos) or a full string comparison against license full texts (large: ~400 text) (monk). It also detects copyright statements and does also integrate with Ninka (see below). This is a complete database-backed web application with some command line support written in C/C++ with a PHP frontend. == decopy ==
Line 39: Line 62:
 * [[https://github.com/pivotal/LicenseFinder|LicenseFinder]] is a tool that "Find licenses for your project's dependencies." It does so by running application-specific package management tools and detecting package manifests to collect license-related metadata (e.g. Gemfile, etc) and detect licensing using regex against a set of common license texts (small: ~20). It output results in CSV, HTML and other report format. This is a command line tool written in Ruby. [[https://anonscm.debian.org/git/collab-maint/decopy.git/|decopy]] is a tool that "automates creating and updating the debian/copyright files."
It also "aims to detects as many licenses as possible" which makes it a tool for license detection too.
It uses `python-debian` to handle Debian machine readable copyright files.
Its approach to detect licenses is the same as `license-checker`.
This is a command line tool written in Python.
Line 41: Line 68:
 * [[https://github.com/dmgerman/ninka|Ninka]] is a "license identification tool for Source Code". Its approach is to detect licenses from text sentences using a dataset of key license sentences (large: ~600) and assemble the results based on the matched sentences. It output results in CSV format. This is a command line tool written in Perl. == licensee ==
Line 43: Line 70:
 * [[https://github.com/nexB/scancode-toolkit/|ScanCode]] is a tool "to scan code and detect licenses, copyrights and more". Its approach is to detect licenses using a dataset of plain license texts (large:~1000 texts) and plain text notices (large:~2500 notices and mentions) and finds exact and approximate matches in source and binaries using full text alignments. It also detects copyright statements and collect license metadata from package manifests (e.g Maven, Pypi, etc.). It output results in JSON, HTML or SPDX format. This is a command line tool written in Python. `licensee` from DebianPackage:ruby-licensee checks LICENSE files and returns known license names.
This is the [[https://github.com/benbalter/licensee| tool used by Github]] to provide a summary license indication on a repository main page.
Its approach is to search for typical LICENSE file names or some package manifest (NPM, Bower, Gemfile, etc)
and perform an exact or approximate license text matching against the set of common licenses texts
as published at [[https://choosealicense.com]] (small: ~20).
It output results in YAML format.
This is a command line tool written in Ruby.

== fossology ==

[[https://www.fossology.org/|FOSSology]] is a open source license compliance software system and toolkit
that [[https://debconf16.debconf.org/talks/100/|can]] (in version 3.1) generate DEP5 copyright files.
Its approach is to detect licenses with a either large (large:~6000 regexes) dataset of regex patterns (nomos)
or a full string comparison against license full texts (large: ~400 text) (monk).
It also detects copyright statements and does also integrate with Ninka (see below).
This is a complete database-backed web application with some command line support written in C/C++ with a PHP frontend.

== license_finder ==

[[https://github.com/pivotal/LicenseFinder|LicenseFinder]] is a tool that "Find licenses for your project's dependencies."
It does so by running application-specific package management tools
and detecting package manifests to collect license-related metadata (e.g. Gemfile, etc)
and detect licensing using regex against a set of common license texts (small: ~20).
It output results in CSV, HTML and other report format.
This is a command line tool written in Ruby.

== ninka ==

[[https://github.com/dmgerman/ninka|Ninka]] is a "license identification tool for Source Code".
Its approach is to detect licenses from text sentences using a dataset of key license sentences (large: ~600)
and assemble the results based on the matched sentences.
It output results in CSV format.
This is a command line tool written in Perl.

== scancode ==

[[https://github.com/nexB/scancode-toolkit/|ScanCode]] is a tool "to scan code and detect licenses, copyrights and more".
Its approach is to detect licenses using a dataset of plain license texts (large:~1000 texts)
and plain text notices (large:~2500 notices and mentions)
and finds exact and approximate matches in source and binaries using full text alignments.
It also detects copyright statements and collect license metadata from package manifests (e.g Maven, Pypi, etc.).
It output results in JSON, HTML or SPDX format.
This is a command line tool written in Python.
Line 50: Line 119:
  * [[dlt|https://github.com/agustinhenze/dlt/]] has support for parsing and creating copyright files.
  * [decopy|https://anonscm.debian.org/git/collab-maint/decopy.git/]] is a tool that "automates creating and updating the debian/copyright files." and also "decopy aims to detects as many licenses as possible" which would make it a tool for license detection too. It uses `python-debian` to handle copyright files.
  * [[Debian packaging tools|https://github.com/xolox/python-deb-pkg-tools]] is "a collection of functions to work with Debian packages and repositories. It uses `python-debian` to handle copyright files.
  * [[https://github.com/agustinhenze/dlt/|dlt]] has support for parsing and creating Debian machine readable copyright files.
  * [[https://github.com/xolox/python-deb-pkg-tools|Debian packaging tools]] is "a collection of functions to work with Debian packages and repositories. It uses `python-debian` to handle Debian machine readable copyright files.
Line 54: Line 122:
 * In Java:
  * [[https://forge.ow2.org/projects/oslcv3/|OSLCv3]] Open Source License Checker 3.0 is a "risk management tool for analyzing open source software licenses." It detects licenses using key sentences and diffs using a dataset of license texts (small: ~50). It is developed in Java and seems no longer under development since 2009.
  * [[https://github.com/whitesource/jninka/|jninka]] is a port from Perl to Java of `ninka`.
  * [[https://github.com/apache/creadur-rat/| Apache Creadur rat]] is a "tool to improve accuracy and efficiency when checking releases." . It's goal is to help Apache Foundation projects to comply with the release policy including detecting licenses. Its approach is to use a key sentences dataset (small: ~20).

Reviewing upstream packages to write debian/copyright files is tedious but important manual work. It is done during initial packaging and after every new upstream release.

Making initial copyright file construction and subsequent review/update easier will improve Debian's software quality.

Starting with Stretch (Debian 9), the tools available to help with this work are significantly improved over those in previous releases.

Note that some of the tools listed here are run by check-all-the-things -f copyright.

licensecheck

licensecheck, from the licensecheck package (and older versions of devscripts), can scan source code and report the copyright holders and known licenses it finds. Its approach is to detect licenses with a dataset of regex patterns and key phrases (parts) (medium: ~200 regexes) and to reassemble these into detected licenses based on rules. In that sense it is somewhat similar to the combined approaches of FOSSology/nomos and Ninka (see below for these tools). It also detects copyright statements. It outputs results in plain text (with a customizable delimiter) or in Debian copyright file format. This is a command line tool written in Perl.

licensecheck --check '.*' --recursive --deb-machine --lines 0 *
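
The pattern-and-rule approach described above can be illustrated with a minimal Python sketch. This is not licensecheck's actual code or dataset: the `PARTS` patterns, the `RULES` table and the `detect_license` function are hypothetical names invented for this example, and a real dataset is far larger.

```python
import re

# Hypothetical "parts": each regex recognises one fragment of a license notice.
PARTS = {
    "gpl": re.compile(r"GNU General Public License", re.I),
    "v2": re.compile(r"version 2\b", re.I),
    "or_later": re.compile(r"or \(at your option\) any later version", re.I),
    "mit_grant": re.compile(r"Permission is hereby granted, free of charge", re.I),
}

# Hypothetical rules, ordered so that the most specific combination wins.
RULES = [
    ({"gpl", "v2", "or_later"}, "GPL-2+"),
    ({"gpl", "v2"}, "GPL-2"),
    ({"gpl"}, "GPL"),
    ({"mit_grant"}, "Expat"),
]


def detect_license(text):
    """Return a license label assembled from the parts found in text."""
    # Collapse whitespace so phrases wrapped across lines still match.
    # Comment markers such as "*" or "#" are not stripped in this sketch.
    flat = re.sub(r"\s+", " ", text)
    found = {name for name, rx in PARTS.items() if rx.search(flat)}
    for required, label in RULES:
        if required <= found:
            return label
    return "UNKNOWN"
```

For instance, a header that mentions the GNU General Public License, "version 2" and the "any later version" clause is labelled `GPL-2+`, while text matching no part is reported as `UNKNOWN`.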

scan-copyrights

scan-copyrights, from libconfig-model-dpkg-perl, can update an existing copyright file by rescanning the source. It can also create one from scratch. It uses licensecheck.

cme

Config::Model can update Debian copyright files using the cme command (from the cme package, or from versions of libconfig-model-dpkg-perl earlier than 2.063):

cme update dpkg-copyright

licensecheck2dep5

A script from cdbs can create a copyright file by tidying output from licensecheck:

licensecheck --check '.*' --recursive --copyright --deb-fmt --lines 0 * | /usr/lib/cdbs/licensecheck2dep5
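
As a rough illustration of what "tidying" scanner output into a copyright file involves, the Python sketch below groups files that share a license and copyright holder into `Files` paragraphs. It is not the licensecheck2dep5 script: the `to_dep5` function and its input tuples are hypothetical, and parsing of real licensecheck output is deliberately left out.

```python
from collections import defaultdict


def to_dep5(records):
    """Group (path, license, holder) tuples into machine-readable paragraphs.

    The tuple layout is an assumption made for this sketch; it is not the
    output format of any real tool.
    """
    groups = defaultdict(list)
    for path, license_name, holder in records:
        groups[(license_name, holder)].append(path)

    stanzas = []
    for (license_name, holder), paths in sorted(groups.items()):
        paths.sort()
        # The first path goes on the "Files:" line, the rest on continuation lines.
        lines = ["Files: " + paths[0]]
        lines += [" " + p for p in paths[1:]]
        lines.append("Copyright: " + holder)
        lines.append("License: " + license_name)
        stanzas.append("\n".join(lines))
    return "\n\n".join(stanzas) + "\n"
```

The result still needs manual review: a header paragraph, full license texts and any corrections to misdetected files have to be added by hand.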

license-reconcile

license-reconcile compares the existing copyright with the source code and reports discrepancies.
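
The comparison can be pictured with a short Python sketch. It assumes the copyright file has already been parsed into a list of (patterns, license) pairs and that a scanner has produced a path-to-license mapping; `declared_license` and `reconcile` are hypothetical helpers, not part of license-reconcile. The "last matching paragraph wins" behaviour follows the machine-readable copyright format.

```python
from fnmatch import fnmatchcase


def declared_license(path, stanzas):
    """Return the license declared for path; the last matching paragraph wins."""
    result = None
    for patterns, license_name in stanzas:
        if any(fnmatchcase(path, pattern) for pattern in patterns):
            result = license_name
    return result


def reconcile(stanzas, detected):
    """List (path, declared, detected) for every file where the two disagree."""
    problems = []
    for path, found in sorted(detected.items()):
        declared = declared_license(path, stanzas)
        if declared != found:
            problems.append((path, declared, found))
    return problems
```

A file covered only by a catch-all `*` paragraph but detected under a different license would show up in the returned list, which is the kind of discrepancy a reviewer then has to resolve by hand.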

debmake

debmake -k also compares the existing copyright with the source code and reports discrepancies.

debmake -cc generates a new copyright file from the source code.

decopy

decopy is a tool that "automates creating and updating the debian/copyright files." It also "aims to detects as many licenses as possible", which makes it a tool for license detection too. It uses python-debian to handle Debian machine readable copyright files. Its approach to detecting licenses is the same as that of licensecheck. This is a command line tool written in Python.

licensee

licensee, from ruby-licensee, checks LICENSE files and returns known license names. This is the tool used by GitHub to provide a summary license indication on a repository's main page. Its approach is to search for typical LICENSE file names or a package manifest (NPM, Bower, Gemfile, etc.) and to perform exact or approximate license text matching against the set of common license texts published at https://choosealicense.com (small: ~20). It outputs results in YAML format. This is a command line tool written in Ruby.
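
Exact-or-approximate matching against a small set of reference texts can be sketched in Python as follows. This is only an illustration and does not reproduce licensee's own similarity measure: it substitutes `difflib.SequenceMatcher` from the Python standard library, and the `normalize` and `best_match` helpers as well as the 0.9 threshold are assumptions made for this example.

```python
import difflib
import re


def normalize(text):
    """Lower-case the text and drop punctuation and layout differences."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def best_match(candidate, known, threshold=0.9):
    """Return (name, score) of the closest known license, or (None, score)."""
    cand = normalize(candidate)
    best_name, best_score = None, 0.0
    for name, body in known.items():
        reference = normalize(body)
        if cand == reference:
            # Exact match after normalisation: no need to look further.
            return name, 1.0
        score = difflib.SequenceMatcher(None, cand, reference).ratio()
        if score > best_score:
            best_name, best_score = name, score
    if best_score >= threshold:
        return best_name, best_score
    return None, best_score
```

With a small dictionary of reference texts, a LICENSE file that differs only in case, whitespace or a changed word still resolves to the right name, while unrelated text falls below the threshold and yields `None`.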

fossology

FOSSology is an open source license compliance software system and toolkit that can (in version 3.1) generate DEP5 copyright files. Its approach is to detect licenses either with a large dataset of regex patterns (nomos; large: ~6000 regexes) or with a full string comparison against full license texts (monk; large: ~400 texts). It also detects copyright statements and integrates with Ninka (see below). This is a complete database-backed web application with some command line support, written in C/C++ with a PHP frontend.

license_finder

LicenseFinder is a tool that "Find licenses for your project's dependencies." It does so by running application-specific package management tools, detecting package manifests (e.g. Gemfile) to collect license-related metadata, and detecting licensing using regexes against a set of common license texts (small: ~20). It outputs results in CSV, HTML and other report formats. This is a command line tool written in Ruby.

ninka

Ninka is a "license identification tool for Source Code". Its approach is to detect licenses from text sentences using a dataset of key license sentences (large: ~600) and to assemble the results based on the matched sentences. It outputs results in CSV format. This is a command line tool written in Perl.

scancode

ScanCode is a tool "to scan code and detect licenses, copyrights and more". Its approach is to detect licenses using a dataset of plain license texts (large: ~1000 texts) and plain text notices (large: ~2500 notices and mentions), finding exact and approximate matches in source and binaries using full text alignments. It also detects copyright statements and collects license metadata from package manifests (e.g. Maven, PyPI, etc.). It outputs results in JSON, HTML or SPDX format. This is a command line tool written in Python.

Other copyright files and license-related tools

  • In Python:
    • python-debian has support for parsing and creating copyright files (and any Debian-style files such as description, control, Sources, Packages, etc.)

    • dlt has support for parsing and creating Debian machine readable copyright files.

    • Debian packaging tools is "a collection of functions to work with Debian packages and repositories." It uses python-debian to handle Debian machine readable copyright files.

  • In Java:
    • OSLCv3 (Open Source License Checker 3.0) is a "risk management tool for analyzing open source software licenses." It detects licenses using key sentences and diffs against a dataset of license texts (small: ~50). It is developed in Java and appears not to have been developed further since 2009.

    • jninka is a port from Perl to Java of ninka.

    • Apache Creadur rat is a "tool to improve accuracy and efficiency when checking releases." Its goal is to help Apache Software Foundation projects comply with the release policy, including by detecting licenses. Its approach is to use a key sentences dataset (small: ~20).

See also