Command-line tools in Debian
Reviewing upstream packages to write debian/copyright files is tedious but important manual work. It is done during initial packaging and after every new upstream release.
Making initial copyright file construction and subsequent review/update easier will improve Debian's software quality.
Starting with Stretch (Debian 9), significantly improved tools are available to help with this.
licensecheck, from the licensecheck package (and older versions of devscripts), can scan source code and report the copyright holders and known licenses it finds. Its approach is to detect licenses with a dataset (medium: ~200 regexes) of regex patterns and key phrases (parts) and to reassemble these into detected licenses based on rules. In that sense it is somewhat similar to the combined approaches of Fossology/nomos and Ninka (see below for these tools). It also detects copyright statements. It outputs results in plain text (with a customizable delimiter) or in Debian copyright file format. Written in Perl.
licensecheck --check '.*' --recursive --deb-machine --lines 0 *
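The key-phrase approach described above can be illustrated with a minimal Python sketch. This is not licensecheck's actual code or dataset: the phrase table, the rule order, and the names `PHRASES`, `detect_license` and `find_copyrights` are hypothetical and chosen only to show how matched parts can be reassembled into a license name.

```python
import re

# Hypothetical key phrases ("parts"); a real tool ships a far larger dataset.
PHRASES = {
    "gpl": re.compile(r"GNU General Public License", re.I),
    "lgpl": re.compile(r"GNU Lesser General Public License", re.I),
    "v2": re.compile(r"version 2(?![.\d])", re.I),
    "v3": re.compile(r"version 3(?![.\d])", re.I),
    "or_later": re.compile(r"or \(at your option\) any later version", re.I),
    "mit": re.compile(r"Permission is hereby granted, free of charge", re.I),
}

# Simplified copyright statement: "Copyright (c) YEAR[-YEAR] HOLDER".
COPYRIGHT = re.compile(
    r"Copyright\s+(?:\(c\)\s+)?(\d{4}(?:-\d{4})?)\s+(.+)", re.I
)


def detect_license(text):
    """Match key phrases, then reassemble them into a license name by rules."""
    # Collapse whitespace so phrases wrapped across lines still match.
    flat = " ".join(text.split())
    found = {name for name, pattern in PHRASES.items() if pattern.search(flat)}
    if "mit" in found:
        return "Expat"
    for family, label in (("lgpl", "LGPL"), ("gpl", "GPL")):
        if family in found:
            version = "-2" if "v2" in found else "-3" if "v3" in found else ""
            suffix = "+" if version and "or_later" in found else ""
            return label + version + suffix
    return "UNKNOWN"


def find_copyrights(text):
    """Return (years, holder) pairs for each copyright statement found."""
    return [(m.group(1), m.group(2).strip()) for m in COPYRIGHT.finditer(text)]
```

Calling `detect_license` on a file header that carries the standard GPL notice with "either version 2 of the License, or (at your option) any later version" yields a short name built from three separate matches. The sketch does not strip comment markers, so phrases interrupted by ` * ` or `#` prefixes would be missed.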
cme update dpkg-copyright
Usage is detailed in the Config::Model wiki.
licensecheck --check '.*' --recursive --copyright --deb-fmt --lines 0 * | /usr/lib/cdbs/licensecheck2dep5
A script from cdbs can extract structured metadata embedded in binary content, for subsequent parsing by #licensecheck and suffix stripping by #licensecheck2dep5. Written in Perl, using Image::ExifTool and Font::TTF.
find -type f -name '*.png' -print0 | perl -0 /usr/lib/cdbs/license-miner
licensecheck --check '.*' --ignore '.+\.png$' --recursive --copyright --deb-fmt --lines 0 * | /usr/lib/cdbs/licensecheck2dep5
find -type f -name '*.png.metadata' -delete
A makefile from cdbs can automate selection, mining, parsing, and cleanup. It compares the previously autogenerated file debian/copyright_hints, shipped with the source package, against a freshly autogenerated instance and warns about newly introduced (but not disappearing) changes to discovered hints. It uses #license-miner, #licensecheck and #licensecheck2dep5 under the hood. Written in make.
Typical use is to ship a package-specific script debian/copyright-check with the source package and to execute that script manually (not as part of the normal build) when sources change:
export DEB_COPYRIGHT_EXTRACT_EXTS="icc pdf png ttf"
export DEB_COPYRIGHT_EXTRACT_PATHS_EXIF="Resource/Font/"
export DEB_COPYRIGHT_CHECK_IGNORE_EXTS="cat ico xls pcl xps"
export DEB_COPYRIGHT_CHECK_IGNORE_PATHS="doc/.*\.htm"
export DEB_COPYRIGHT_CHECK_MERGE_SAME_LICENSE=yes
make -f /usr/share/cdbs/1/rules/utils.mk pre-build || true
make -f /usr/share/cdbs/1/rules/utils.mk clean DEB_COPYRIGHT_CHECK_STRICT=1
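The "warn about newly introduced, but not disappearing, hints" behaviour can be sketched in Python as a one-directional comparison of two hint files. This is only an illustration of the idea, not how the cdbs makefile is implemented; the paragraph-level comparison and the names `paragraphs` and `new_hints` are assumptions.

```python
import re


def paragraphs(text):
    """Split hints into blank-line separated paragraphs, dropping empty ones."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def new_hints(previous, fresh):
    """Return paragraphs that appear in the fresh hints but not in the old ones.

    Paragraphs that vanished from the fresh hints are deliberately ignored.
    """
    known = set(paragraphs(previous))
    return [p for p in paragraphs(fresh) if p not in known]
```

Passing the contents of the shipped debian/copyright_hints as `previous` and a freshly generated instance as `fresh` returns only the additions that a maintainer would need to review.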
license-reconcile compares the existing copyright with the source code and reports discrepancies. Written in Perl, using licensecheck.
debmake -k also compares the existing copyright with the source code and reports discrepancies.
debmake -cc generates a new copyright file from the source code.
decopy is a tool that "automates creating and updating the debian/copyright files." It also "aims to detects as many licenses as possible" which makes it a tool for license detection too. It uses python-debian to handle Debian machine readable copyright files. Its approach to detect licenses is the same as license-checker. Written in Python, using python-debian.
licensee, from ruby-licensee, checks LICENSE files and returns known license names. This is the tool used by GitHub to provide a summary license indication on a repository's main page. Its approach is to search for typical LICENSE file names or some package manifest (NPM, Bower, Gemfile, etc.) and perform an exact or approximate license text matching against the set of common license texts as published at https://choosealicense.com (small: ~20). It outputs results in YAML format. Written in Ruby.
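Approximate license text matching of this kind can be sketched in Python with a word-set similarity score. The Sørensen-Dice coefficient and the 0.9 threshold below are choices made for this illustration, not something this page states about licensee; the names `words`, `dice` and `best_match` are hypothetical.

```python
import re


def words(text):
    """Normalize a text to its set of lowercase alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def dice(a, b):
    """Sørensen-Dice similarity between the word sets of two texts (0.0 to 1.0)."""
    wa, wb = words(a), words(b)
    if not wa and not wb:
        return 1.0
    return 2 * len(wa & wb) / (len(wa) + len(wb))


def best_match(candidate, known, threshold=0.9):
    """Return (name, score) of the closest known license text.

    `known` maps license names to reference texts and must not be empty.
    The name is None when the best score is below the threshold.
    """
    name, score = max(
        ((n, dice(candidate, text)) for n, text in known.items()),
        key=lambda pair: pair[1],
    )
    return (name, score) if score >= threshold else (None, score)
```

Because the comparison works on normalized word sets, differences in case, punctuation and line wrapping do not lower the score, while a file with unrelated wording falls below the threshold and is reported as unmatched.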
Wrapper for some of the other tools listed here.
check-all-the-things -f copyright
Automated license checking for Rust. cargo lichking is a Cargo subcommand that checks licensing information for dependencies, based on David A. Wheeler's compatibility graph.
cargo lichking check
Libraries in Debian
python-debian supports parsing and creating copyright files (and any Debian-style files such as description, control, Sources, Packages, etc.). Written in Python.
Command-line tools not in Debian
LicenseFinder is a tool that "Find licenses for your project's dependencies." It does so by running application-specific package management tools and detecting package manifests (e.g. Gemfile, etc.) to collect license-related metadata, and it detects licensing using regexes against a set of common license texts (small: ~20). It outputs results in CSV, HTML and other report formats. Written in Ruby.
licensed has been recently released by GitHub to check the licenses of the dependencies of a project. Modern language package managers (bower, bundler, cabal, go, npm, stack) are used to pull the dependency chain of a specific project. Licenses can be configured to be either accepted or rejected, easing the developer's task of identifying problematic dependencies when importing a new third-party library.
ScanCode is a tool "to scan code and detect licenses, copyrights and more". Its approach is to detect licenses using a dataset of plain license texts (large: ~1,500 texts) and plain text notices (large: ~15,000 notices and mentions) and to find exact and approximate matches in source and binaries using full text alignments. It can also return the exact matched text. It also detects copyright statements and collects license metadata from package manifests (e.g. Maven, PyPI, etc.). It outputs results in JSON, HTML or SPDX format. Written in Python.
Apache Creadur rat is a "tool to improve accuracy and efficiency when checking releases." Its goal is to help Apache Foundation projects comply with the release policy, including detecting licenses. Its approach is to use a key sentences dataset (small: ~20). Written in Java.
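The accept/reject idea can be sketched in Python as a simple policy check over a dependency list. This does not reproduce licensed's configuration format or behaviour; the function `check_dependencies`, its arguments and the three report buckets are assumptions made for illustration.

```python
def check_dependencies(deps, allowed, denied):
    """Sort dependencies into accepted, rejected and needs_review buckets.

    `deps` maps dependency names to license identifiers; `allowed` and
    `denied` are sets of license identifiers from the project's policy.
    """
    report = {"accepted": [], "rejected": [], "needs_review": []}
    for name, license_id in sorted(deps.items()):
        if license_id in denied:
            report["rejected"].append(name)
        elif license_id in allowed:
            report["accepted"].append(name)
        else:
            # Neither explicitly accepted nor rejected: a human must decide.
            report["needs_review"].append(name)
    return report
```

Keeping a separate review bucket means an unknown or newly seen license is surfaced for a decision rather than silently passing or failing the check.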
Other tools that need further detailing and review
daald/dpkg-licenses "A command line tool which lists the licenses of all installed packages in a Debian-based system (like Ubuntu)". Written in Shell script.
fossology/atarashi "Atarashi scans for license statements in open source software, focusing on text statistics. Designed to work stand-alone and with FOSSology". Written in Python.
codeauroraforum/lid "License Identifier. The purpose of this program, license_identifier, is to scan the source code files and identify the license text region and the type of license.". Written in Python.
google/licenseclassifier "A License Classifier". Written in Go.
google/licensecheck "The licensecheck package classifies license files and heuristically determines how well they correspond to known open source licenses". Written in Go.
src-d/go-license-detector "Reliable project licenses detector." Written in Go.
google/go-licenses "Reports on the licenses used by a Go package and its dependencies". Written in Go.
jfrog/go-license-discovery "A go library for matching text against known OSS licenses". Written in Go.
boyter/lc "licensechecker (lc) a command line application which scans directories and identifies what software license things are under producing reports as either SPDX, CSV, JSON, XLSX or CLI Tabular output". Written in Go.
nexB/debut "A python library to parse Debian deb822-style control and copyright files". Written in Python.
FOSSology is an open source license compliance software system and toolkit that can (in version 3.1) generate DEP5 copyright files. Its approach is to detect licenses with either a large dataset of regex patterns (nomos, large: ~2500 regexes) or a full string comparison against license full texts (monk, large: ~400 texts). It also detects copyright statements and integrates with Ninka (see below). This is a complete database-backed web application with some command line support, written in C/C++ with a PHP frontend.
OSLCv3 Open Source License Checker 3.0 is a "risk management tool for analyzing open source software licenses." It detects licenses using key sentences and diffs using a dataset of license texts (small: ~50). It is developed in Java and appears to have been inactive since 2009.
Ninka is a "license identification tool for Source Code". Its approach is to detect licenses from text sentences using a dataset of key license sentences (large: ~600) and assemble the results based on the matched sentences. It outputs results in CSV format. Written in Perl. Unmaintained since 2017.
jninka is a port from Perl to Java of ninka. Written in Java. Unmaintained/retired project.
gerv/slic "Speedy LIcense Checker and associated tools". No longer maintained since the death of its author.
dlt has support for parsing and creating Debian machine readable copyright files. Written in Python. Unmaintained/retired project.
Updating debian copyright file with cme by Dominique Dumont
Creating, updating and checking debian/copyright semi-automatically by Petter Reinholdtsen
Bachelor Thesis: A Comparison Study of Open Source License Crawlers (PDF) by Thomas Wolter
https://github.com/maxhbr/LicenseScannerComparison A comparison of license scanners.
https://osr.cs.fau.de/2019/08/07/final-thesis-a-comparison-study-of-open-source-license-crawler/ and https://web.archive.org/web/20200128142101/https://osr.cs.fau.de/wp-content/uploads/2019/08/wolter_2019.pdf A comparison of license scanners.
ClearlyDefined Massive license scanning (with scancode) and peer review for license clarity and correctness.