scancode-toolkit has a large number of dependencies that are currently missing from Debian. Ultimately I only need the code that detects copyrights and licenses, so that decopy can import it, rather than the entire framework with its many dependencies.
Ideally, I'd like a Python library that requires no initialization or configuration: I pass it the contents of a file and get back the license and copyright information:
from dataclasses import dataclass
from typing import List, Tuple, Union

# Placeholder aliases so the sketch runs; the real types are left open.
LicenseId = str
LicenseExpression = str
Year = int
Name = str

@dataclass
class FullLicense:
    license_id: LicenseId
    text: str

@dataclass
class Info:
    # For every field, offset information in the file
    # List of copyright holders
    copyright_holders: List[Tuple[Year, Name]]
    # License identifier (e.g. "GPL-2.0 or Apache-2.0-or-later")
    license_id: LicenseExpression
    # The license header and warranty statement as found in the file, verbatim
    reference: str

def detect_copyright_and_licenses(text: bytes) -> Union[List[Info], FullLicense]:
    """Reads the contents of a file and returns license info.

    For pure license files, just returns a FullLicense object;
    otherwise returns a list of Info objects.
    """
    raise NotImplementedError  # interface sketch only
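To make the wished-for return shape concrete, here is a toy sketch that fills a simplified Info from a file's bytes. It is an assumption-laden illustration, not how scancode detects anything: the names `detect`, `Info`, `_COPYRIGHT` and `_SPDX` are hypothetical, and it only recognizes a one-line "Copyright (c) YEAR NAME" statement and an SPDX-License-Identifier tag.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Info:
    # Hypothetical, simplified version of the Info class wished for above.
    copyright_holders: List[Tuple[str, str]]  # (year or year range, name)
    license_id: Optional[str]                 # raw expression text, if found


# Toy patterns (assumptions): real-world notices are far more varied.
_COPYRIGHT = re.compile(
    r"Copyright\s+(?:\(c\)\s+)?(\d{4}(?:-\d{4})?)\s+(.+)", re.IGNORECASE
)
_SPDX = re.compile(r"SPDX-License-Identifier:\s*(.+)")


def detect(text: bytes) -> Info:
    """Scan the file line by line; no setup or configuration needed."""
    holders: List[Tuple[str, str]] = []
    license_id: Optional[str] = None
    for line in text.decode("utf-8", errors="replace").splitlines():
        match = _COPYRIGHT.search(line)
        if match:
            holders.append((match.group(1), match.group(2).strip()))
        match = _SPDX.search(line)
        if match and license_id is None:
            license_id = match.group(1).strip()
    return Info(copyright_holders=holders, license_id=license_id)
```

Calling `detect(b"# Copyright (c) 2019 Jane Doe\n# SPDX-License-Identifier: Apache-2.0\n")` yields one holder and the expression text. Anything beyond this (license texts without SPDX tags, multi-line notices, offsets) is exactly what the real detection code would be needed for.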
Status
In progress/done
- license_expression (upload sponsored)
Ideally avoided, doesn't seem essential for license detection
- debut: necessary for Debian output and Debian package information detection (unnecessary for my use case)
- Philippe says: please explain then what your use case is
- spdx: only used for SPDX output
- packageurl: used for package information detection
- banal: doesn't actually appear to be imported? (but mentioned in requirements.txt)
- normality: doesn't actually appear to be imported? (but mentioned in requirements.txt)
- Philippe says: banal and normality are deps of fingerprints, used to create copyright summaries in summarycode (see below). The requirements.txt are from a pip freeze and are at full depth.
- urlpy: only used in two places, seems like it could be replaced with e.g. yarl (does canonicalization), purl, urlobject or urllib.parse?
- Philippe says: urlpy is used for its canonicalization capabilities (something other libs may not be able to do AFAIK)
- text_unidecode: replace with GPLed unidecode?
- Philippe says: this is a non-GPLed port of unidecode specifically used here so it is compatible with the Apache license.
- dparse: used for Python package information detection
- gemfileparser: used for Ruby package information detection
- jsonstreams: only used for JSON output
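On the urlpy item above: as a rough way to judge whether urllib.parse could stand in, here is a minimal sketch of a few steps commonly meant by URL canonicalization (lowercasing scheme and host, dropping default ports, collapsing dot segments, normalizing percent-encoding). The function name `canonicalize` and the `DEFAULT_PORTS` table are assumptions for illustration; whether this covers what scancode relies on urlpy for has not been checked.

```python
import posixpath
from urllib.parse import quote, unquote, urlsplit, urlunsplit

# Assumption: only these schemes have default ports worth stripping.
DEFAULT_PORTS = {"http": 80, "https": 443}


def canonicalize(url: str) -> str:
    """Return a normalized form of url using only the standard library.

    Note: userinfo (user:password@) is dropped by this sketch.
    """
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    if ":" in host:
        # urlsplit strips the brackets from IPv6 literals; put them back.
        host = f"[{host}]"
    netloc = host
    if parts.port is not None and DEFAULT_PORTS.get(scheme) != parts.port:
        netloc = f"{host}:{parts.port}"
    # Collapse "." and ".." segments, keeping a trailing slash if present.
    path = posixpath.normpath(parts.path) if parts.path else "/"
    if parts.path.endswith("/") and not path.endswith("/"):
        path += "/"
    # Decode then re-encode so percent-escapes are written consistently.
    path = quote(unquote(path), safe="/")
    return urlunsplit((scheme, netloc, path, parts.query, parts.fragment))
```

With this sketch, `canonicalize("HTTP://Example.COM:80/a/../b/./c?x=1")` gives `http://example.com/b/c?x=1`. Query-parameter ordering, IDNA hosts and other edge cases are left untouched, which may be where urlpy does more.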
Unsure
- fingerprints: used only in src/summarycode/copyright_summary.py
- commoncode: common library functions
- Philippe says: shared utilities pretty much used everywhere
- plugincode:
- Philippe says: base utility to define and load pluggy-based setuptools plugins. Removing that would imply a rewrite of most of the core
- typecode: ideally avoided, but perhaps too integrated with the rest of scancode?
- Philippe says: this is quite integrated and hard to remove
- typecode-libmagic
- Philippe says: this is an essential dep of typecode. If you use scancode-toolkit-mini you could instead use https://github.com/nexB/scancode-plugins/tree/develop/builtins/typecode_libmagic_system_provided and depend on a Debian libmagic package to get the .so and magicdb. This typecode_libmagic_system_provided plugin was created specifically to support inclusion in Debian.
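For the system-provided libmagic route mentioned above, a quick check of whether a machine has the shared library and a compiled magic database might look like the sketch below. It is independent of the typecode_libmagic_system_provided plugin, whose lookup logic is not shown here; the function name and the candidate database paths are assumptions that should be verified against the Debian libmagic packages.

```python
import ctypes.util
import os
from typing import Optional, Tuple

# Assumed locations of the compiled magic database on Debian; verify locally.
CANDIDATE_MAGIC_DBS = (
    "/usr/share/misc/magic.mgc",
    "/usr/lib/file/magic.mgc",
)


def find_system_libmagic() -> Tuple[Optional[str], Optional[str]]:
    """Return (library name, magic db path); each is None if not found."""
    # find_library returns a name such as "libmagic.so.1", or None.
    library = ctypes.util.find_library("magic")
    database = next(
        (path for path in CANDIDATE_MAGIC_DBS if os.path.exists(path)), None
    )
    return library, database
```

If either element comes back as None, the corresponding Debian package is probably not installed on that system.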
Dependencies of other things not packaged in Debian
- ordlookup: Philippe says: used for package information detection, dep of pefile used to analyze Windows PE binaries
- ftfy: Philippe says: used for package information detection, used to remove mojibake from some Windows PE metadata
- pymaven: Philippe says: used for package information detection, used to parse Java Maven POMs
- spdx_tools: Philippe says: used for SPDX output
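Several notes above hinge on whether a module is imported directly (e.g. "doesn't actually appear to be imported?"). A small sketch for checking that against a source checkout follows; the names `find_imports` and `who_imports` are hypothetical helpers. It only sees static, direct imports, so transitive dependencies (such as banal and normality being pulled in by fingerprints) and dynamic imports will not show up.

```python
import ast
from pathlib import Path
from typing import Dict, Iterable, List, Set


def find_imports(source: str) -> Set[str]:
    """Return the top-level module names imported by a Python source text."""
    names: Set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # level == 0 skips relative imports such as "from . import x".
            names.add(node.module.split(".")[0])
    return names


def who_imports(root: str, modules: Iterable[str]) -> Dict[str, List[str]]:
    """Map each module name to the .py files under root that import it."""
    wanted = set(modules)
    hits: Dict[str, List[str]] = {name: [] for name in wanted}
    for path in sorted(Path(root).rglob("*.py")):
        try:
            found = find_imports(path.read_text(encoding="utf-8", errors="replace"))
        except (SyntaxError, ValueError):
            continue  # skip files that are not parseable Python source
        for name in wanted & found:
            hits[name].append(str(path))
    return hits
```

For example, `who_imports("src", ["banal", "normality", "urlpy"])` run from a checkout would list the files that import each module directly; an empty list suggests the module is only an indirect dependency.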