https://manpages.debian.org/ is a service providing online manpages in HTML format for the public.
Debian manpages archive
Service Name: Debian manpages archive
Service URI: https://manpages.debian.org/
Service Short Description: Tool to search for man pages in the different Debian releases
Service Documentation: manpages.debian.org
Source Code: https://github.com/Debian/debiman
Service Status: active
Service hosting: debian.org
There are three known implementations of "man to web" archive generators.
The current codebase is a set of Perl and bash CGI scripts that dynamically generate (and search through) manpages.
The current codebase could be migrated to manziarly, provided we have access to the manpages group.
The current codebase extracts manpages with the dpkg --fsys-tarfile and tar commands. It also creates indexes using man -k for future searches. Manpages are stored in a directory for each package-version, so it doesn't garbage-collect manpages that have disappeared. It also appears that packages are always extracted, even if they have been parsed before.
The CGI script just calls man and outputs plain text wrapped in <PRE> tags.
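The rendering step described above can be sketched in a few lines. This is an illustration in Python, not the actual Perl/bash CGI code; the function names, the environment variables and the overstrike cleanup are assumptions added for the example.

```python
import html
import re
import subprocess


def strip_overstrike(text):
    """Drop backspace overstrikes (e.g. 'N\\bN' for bold) that some man
    implementations emit when formatting for a terminal."""
    return re.sub(r".\x08", "", text)


def wrap_pre(text):
    """Escape plain text and wrap it in <PRE> tags."""
    return "<PRE>\n" + html.escape(text) + "</PRE>\n"


def render_manpage(name, section=None):
    """Run man(1) and return its plain-text output as an HTML fragment.

    Hypothetical helper: the real CGI script may pass different options.
    """
    cmd = ["man"]
    if section:
        cmd.append(section)
    cmd.append(name)
    # Assumed environment: fixed width and C locale for stable output.
    env = {"PATH": "/usr/bin:/bin", "MANWIDTH": "80", "LC_ALL": "C"}
    result = subprocess.run(cmd, capture_output=True, text=True,
                            check=True, env=env)
    return wrap_pre(strip_overstrike(result.stdout))
```

Escaping the text before wrapping it matters because manpages routinely contain characters such as < and &, which would otherwise be interpreted as markup.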
There is also a copy of the Ubuntu scripts in the source code.
There appear to be bash and Python implementations of the same thing. They process the whole archive (which is assumed to be local) and create a timestamp file for every package found, which avoids processing packages repeatedly (but all packages from the Packages listing are stat'd at every run). Both versions extract the manpages with dpkg -x, although the Python version uses the apt Python package to list files and a simple regex (^usr/share/man/.*\.gz$) to find manpages.
It keeps a cache of the md5sum of the package in "$PUBLIC_HTML_DIR/manpages/$dist/.cache/$name" to avoid looking at known packages. The bash version only compares the timestamp of the file with that of the package, and only checks the modification year.
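A minimal sketch of the two mechanisms just described, the filename regex and the md5sum cache, might look like the following. The regex is the one quoted above; the function names and the cache layout (one file per package name holding the hex digest) are assumptions for illustration, not the Ubuntu scripts' actual code.

```python
import hashlib
import re
from pathlib import Path

# Regex quoted in the text; paths are relative to the package root.
MANPAGE_RE = re.compile(r"^usr/share/man/.*\.gz$")


def find_manpages(paths):
    """Keep only the paths that look like compressed manpages."""
    return [p for p in paths if MANPAGE_RE.match(p)]


def md5sum(path):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_processing(deb_path, cache_dir, name):
    """Return True if the package is new or changed, and record its digest.

    Assumed layout: cache_dir/name contains the digest of the last
    processed version of the package.
    """
    cache_file = Path(cache_dir) / name
    current = md5sum(deb_path)
    if cache_file.exists() and cache_file.read_text().strip() == current:
        return False
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(current + "\n")
    return True
```

Comparing checksums is more expensive than comparing timestamps, since every package file must be read in full, but it does not depend on modification times being preserved on the mirror.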
To generate the HTML version of the manpage, both programs use the /usr/lib/w3m/cgi-bin/w3mman2html.cgi shipped with the w3m package.
Search is handled by a custom Python script that looks through manpage filenames or uses Google to do a full text search.
A new codebase written by dgilman is available on GitHub. It is a simple Python script with a sqlite backend. It extracts the tarfile with dpkg --fsys-tarfile then parses it with the Python tarfile library. It uses rather complicated regexes to find manpages and stores various apropos and metadata about manpages in the sqlite database. All manpages are unconditionally extracted.
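The general approach (dpkg --fsys-tarfile piped into the tarfile library, with results stored in sqlite) can be sketched as below. This is not dgilman's code: the function names, the simple path prefix check standing in for the regexes, and the table schema are all hypothetical, and the sketch holds the whole tar stream in memory for simplicity.

```python
import io
import sqlite3
import subprocess
import tarfile


def deb_tar_bytes(deb_path):
    """Return the filesystem tar of a .deb, as emitted by dpkg."""
    result = subprocess.run(["dpkg", "--fsys-tarfile", deb_path],
                            check=True, capture_output=True)
    return result.stdout


def iter_manpages(tar_bytes):
    """Yield (path, content) for regular files under usr/share/man/."""
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:") as tar:
        for member in tar:
            name = member.name
            # Member names in a .deb usually start with "./".
            if name.startswith("./"):
                name = name[2:]
            if member.isfile() and name.startswith("usr/share/man/"):
                yield name, tar.extractfile(member).read()


def store(conn, package, version, pages):
    """Insert manpages into a hypothetical 'manpages' table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS manpages ("
        "package TEXT, version TEXT, path TEXT, content BLOB, "
        "PRIMARY KEY (package, version, path))"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO manpages VALUES (?, ?, ?, ?)",
        [(package, version, path, content) for path, content in pages],
    )
    conn.commit()
```

A typical call would be store(conn, "foo", "1.0", iter_manpages(deb_tar_bytes("foo_1.0_all.deb"))), where the package name and file name are placeholders. Keying rows on package, version and path makes repeated runs idempotent, although every package is still extracted each time.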
After careful analysis of the above options, TheAnarcat started working on his own design, detailed here: https://anonscm.debian.org/git/collab-maint/debmans.git/tree/README.md
At this point, the code can extract all files from a mirror efficiently, although it has not been tested on a full mirror yet, because TheAnarcat is waiting for manpages group access and for the dependencies to be installed (setuptools from backports and/or python-click, python-apt and python-debian).
HTML conversion is not implemented yet, nor is searching, although it is expected that existing tools will make those steps more about integration than programming.
The old service used to run on glinka.debian.org. Teams/DSA requested that the service be moved to manziarly.debian.org.
Note that to configure a vhost on DSA machines, you need to follow the DSA subdomains documentation.
Discussions about manpages.debian.org can take place on the regular Teams/DDP channels, for example the #debian-doc IRC channel and email@example.com mailing list.
You can also subscribe to this wiki page to get updates; the page also functions as an ad-hoc forum.