Differences between revisions 15 and 17 (spanning 2 versions)
Revision 15 as of 2016-11-11 16:56:03
Size: 6349
Editor: TheAnarcat
Comment:
Revision 17 as of 2016-11-11 17:28:06
Size: 6716
Editor: TheAnarcat
Comment:

https://manpages.debian.org/ is a service providing online manpages in HTML format for the public.

Current status

Debian manpages archive

Possible implementations

There are three known implementations of "man to web" archive generators.

Current codebase

The current codebase is a set of Perl and bash CGI scripts that dynamically generate (and search through) manpages.

The current codebase could be migrated to manziarly, provided we have access to the manpages group.

The current codebase extracts manpages with dpkg --fsys-tarfile and the tar command. It also creates indexes using man -k for future searches. Manpages are stored in one directory per package version, so manpages that have disappeared from the archive are never garbage-collected.
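The extraction step described above can be sketched in Python as follows. This is only an illustration, not the Perl code actually in use: the function names `iter_manpages` and `manpages_from_deb` are made up for this example, and it assumes manpages live under `usr/share/man/` and that symlinked manpages can be ignored.

```python
import subprocess
import tarfile

# Assumption: manpages are shipped under this prefix inside the package.
MAN_PREFIX = "usr/share/man/"


def iter_manpages(fsys_tar):
    """Yield (path, content) for each regular manpage file in a tar stream.

    `fsys_tar` is a binary file object holding an uncompressed tar stream,
    such as the standard output of `dpkg --fsys-tarfile package.deb`.
    """
    # "r|" reads the tar as a non-seekable stream, which suits a pipe.
    with tarfile.open(fileobj=fsys_tar, mode="r|") as tar:
        for member in tar:
            name = member.name
            # Paths in package tarballs commonly start with "./".
            if name.startswith("./"):
                name = name[2:]
            # Symlinks and directories are skipped in this sketch.
            if member.isfile() and name.startswith(MAN_PREFIX):
                yield name, tar.extractfile(member).read()


def manpages_from_deb(deb_path):
    """Run dpkg --fsys-tarfile on a .deb and yield its manpages."""
    proc = subprocess.Popen(
        ["dpkg", "--fsys-tarfile", deb_path], stdout=subprocess.PIPE
    )
    try:
        yield from iter_manpages(proc.stdout)
    finally:
        proc.stdout.close()
        proc.wait()
```

Reading the stream member by member avoids unpacking the whole package to disk just to find a handful of files.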

The CGI script just calls man and outputs plain text wrapped in <PRE> tags.
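A rough Python equivalent of that rendering step is shown below, as a sketch only. The function name `wrap_pre` is hypothetical, and the overstrike clean-up is an assumption about what the raw man output may contain, not something taken from the existing CGI script.

```python
import html
import re


def wrap_pre(text, title):
    """Return a minimal HTML page showing plain manpage text in <PRE> tags."""
    # Assumption: the text may contain backspace overstrikes (bold and
    # underline as "x\bx" or "_\bx"); drop the first character of each pair.
    text = re.sub(".\x08", "", text)
    return (
        "<html><head><title>{}</title></head>"
        "<body><PRE>\n{}</PRE></body></html>"
    ).format(html.escape(title), html.escape(text))
```

Escaping matters here because manpages routinely contain `<` and `>` (for example in `#include <stdio.h>`), which a browser would otherwise treat as markup.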

There is also a copy of the Ubuntu scripts in the source code.

Ubuntu

Ubuntu has their own manpage repository at https://manpages.ubuntu.com/. Their codebase is a mix of Python, Perl and Bash.

There appear to be both a Bash and a Python implementation of the same thing. Both process the whole archive (which is assumed to be local) and create a timestamp file for every package found, which avoids processing packages repeatedly (although all packages from the Packages listing are still stat'd on every run). Both versions extract manpages with dpkg -x; the Python version additionally uses the apt Python package to list files, and a simple regex (^usr/share/man/.*\.gz$) to find manpages.
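The timestamp trick and the regex filter could look roughly like this. It is a sketch, not the Ubuntu code: the helper names are hypothetical, and it assumes the check compares modification times, which the actual scripts may or may not do.

```python
import os
import re

# The regex quoted above, applied to paths relative to the package root.
MANPAGE_RE = re.compile(r"^usr/share/man/.*\.gz$")


def is_manpage(path):
    """True if a path inside a package looks like a compressed manpage."""
    return MANPAGE_RE.match(path) is not None


def needs_processing(deb_path, stamp_path):
    """True if the package has no stamp yet or is newer than its stamp."""
    deb_mtime = os.stat(deb_path).st_mtime
    try:
        stamp_mtime = os.stat(stamp_path).st_mtime
    except FileNotFoundError:
        return True
    return deb_mtime > stamp_mtime


def mark_processed(stamp_path):
    """Create the stamp file, or refresh its mtime if it already exists."""
    os.makedirs(os.path.dirname(stamp_path) or ".", exist_ok=True)
    with open(stamp_path, "a"):
        pass
    os.utime(stamp_path, None)
```

Note that this still costs one or two stat calls per package on every run, which matches the limitation mentioned above.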

To generate the HTML version of the manpage, both programs use the /usr/lib/w3m/cgi-bin/w3mman2html.cgi shipped with the w3m package.

Search is handled by a custom Python script that looks through manpage filenames or uses Google to do a full-text search.

dgilman codebase

A new codebase written by dgilman is available on GitHub. It is a simple Python script with a sqlite backend. It extracts the tarfile with dpkg --fsys-tarfile then parses it with the Python tarfile library. It uses rather complicated regexes to find manpages and stores apropos data and other metadata about manpages in the sqlite database. All manpages are unconditionally extracted.
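To give an idea of what a sqlite-backed metadata store might look like, here is a minimal sketch. The table layout, column names and functions are all invented for illustration and are not taken from dgilman's code.

```python
import sqlite3

# Hypothetical schema: one row per manpage file found in a package version.
SCHEMA = """
CREATE TABLE IF NOT EXISTS manpages (
    package TEXT NOT NULL,
    version TEXT NOT NULL,
    path    TEXT NOT NULL,
    name    TEXT NOT NULL,
    section TEXT NOT NULL,
    summary TEXT,
    PRIMARY KEY (package, version, path)
)
"""


def open_db(path=":memory:"):
    """Open (or create) the database and make sure the table exists."""
    db = sqlite3.connect(path)
    db.execute(SCHEMA)
    return db


def record_manpage(db, package, version, path, name, section, summary):
    """Insert or update the metadata for one manpage."""
    with db:  # commits on success, rolls back on error
        db.execute(
            "INSERT OR REPLACE INTO manpages VALUES (?, ?, ?, ?, ?, ?)",
            (package, version, path, name, section, summary),
        )


def apropos(db, keyword):
    """Return (name, section, summary) rows whose name or summary match."""
    pattern = "%" + keyword + "%"
    rows = db.execute(
        "SELECT name, section, summary FROM manpages "
        "WHERE name LIKE ? OR summary LIKE ? ORDER BY name, section",
        (pattern, pattern),
    )
    return rows.fetchall()
```

A substring match with LIKE is a crude stand-in for apropos; a real deployment would more likely rely on a proper full-text index.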

anarcat design

The Minimum Viable Product for this project is a service that creates an HTML version of all the manpages of all the packages available in Debian, for all supported suites (including LTS). Note that the current codebase does not attempt to parse the manpage to generate headers; only the text is output.

apropos(1) functionality is considered an extra that can be implemented later with existing indexing tools like Xapian (or its web frontend, Omega), Lucene / Solr, Elasticsearch, or a simple homegrown JavaScript-based search (like readthedocs uses).

A possible design would be:

  1. fetch all manpages from the archive, store them on disk (makes them usable for tools like dman that browse remote webpages)

    • layout options:
      • Ubuntu: $DISTRIB_CODENAME/$LOCALE/man$i/$PAGE.$i.gz (see dman)

      • current codebase: "${OUTPUTDIR}/${pooldir}/${packagename}_${version}" (from manpage-extractor.pl)

  2. convert manpages to HTML so they are readable in a web browser, possible solutions here:
    • just the plaintext output of man wrapped in <PRE> tags

    • man2html is an old C program that ships with a bunch of CGI scripts

    • there's another man2html that is a Perl script, but I couldn't figure out how to use it correctly.

    • w3m has a Perl script that is used by the Ubuntu site

    • roffit is another Perl script. The version in Debian is ancient (2012) and doesn't display the man(1) synopsis correctly (newer versions from GitHub also fail)

    • pandoc can't, unfortunately, read manpages (only write)

    • man itself can generate an HTML version with man -Hcat man and the output is fairly decent, although there is no cross-referencing

  3. index HTML pages in a search engine of some sort
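Steps 1 and 2 of the design above can be sketched as follows. This is a rough illustration under several assumptions: the function names are hypothetical, the section string is used as-is for both the directory and the file suffix, the example values for pooldir and version are invented, and `man_to_html` simply shells out to the `man -Hcat` invocation mentioned in the list.

```python
import posixpath
import subprocess


def ubuntu_layout(codename, locale, page, section):
    """Path following $DISTRIB_CODENAME/$LOCALE/man$i/$PAGE.$i.gz."""
    filename = "{}.{}.gz".format(page, section)
    return posixpath.join(codename, locale, "man" + section, filename)


def current_layout(outputdir, pooldir, packagename, version):
    """Path following ${OUTPUTDIR}/${pooldir}/${packagename}_${version}."""
    dirname = "{}_{}".format(packagename, version)
    return posixpath.join(outputdir, pooldir, dirname)


def man_to_html(page):
    """Render a manpage to HTML by running `man -Hcat PAGE`.

    Assumption: man is installed with HTML output support; the result has
    no cross-references, as noted in the list above.
    """
    result = subprocess.run(
        ["man", "-Hcat", page], stdout=subprocess.PIPE, check=True
    )
    return result.stdout.decode("utf-8", errors="replace")
```

For example, `ubuntu_layout("xenial", "en", "ls", "1")` gives `xenial/en/man1/ls.1.gz`. The Ubuntu-style layout keys pages by suite and locale, while the current layout keys them by package version, which is what makes garbage collection harder.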

Parts 1 and 2 would be generated on manziarly and stored on the static.d.o CDN (see below). Part 3 would run on a separate server (or possibly a pair of servers) hosting the search cluster.

next steps:

  1. write the MVP, maybe based on David's work
  2. ask (through a rt.debian.org ticket) for access to the manpages group (done: access requested for anarcat)

  3. deploy a first dump of the manpages on manziarly
  4. make a patch to the dsa-puppet manifests or document how to deploy the scripts for the DSA

  5. ask DSA to deploy the new code, test
  6. if it works, fix the manpages.debian.org DNS to point to the static.d.o DNS. at this point, the MVP is in place

  7. make search work...

In the above setup, manziarly would be a master server for static file servers in the Debian.org infrastructure. Files saved there would be rsync'd to multiple frontend servers. How this is configured is detailed in the static-mirroring DSA documentation, but basically, we would need to ask the DSA team to add an entry for manpages.d.o there to serve static files.

Hardware

The old service used to run on glinka.debian.org. Teams/DSA requested that the service be moved to manziarly.debian.org.

Note that to configure a vhost on DSA machines, you need to follow the DSA subdomains documentation.

Forum

Discussions about manpages.debian.org can take place on the regular Teams/DDP channels, for example the #debian-doc IRC channel and debian-doc@lists.debian.org mailing list.

You can also subscribe to this wiki page to get updates; the page also functions as an ad-hoc forum.