The Debian Documentation Project considers modern DocBook XML in a UTF-8 environment a better choice than the older DebianDoc SGML.

Status of DocBook XML Transition

The user tag "docbook-xml-transition" is set under the user debian-doc@lists.debian.org so that the team can track the migration of documents in the Debian archive.

For example, bug #175064 can be tagged as follows.

$ bts user 'debian-doc@lists.debian.org' , usertags 175064 docbook-xml-transition

To make this migration smooth, you need debiandoc-sgml version 1.2.20 or newer (the wheezy version), which provides debiandoc2dbk. If you are using a squeeze environment, installing the wheezy version directly onto your system is good enough (it is a Perl script).
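
Whether the installed debiandoc-sgml is new enough can be checked before starting. The following is a minimal sketch and not part of the package: the helper name version_ge is hypothetical, it assumes GNU sort with the -V option, and it only approximates Debian version ordering (on a Debian system, dpkg --compare-versions is the authoritative comparison).

```shell
# Hypothetical helper: succeed if version $1 is greater than or equal to $2.
# Assumes GNU sort with -V (version sort); approximates Debian ordering.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Example use on a Debian system (dpkg-query prints the installed version):
#   ver=$(dpkg-query -W -f='${Version}' debiandoc-sgml)
#   version_ge "$ver" 1.2.20 || echo "debiandoc-sgml is too old" >&2
```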

How to convert DebianDoc SGML source into DocBook XML

Conversion from SGML to XML.

Let's assume you have the following in your working directory:

Please note you need to install

Step 0: Prepare source

Copy the example scripts from /usr/share/doc/debiandoc-sgml/examples to the working directory and make them executable.
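
The copy step can be sketched as a small shell function. This is only an illustration: the function name copy_examples is hypothetical, and it assumes the example scripts are plain regular files in the source directory.

```shell
# Hypothetical helper: copy every regular file from directory $1 into
# directory $2 and mark the copies executable.
copy_examples() {
    ce_src=$1
    ce_dst=$2
    mkdir -p "$ce_dst" || return 1
    for ce_file in "$ce_src"/*; do
        [ -f "$ce_file" ] || continue
        cp "$ce_file" "$ce_dst"/ || return 1
        chmod +x "$ce_dst/$(basename "$ce_file")" || return 1
    done
}

# Example use:
#   copy_examples /usr/share/doc/debiandoc-sgml/examples .
```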

Step 1: Check DebianDoc SGML (en) is in good shape

Verify that the DebianDoc SGML (en) source is usable by building the HTML file.

 $ debiandoc2html -1 manual.en.sgml

If you need files like funky.ent, generate them now.

Step 2: Convert DebianDoc SGML (en) to DocBook XML (basic)

Test the conversion from the DebianDoc SGML (en) source to DocBook XML.

 $ debiandoc2dbk -1 manual.en.sgml

This should now work without problems. (If not, investigate...)

Verify the generated XML file by building HTML from this DocBook XML for English. This can be done by:

 $ ./debiandoc2dbkpo --html-dbk manual

If this builds the HTML files without problems, you have converted the English source to DocBook XML files, without comments and with all entities embedded.

Step 3: Convert DebianDoc SGML (non-en) to DocBook XML (basic)

Verify the DebianDoc SGML to DocBook XML conversion from the PO files for the languages xx and yy.

 $ ./debiandoc2dbkpo --html-dbk manual xx yy

If this builds the HTML files without problems, you have converted the English and non-English sources to DocBook XML files, without comments and with all entities embedded.
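
Before running the conversion for several languages, it can help to confirm that a PO file exists for each of them. The sketch below is an assumption-laden illustration: the function name missing_po is hypothetical, and it assumes the MANUAL.LANG.po file naming used elsewhere in this document.

```shell
# Hypothetical helper: print the name of each missing PO file for manual $1
# and the language codes given as the remaining arguments.
# Assumes the MANUAL.LANG.po naming used in this document.
# Returns non-zero if any file is missing.
missing_po() {
    mp_manual=$1
    shift
    mp_status=0
    for mp_lang in "$@"; do
        if [ ! -f "$mp_manual.$mp_lang.po" ]; then
            printf '%s\n' "$mp_manual.$mp_lang.po"
            mp_status=1
        fi
    done
    return "$mp_status"
}

# Example use:
#   missing_po manual xx yy && ./debiandoc2dbkpo --html-dbk manual xx yy
```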

The ./debiandoc2dbkpo script runs the ./debiandoc-lint4po script, which analyzes each PO file and removes broken msgstr entries in the most common cases.

"E: ..." messages should not occur and may halt processing.

"W: ..." messages are likely to appear. They are not critical: the ./debiandoc-lint4po script removes the translations that are not suitable for the subsequent PO files for the DocBook XML files.
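
To get a quick overview of a long run, the "E: ..." and "W: ..." messages can be counted. This is a minimal sketch: the function name summarize_lint_log is hypothetical, and it assumes each message starts at the beginning of a line with the prefix shown above.

```shell
# Hypothetical helper: read tool output on standard input and report how many
# lines start with "E: " (errors) and "W: " (warnings).
summarize_lint_log() {
    awk '
        /^E: / { errors++ }
        /^W: / { warnings++ }
        END { printf "errors=%d warnings=%d\n", errors, warnings }
    '
}

# Example use (assuming the messages may go to standard error):
#   ./debiandoc2dbkpo --html-dbk manual xx yy 2>&1 | summarize_lint_log
```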

Please note that this is just a first step to check that you have a decent source.

Step 4: Debug original PO files

In order to have the best conversion result, let's improve the health of the PO files.

The previous commands should have listed many warnings and possibly errors.

You can extract the broken parts of each PO file as follows.

 $ ./debiandoc-lint4po -v -u <$MANUAL.xx.po >$MANUAL.xx.unlint.po
 $ ./debiandoc-lint4po -v -u <$MANUAL.yy.po >$MANUAL.yy.unlint.po

This should help you fix glitches in the original PO file.

If it is not too much trouble, please fix such problems in the original PO files for DebianDoc SGML (non-en). This improves the quality of the conversion, but it is not critical if you do not mind losing these parts.

Step 5: Keep entities

The above conversion embeds all the entities into the converted DocBook XML files, and all the comments in the source are lost.

Here is a slightly more complicated way to convert, still automated with ./debiandoc2dbkpo.

To preserve *.ent, create a touched-up version of it (or them) as follows:

 $ mv funky.ent funkey-orig.ent
 $ ./debiandoc2dbk-unent <funkey-orig.ent > funky.ent

Verify the SGML to XML conversion with this touched-up funky.ent included, by building HTML via XML for English and the languages xx and yy.

 $ ./debiandoc2dbkpo --html-po manual xx yy

This prompts you with:

### Edit manual.en.dbk.new to include entity reference and type ENTER. ###

Then go to another terminal and put the following at the top of manual.en.dbk.new.

<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN"
  "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd" [
  <!ENTITY % funkey   SYSTEM "funkey-orig.ent" > %funkey;
]>

in place of the following, which appears at the top of manual.en.dbk.

<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN"
  "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd" [
]>

Then type "ENTER" to continue.
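
The manual edit above amounts to inserting one entity declaration line before the closing "]>" of the DOCTYPE. As a sketch only, it could be scripted as below; the function name add_entity_ref is hypothetical, and it assumes the "]>" that closes the DOCTYPE is the first line in the file starting with "]>".

```shell
# Hypothetical helper: insert an external parameter entity declaration for
# entity name $2 (stored in file $3) just before the first line of file $1
# that starts with "]>".
add_entity_ref() {
    ae_file=$1
    ae_tmp=$(mktemp) || return 1
    awk -v name="$2" -v ent="$3" '
        !seen && /^\]>/ {
            printf "  <!ENTITY %% %s SYSTEM \"%s\" > %%%s;\n", name, ent, name
            seen = 1
        }
        { print }
    ' "$ae_file" >"$ae_tmp" && cat "$ae_tmp" >"$ae_file"
    ae_status=$?
    rm -f "$ae_tmp"
    return "$ae_status"
}

# Example use, matching the declaration shown above:
#   add_entity_ref manual.en.dbk.new funkey funkey-orig.ent
```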

This should work in most cases but may fail while creating the PO files (*.??.dbk.po and *.??.dbk.pox). Possible sources of problems are discussed later.

If this builds the PO and HTML files without problems, you have converted to DocBook XML for English and the languages xx and yy, without comments.

Since the PO creation above uses msgtranslated to reset msgstr contents that are identical to the msgid contents, the generated PO files contain untranslated strings. You may wish to run a PO file editor such as poedit to check and touch up the files.

 $ for f in manual.*.pox; do poedit "$f"; done

If you are happy with the result and wish to keep the original translator headers, you can do the following:

 $ for i in $(ls manual.*.pox|sed -e 's/^manual.//g' -e 's/.pox$//'); do sed -i -e '1,/^msgid "en"/!d' manual.$i.po ; sed -e '1,/^msgid "en"/d' manual.$i.pox >> manual.$i.po; done

Step 6: Keep comments

The normal conversion process strips comments. The idea is to convert comments

 <!--- comment ... --->

into

 <p>=====COMMENT===== 
 comment ... 
 =====TNEMMOC=====</p>

so that they can be restored later. As long as comments are located between normal paragraphs, the example script ./debiandoc2dbk-wrap does a good enough job. You may need some manual edits before using it. To start with, all comments before <book> and after </book> need to be removed.

 $ edit manual.en.sgml
 $ ./debiandoc2dbk-wrap <manual.en.sgml >manual-comment.en.sgml
 $ ./debiandoc2dbkpo manual-comment

This takes a bit of trial and error. Repeat it until you get HTML.

If you are successful, create DocBook XML with comments by:

 $ ./debiandoc2dbk-unwrap <manual-comment.en.dbk >manual.en.dbk

If you use this as the XML file, you may get some fuzzy entries due to whitespace differences. These may need manual tweaks.
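
The wrap and unwrap idea in this step can be illustrated with a toy round trip. These functions are hypothetical stand-ins, not the example scripts themselves: they handle only comments that start and end on the same line, use the standard two-dash comment delimiters, and restore from the same <p> markup they produce, whereas the real unwrap step presumably works on the converted DocBook paragraph markup.

```shell
# Hypothetical stand-in for the wrap step: turn a single-line comment into a
# marked paragraph so that it survives the conversion.
wrap_comments() {
    sed -e 's|<!--\(.*\)-->|<p>=====COMMENT=====\1=====TNEMMOC=====</p>|'
}

# Hypothetical stand-in for the unwrap step: turn a marked paragraph back
# into a comment.
unwrap_comments() {
    sed -e 's|<p>=====COMMENT=====\(.*\)=====TNEMMOC=====</p>|<!--\1-->|'
}

# Example use:
#   wrap_comments <manual.en.sgml | unwrap_comments | diff - manual.en.sgml
```

Lines without comments pass through both functions unchanged, so the diff in the usage example shows only comments that the toy patterns could not round-trip, such as multi-line comments.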

Problems and their solution

Sometimes a translator adds locale-specific modifications that work in SGML, but the generated XML may not have a one-to-one correspondence with the original.

Sometimes the DTD model of DebianDoc SGML differs from that of DocBook XML, which makes normal conversion difficult.

These are fundamental problems and need to be addressed manually.

If a translator decided to add new content and sneaked it in by clever additions, it may cause problems. The ./debiandoc-lint4po script should have removed most of these by now.

If the English SGML source has different strings at different content levels while their translated text happens to be the same, you see:

...
msgid (at maint-guide.en.lint.dbk:36) is of type 'Content of: <book><chapter><section><itemizedlist><listitem><itemizedlist><listitem><para>' while
msgstr (at maint-guide.ja.lint.dbk:36 maint-guide.ja.lint.dbk:59) is of type 'Content of: <book><chapter><section><itemizedlist><listitem><para>'.
Original text: <literal>while quilt push; do quilt refresh; done</literal> to apply all patches while removing <emphasis>fuzz</emphasis>;
Translated text: <literal>while quilt push; do quilt refresh; done</literal> として <emphasis>fuzz</emphasis> を削除しながら全てのパッチを適用します。
(result so far dumped to gettextization.failed.po)
...

This error needs to be worked around by adding bogus content to one of the entries in the SGML PO file.

From:

msgid "foo foo and foo"
msgstr "bar bar XX bar"

msgid "foo foo, and foo"
msgstr "bar bar XX bar"

To:

msgid "foo foo and foo"
msgstr "bar bar XX bar"
 
msgid "foo foo, and foo"
msgstr "bar bar XX bar[XXX_FIXME1_XXX]"

These [XXX_FIXME.*_XXX] markers can be recovered in the final DocBook XML and its PO files by manual touch-up. Since they are so common, ./debiandoc2dbk-ent can handle their recovery.
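
For the manual touch-up, it helps to locate the remaining markers and, once they are no longer needed, to remove them. The following is a sketch under assumptions: both function names are hypothetical, the marker pattern is taken from the example above, and this is not necessarily what ./debiandoc2dbk-ent does.

```shell
# Hypothetical helper: list lines (with line numbers) that still contain a
# [XXX_FIXME..._XXX] marker. Reads the given files, or standard input.
list_fixme_markers() {
    grep -n '\[XXX_FIXME[^]]*_XXX\]' "$@"
}

# Hypothetical helper: remove every [XXX_FIXME..._XXX] marker from the text
# on standard input.
strip_fixme_markers() {
    sed -e 's/\[XXX_FIXME[^]]*_XXX\]//g'
}

# Example use:
#   list_fixme_markers manual.xx.po
#   strip_fixme_markers <manual.xx.po >manual.xx.clean.po
```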

Sometimes a translator places additional content within an additional <footnote>...</footnote>. For this type, the PO file proofing script makes a best effort to retain the translation by mangling the <footnote>...</footnote> tags.

From:

msgid "foo foo and foo"
msgstr "bar bar <footnote>XX</footnote> bar"

To:

msgid "foo foo and foo"
msgstr "bar bar @@@[tagopen_footnote]@@@XX@@@[tagclose_footnote]@@@ bar"

This content is preserved through the DocBook conversion, but it usually needs some manual touch-up later in the source PO file and the *.ent file.
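
As a starting point for that touch-up, the mangled tags can be turned back into real tags mechanically. This is a hedged sketch: the function name restore_mangled_tags is hypothetical, and the marker syntax is inferred from the single example above. The result still needs review, for example because DocBook expects block content such as <para> inside <footnote>.

```shell
# Hypothetical helper: convert @@@[tagopen_NAME]@@@ and @@@[tagclose_NAME]@@@
# markers on standard input back into <NAME> and </NAME> tags.
restore_mangled_tags() {
    sed -e 's|@@@\[tagopen_\([A-Za-z0-9]*\)\]@@@|<\1>|g' \
        -e 's|@@@\[tagclose_\([A-Za-z0-9]*\)\]@@@|</\1>|g'
}

# Example use:
#   restore_mangled_tags <manual.xx.dbk.po >manual.xx.dbk.restored.po
```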

Note: UTF-8 and DebianDoc SGML

Recent versions of the debiandoc-sgml package support UTF-8 encoded source and generated files. But this is a hack.
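
Before relying on UTF-8 handling, it may be worth confirming that a source file really is valid UTF-8. The sketch below assumes the iconv command is available; the function name is_utf8 is hypothetical.

```shell
# Hypothetical helper: succeed if file $1 is valid UTF-8.
# Relies on iconv failing when it meets an invalid byte sequence.
is_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

# Example use:
#   is_utf8 manual.en.sgml || echo "manual.en.sgml is not valid UTF-8" >&2
```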