- Status of DocBook XML Transition
- How to convert DebianDoc SGML source into DocBook XML
- Note: UTF-8 and DebianDoc SGML
Status of DocBook XML Transition
For example, bug #175064 can be tagged as follows.
$ bts user 'firstname.lastname@example.org' , usertags 175064 docbook-xml-transition
To make this migration smooth, you need debiandoc-sgml version 1.2.20 or newer, which supports debiandoc2dbk (the wheezy version). If you are using a squeeze environment, installing the wheezy version directly onto your system is good enough (it is a Perl script).
How to convert DebianDoc SGML source into DocBook XML
Conversion from SGML to XML.
Let's assume you have the following files in your working directory:
- manual.en.sgml manual.xx.po manual.yy.po funky.ent
Please note that you need to install the following packages:
- docbook-xsl moreutils libxml2-utils
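Before starting, it may help to check that the helper tools are on your PATH. The following is only a sketch: the function name missing_tools is made up for illustration, and it is an assumption that the example scripts call xmllint (from libxml2-utils) and sponge (from moreutils). docbook-xsl ships stylesheets rather than a command, so it is not covered here.

```sh
# Print each given command name that is not found in PATH.
# (missing_tools is a hypothetical helper, not part of debiandoc-sgml.)
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# xmllint comes from libxml2-utils and sponge from moreutils;
# whether the example scripts need exactly these is an assumption.
missing_tools xmllint sponge
```

If nothing is printed, both commands were found.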
Step 0: Prepare source
Copy the example scripts in /usr/share/doc/debiandoc-sgml/examples to the working directory and make them executable.
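This step can be sketched as a small shell function. The function name copy_examples is hypothetical; it simply copies every regular file from a source directory and marks the copies executable.

```sh
# Copy every regular file from an examples directory into a working
# directory and mark the copies executable.
# (copy_examples is a hypothetical helper name.)
copy_examples() {
    src=$1
    dst=$2
    for f in "$src"/*; do
        [ -f "$f" ] || continue
        cp "$f" "$dst"/ && chmod +x "$dst/$(basename "$f")"
    done
}

# Typical call, run from the working directory:
# copy_examples /usr/share/doc/debiandoc-sgml/examples .
```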
Step 1: Check DebianDoc SGML (en) is in good shape
Verify that the DebianDoc SGML (en) source is usable by building an HTML file.
$ debiandoc2html -1 manual.en.sgml
If you need files like funky.ent, generate them now.
Step 2: Convert DebianDoc SGML (en) to DocBook XML (basic)
$ debiandoc2dbk -1 manual.en.sgml
This should work now without problem. (If not, investigate...)
Verify the generated XML file by building HTML from this DocBook XML for English. This can be done by:
$ ./debiandoc2dbkpo --html-dbk manual
If this builds the HTML files OK, you have converted to DocBook XML files for English, without comments and with all entities embedded.
Step 3: Convert DebianDoc SGML (non-en) to DocBook XML (basic)
$ ./debiandoc2dbkpo --html-dbk manual xx yy
If this builds the HTML files OK, you have converted to DocBook XML files for English and non-English, without comments and with all entities embedded.
The ./debiandoc2dbkpo script runs the ./debiandoc-lint4po script to analyze the PO files and remove broken msgstr entries in the most common cases.
"E: ..." should not happen and may halt processing.
"W: ..." is likely to appear. These are not critical; the ./debiandoc-lint4po script removes translations that are not suitable for the subsequent PO files for the DocBook XML files.
Please note this is just a first step to check that you have a decent source.
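To get a quick overview of how many errors and warnings a run produced, a saved log can be summarized. This is only a sketch: the function name count_lint_messages is made up, and it assumes each message starts its line with "E: " or "W: ", as quoted above.

```sh
# Count error and warning lines in a log read from standard input.
# Assumes each message starts its line with "E: " or "W: ".
# (count_lint_messages is a hypothetical helper name.)
count_lint_messages() {
    awk '/^E: /{e++} /^W: /{w++} END{printf "errors=%d warnings=%d\n", e, w}'
}

# Possible use, assuming the output of the run was saved to build.log:
# count_lint_messages <build.log
```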
Step 4: Debug original PO files
In order to have the best conversion result, let's improve the health of the PO files.
The previous commands should have listed many warnings and possibly errors.
You can extract the broken parts of the PO files as follows.
$ ./debiandoc-lint4po -v -u <$MANUAL.xx.po >$MANUAL.xx.unlint.po
$ ./debiandoc-lint4po -v -u <$MANUAL.yy.po >$MANUAL.yy.unlint.po
This should help you fix glitches in the original PO file.
If it is not too much trouble, please fix such problems in the original PO files of the DebianDoc SGML (non-en). This improves the quality of the conversion, but it is not critical if you do not mind losing these parts.
Step 5: Keep entities
The above conversion embeds all the entities into the converted DocBook XML files, and all the comments in the source are lost.
Here is a slightly more complicated way to convert, automated with ./debiandoc2dbkpo.
In order to preserve *.ent, create a touched-up version of it (or them) as follows:
$ mv funky.ent funky-orig.ent
$ ./debiandoc2dbk-unent <funky-orig.ent >funky.ent
Verify the SGML to XML conversion with this touched-up funky.ent included by building HTML via XML for English and the other languages (xx and yy).
$ ./debiandoc2dbkpo --html-po manual xx yy
This prompts you with:
### Edit manual.en.dbk.new to include entity reference and type ENTER. ###
Then go to another terminal and put the following at the top of manual.en.dbk.new:
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN"
"http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd" [
<!ENTITY % funkey SYSTEM "funky-orig.ent" > %funkey;
]>
This replaces the following declaration, which you see at the top of manual.en.dbk:
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN"
"http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd" [
]>
Then type "ENTER" to continue.
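If you prefer not to edit the DOCTYPE by hand, the insertion can be scripted. The following is a minimal sketch, not part of the example scripts: the function name add_entity_ref is hypothetical, and it assumes the file contains an empty internal subset written as "[", optional whitespace, then "]>".

```sh
# Insert a parameter entity declaration and its reference into the
# first empty internal subset "[ ]>" of an XML file, editing in place.
# Usage: add_entity_ref FILE ENTITY_NAME ENTITY_FILE
# (add_entity_ref is a hypothetical helper name.)
add_entity_ref() {
    ENT_NAME=$2 ENT_FILE=$3 perl -0777 -pi -e '
        s/\[\s*\]>/[\n<!ENTITY % $ENV{ENT_NAME} SYSTEM "$ENV{ENT_FILE}">\n%$ENV{ENT_NAME};\n]>/
    ' "$1"
}

# Possible call; pass whichever entity file your DOCTYPE should load:
# add_entity_ref manual.en.dbk.new funkey <entity-file>
```

Only the first match is rewritten, so a later "[ ]>" elsewhere in the file is left alone.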
This should work in most cases but may fail while creating the PO files (*.??.dbk.po and *.??.dbk.pox). Possible sources of problems are discussed later.
If this builds the PO and HTML files OK, you have converted to DocBook XML for English and the other languages (xx and yy) without comments.
Since the above PO creation uses msgtranslated to reset msgstr contents when they are the same as the msgid contents, the generated PO files contain untranslated strings. You may wish to run a PO file editor such as poedit to check and touch up the files.
$ for i in $(ls manual.*.pox|sed -e 's/^manual.//g' -e 's/.pox$//'); do poedit manual.$i.pox ; done
If you are happy with this and wish to keep the original translator headers, you can do the following:
$ for i in $(ls manual.*.pox|sed -e 's/^manual.//g' -e 's/.pox$//'); do sed -i -e '1,/^msgid "en"/!d' manual.$i.po ; sed -e '1,/^msgid "en"/d' manual.$i.pox >> manual.$i.po; done
Step 6: Keep comments
The normal conversion process strips comments. The idea is to convert comments such as
<!--- comment ... --->
into
<p>=====COMMENT===== comment ... =====TNEMMOC=====</p>
so these can be restored later. As long as comments are located between normal paragraphs, the example script ./debiandoc2dbk-wrap does a good enough job. You may need some manual edits before using it. All comments before <book> and after </book> need to be removed to start with.
$ edit manual.en.sgml
$ ./debiandoc2dbk-wrap <manual.en.sgml >manual-comment.en.sgml
$ ./debiandoc2dbkpo manual-comment
This takes a bit of trial and error. Repeat until you get HTML.
If you are successful, create the DocBook XML with comments by:
$ ./debiandoc2dbk-unwrap <manual-comment.en.dbk >manual.en.dbk
If you use this as the XML file, you may get some fuzzy entries due to whitespace differences. These may need manual tweaks.
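The wrap and unwrap idea in this step can be illustrated with two small sed filters. These are not the actual ./debiandoc2dbk-wrap and ./debiandoc2dbk-unwrap scripts, only a simplified sketch with made-up function names. It handles only comments that fit on one line, and it assumes the wrapped <p> paragraph comes out of the conversion as a DocBook <para>.

```sh
# Turn a single-line SGML comment into a marked paragraph so that it
# survives the conversion. (wrap_comments is a hypothetical name.)
wrap_comments() {
    sed -e 's|<!--\(.*\)-->|<p>=====COMMENT=====\1=====TNEMMOC=====</p>|'
}

# Turn a marked DocBook paragraph back into a comment.
# Assumes the converter emitted the paragraph as <para>...</para>.
unwrap_comments() {
    sed -e 's|<para>=====COMMENT=====\(.*\)=====TNEMMOC=====</para>|<!--\1-->|'
}

# Possible use on a one-line comment:
# printf '%s\n' '<!-- note -->' | wrap_comments
```

Multi-line comments and comments inside other elements need more care, which is where the manual edits mentioned above come in.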
Problems and their solution
Sometimes translators add locale-specific modifications that work in SGML, but the generated XML may not have a one-to-one correspondence.
These are fundamental problems and need to be addressed manually.
If a translator decided to add new content and sneaked it in by clever additions, it may cause problems. The ./debiandoc-lint4po script should have removed most of those by now.
If the English SGML source has different content at different content levels while the translated text happens to be the same, then you see:
... msgid (at maint-guide.en.lint.dbk:36) is of type 'Content of: <book><chapter><section><itemizedlist><listitem><itemizedlist><listitem><para>' while msgstr (at maint-guide.ja.lint.dbk:36 maint-guide.ja.lint.dbk:59) is of type 'Content of: <book><chapter><section><itemizedlist><listitem><para>'. Original text: <literal>while quilt push; do quilt refresh; done</literal> to apply all patches while removing <emphasis>fuzz</emphasis>; Translated text: <literal>while quilt push; do quilt refresh; done</literal> として <emphasis>fuzz</emphasis> を削除しながら全てのパッチを適用します。 (result so far dumped to gettextization.failed.po) ...
This error needs to be worked around by adding bogus content to one of the entries in the SGML PO file. For example, change
msgid "foo foo and foo"
msgstr "bar bar XX bar"
msgid "foo foo, and foo"
msgstr "bar bar XX bar"
to
msgid "foo foo and foo"
msgstr "bar bar XX bar"
msgid "foo foo, and foo"
msgstr "bar bar XX bar[XXX_FIXME1_XXX]"
These [XXX_FIXME.*_XXX] markers can be recovered in the final DocBook XML and its PO files by manual touch-up. Since these are so common, ./debiandoc2dbk-ent can handle their recovery.
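For illustration, removing such markers by hand amounts to a one-line sed filter. This sketch is not what ./debiandoc2dbk-ent does internally; the function name strip_fixme is made up, and it assumes the markers look like the [XXX_FIXME1_XXX] example above.

```sh
# Remove bogus [XXX_FIXME..._XXX] markers from text on standard input.
# (strip_fixme is a hypothetical helper name.)
strip_fixme() {
    sed -e 's/\[XXX_FIXME[^]]*_XXX]//g'
}

# To list the lines that still carry a marker in a file:
# grep -n 'XXX_FIXME' manual.xx.dbk.po
```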
Sometimes a translator places additional content within an additional <footnote>...</footnote>. For this type, the PO file proofing script makes a best effort to retain the translation by mangling the <footnote>...</footnote> tags. For example,
msgid "foo foo and foo"
msgstr "bar bar <footnote>XX</footnote> bar"
becomes
msgid "foo foo and foo"
msgstr "bar bar @@@[tagopen_footnote]@@@XX@@@[tagclose_footnote]@@@ bar"
This content is preserved through the DocBook conversion, but some manual touch-up to the source PO file and *.ent file is usually needed later.
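As a starting point for that touch-up, the mangled tags can be turned back into real tags with sed. This is a sketch under the assumption that the mangled form always follows the @@@[tagopen_NAME]@@@ and @@@[tagclose_NAME]@@@ pattern shown above; the function name restore_tags is made up.

```sh
# Convert mangled tag markers back into real tags.
# (restore_tags is a hypothetical helper name.)
restore_tags() {
    sed -e 's|@@@\[tagopen_\([A-Za-z0-9]*\)]@@@|<\1>|g' \
        -e 's|@@@\[tagclose_\([A-Za-z0-9]*\)]@@@|</\1>|g'
}

# Possible use on a converted PO file, writing to a new file:
# restore_tags <manual.xx.dbk.po >manual.xx.dbk.restored.po
```

The result still needs review, because a DocBook footnote cannot hold bare text and its content has to be wrapped in an element such as <para>.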
Note: UTF-8 and DebianDoc SGML
Recent versions of the debiandoc-sgml package support UTF-8 encoded source and generated files, but this is a hack.