Differences between revisions 73 and 74
Revision 73 as of 2008-05-24 05:33:50
Size: 39687
Editor: OsamuAoki
Comment:
Revision 74 as of 2008-05-24 05:46:57
Size: 39727
Editor: OsamuAoki
Comment:

Do not use Edit(GUI) button.


Copyright 2007, 2008 Osamu Aoki GPL, (Please agree to GPL, GPL2, and any version of GPL which is compatible with DFSG if you update any part of this wiki page)

I welcome your contributions to update the wiki pages. You must follow these rules:

  • Do not use Edit(GUI) button of MoinMoin.

  • You can update anytime for:
    • grammar errors
    • spelling errors
    • moved URL location
    • package name transition adjustment (emacs23 etc.)
    • clearly broken script.
  • Before updating real contents:

Data conversion

Standards-based tools are in very good shape, but support for proprietary data formats is limited.

Text data conversion tools

The following packages for text data conversion caught my eye:

List of text data conversion tools.

|| package || popcon || size || keyword || function ||
|| libc6 || 37751 || - || charset || The text encoding conversion between locales with iconv command. (fundamental) ||
|| recode || 1039 || - || charset+eol || The text encoding conversion between locales. (versatile, more aliases and features) ||
|| konwert || 250 || - || charset || The text encoding conversion between locales. (fancy) ||
|| nkf || 235 || - || charset || The character set translator for Japanese. ||
|| tcs || 27 || - || charset || The character set translator. ||
|| unaccent || 20 || - || charset || Replace accented letters by their unaccented equivalent. ||
|| tofrodos || 851 || - || eol || The text format converter between DOS and Unix: fromdos and todos ||
|| macutils || 136 || - || eol || The text format converter between Macintosh and Unix: frommac and tomac ||

Basics of encoding

The default text data format on the Debian system uses the [http://en.wikipedia.org/wiki/UTF-8 UTF-8] encoding of Unicode characters for "LANG=xx_YY.UTF-8" (see: @{@langvariable@}@ and @{@thelocale@}@). For "LANG=C", the [http://en.wikipedia.org/wiki/ASCII ASCII] character set is used instead.

The [http://en.wikipedia.org/wiki/UTF-8 UTF-8] encoding is a variable-length multibyte encoding that uses 1 to 4 bytes per Unicode code point. ASCII data, which consists only of 7-bit codes, is always valid UTF-8 data.
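As a rough illustration of the multibyte nature of UTF-8, the following sketch counts the bytes used by three sample characters. The sample characters and the variable names are my own choice for illustration; the bytes are written as octal escapes so that the result does not depend on the current locale.

```shell
# Count the bytes in the UTF-8 form of three sample characters.
# Octal escapes spell out the UTF-8 bytes, so the locale does not matter.
ascii_len=$(printf 'A' | wc -c | tr -d ' ')            # U+0041, plain ASCII
latin_len=$(printf '\303\251' | wc -c | tr -d ' ')     # U+00E9, e with acute
kana_len=$(printf '\343\201\202' | wc -c | tr -d ' ')  # U+3042, hiragana a
echo "ASCII A: $ascii_len, e-acute: $latin_len, hiragana a: $kana_len"
```

The ASCII character needs a single byte, which is why pure ASCII files need no conversion at all.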

For character sets which fit in a single byte such as [http://en.wikipedia.org/wiki/ASCII ASCII] and [http://en.wikipedia.org/wiki/ISO/IEC_8859 ISO-8859] character sets, the [http://en.wikipedia.org/wiki/Character_encoding character encoding] means almost the same thing as the character set.

For character sets with many characters such as [http://en.wikipedia.org/wiki/JIS_X_0213 JIS X 0213] for Japanese or [http://en.wikipedia.org/wiki/Universal_Character_Set Universal Character Set (UCS, Unicode, ISO-10646-1)] for practically all languages, there are many encoding schemes to fit them into a sequence of bytes, such as [http://en.wikipedia.org/wiki/EUC EUC] and [http://en.wikipedia.org/wiki/ISO/IEC_2022 ISO/IEC 2022 (also known as JIS X 0202)] for Japanese, or [http://en.wikipedia.org/wiki/UTF-8 UTF-8] and [http://en.wikipedia.org/wiki/UTF-32/UCS-4 UTF-32/UCS-4] for Unicode. So there is a clear distinction between the character set and the character encoding.

The [http://en.wikipedia.org/wiki/Code_page code page] is used as a synonym for character encoding tables, particularly the vendor specific ones.

List of encoding values and their usage.

|| encoding value || usage ||
|| [http://en.wikipedia.org/wiki/ASCII ASCII] || Standard US 7 bit code. Simply set locale to "LANG=C". ||
|| [http://en.wikipedia.org/wiki/UTF-8 UTF-8] || Standard multilingual compatibility for all modern OSs. ||
|| [http://en.wikipedia.org/wiki/ISO/IEC_2022 ISO-2022-JP] || Standard encoding for Japanese e-mail which uses only 7 bit codes. ||
|| [http://en.wikipedia.org/wiki/ISO/IEC_8859-1 ISO-8859-1] || Old standard for western European languages, ASCII+accented characters. ||
|| [http://en.wikipedia.org/wiki/ISO/IEC_8859-2 ISO-8859-2] || Old standard for eastern European languages, ASCII+accented characters. ||
|| [http://en.wikipedia.org/wiki/ISO/IEC_8859-15 ISO-8859-15] || Old standard for western European languages, ASCII+accented characters+euro sign. ||
|| [http://en.wikipedia.org/wiki/EUC eucJP] || Old Japanese UNIX standard 8 bit code, completely different from shift-jis. ||
|| [http://en.wikipedia.org/wiki/KOI8-R KOI8-R] || Old Russian UNIX standard for the Cyrillic alphabet. ||
|| [http://en.wikipedia.org/wiki/Windows-1252 CP1252] || Code page 1252, Microsoft Windows style ISO-8859-15 variant. ||
|| [http://en.wikipedia.org/wiki/Windows-1251 CP1251] || Code page 1251, Microsoft Windows style encoding for the Cyrillic alphabet. ||
|| [http://en.wikipedia.org/wiki/Shift-jis Shift-JIS] || JIS X 0208 Appendix 1 standard, for Japanese. ||
|| [http://en.wikipedia.org/wiki/Code_page_932 CP932] || Code page 932, Microsoft Windows style shift-jis variant, for Japanese. ||
|| [http://en.wikipedia.org/wiki/Code_page_936 CP936] || Code page 936, Microsoft Windows style [http://en.wikipedia.org/wiki/GB2312 GB2312], [http://en.wikipedia.org/wiki/GBK GBK], or [http://en.wikipedia.org/wiki/GB18030 GB18030] variant, for Simplified Chinese. ||
|| [http://en.wikipedia.org/wiki/Code_page_949 CP949] || Code page 949, Microsoft Windows style [http://en.wikipedia.org/wiki/Extended_Unix_Code#EUC-KR EUC-KR] or Unified Hangul Code variant, for Korean. ||
|| [http://en.wikipedia.org/wiki/Code_page_950 CP950] || Code page 950, Microsoft Windows style [http://en.wikipedia.org/wiki/Big5 Big5] variant, for Traditional Chinese. ||

The vendor specific old non-UTF-8 encoding systems tend to have minor but annoying differences on some characters, such as graphic ones, for many countries. The deployment of the UTF-8 system by the modern OSs practically solved these conflicting encoding issues.

To convert a text file with iconv

The iconv command converts the encoding of characters:

$ iconv -f encoding1 -t encoding2 input.txt >output.txt

The encoding name may be normalized internally to achieve cross-platform compatibility by removing every "-" and converting all characters to lower case. It is also referred to as the character code set. The supported encodings can be checked with the "iconv -l" command.

Since this iconv command is provided as a part of the libc6 package, it is always available.
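As a small worked example of the command form shown above, this sketch creates a one-line ISO-8859-1 file, converts it to UTF-8, and dumps the resulting bytes. The file names, the temporary directory, and the sample word are made up for illustration.

```shell
# Work in a scratch directory so nothing in the current directory is touched.
tmpdir=$(mktemp -d)
# "caf" followed by byte 0xE9 (octal 351), which is e-acute in ISO-8859-1.
printf 'caf\351\n' > "$tmpdir/latin1.txt"
# Convert from ISO-8859-1 to UTF-8.
iconv -f ISO-8859-1 -t UTF-8 "$tmpdir/latin1.txt" > "$tmpdir/utf8.txt"
# Show the converted bytes as hexadecimal without offsets or spaces.
hex=$(od -An -tx1 "$tmpdir/utf8.txt" | tr -d ' \n')
echo "$hex"
rm -rf "$tmpdir"
```

The single byte 0xE9 becomes the two-byte sequence 0xC3 0xA9 while the ASCII letters and the line feed stay unchanged.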

The recode command may be used too and offers more than the combined functionality of the iconv, fromdos, todos, frommac, and tomac commands. For more, see the pertinent description in "info recode".

(!) Please note that most encoding systems share the same codes with ASCII for the 7 bit characters, but there are some exceptions. If you are converting old Japanese C programs and URL data from the casually-called shift-JIS encoding format to UTF-8 format, use "CP932" as the encoding name instead of "shift-JIS" to get the expected results: 0x5C -> "\" and 0x7E -> "~". Otherwise, these are converted to wrong characters.
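The note above can be tried out with the two bytes in question. This is only a sketch: the variable names are mine, and the second conversion is included purely for comparison since its result depends on the iconv implementation in use.

```shell
# Feed the two problematic bytes, 0x5C (octal 134) and 0x7E (octal 176).
cp932_out=$(printf '\134\176' | iconv -f CP932 -t UTF-8)
echo "CP932:     $cp932_out"
# For comparison only; the result depends on the iconv implementation.
sjis_out=$(printf '\134\176' | iconv -f SHIFT_JIS -t UTF-8 2>/dev/null || true)
echo "SHIFT_JIS: $sjis_out"
```

With "CP932" the backslash and the tilde survive, which is what C source code and URLs need.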

To convert file names with iconv

Here is an example script to convert the encoding of file names created under an older OS to the modern UTF-8 encoding for the simple case.

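ENCDN=iso-8859-1
for x in *; do
  # Convert the name; skip names that iconv cannot convert.
  y=$(printf '%s' "$x" | iconv -f "$ENCDN" -t utf-8) || continue
  # Quote both names so spaces survive; skip names that do not change.
  [ "$x" = "$y" ] || mv -- "$x" "$y"
done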

The "$ENCDN" variable should be set to one of the encoding values in @{@listofencodingvauesandtheirusage@}@ .

For more complicated cases, please mount the disk drive containing such file names with the proper encoding as the mount(8) option and copy the entire disk to another disk drive mounted as UTF-8 with the "cp -a" command.

EOL conversion

The text file format, specifically the end-of-line (EOL) code, is dependent on the platform:

List of EOL conversion tools.

|| platform || EOL code || EOL control sequence || EOL ASCII value ||
|| Debian (unix) || LF || ^J || 10 ||
|| MSDOS and Windows || CR-LF || ^M^J || 13, 10 ||
|| Apple's Macintosh || CR || ^M || 13 ||

The EOL format conversion programs, fromdos(1), todos(1), frommac(1), and tomac(1), are quite handy. The recode command is also useful.

{i} The use of "sed -e '/\r$/!s/$/\r/'" instead of "todos" is better when you want to unify the EOL style to the MSDOS style from the mixed MSDOS and Unix style. (e.g., after merging 2 MSDOS style files with diff3.) This is because "todos" adds CR to all lines.
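The sed expression in the tip above can be checked on a small mixed-style sample. This sketch assumes GNU sed (for the "\r" escape); the temporary file and the sample lines are invented for the example.

```shell
# Build a sample with one MSDOS-style line and one Unix-style line.
tmpfile=$(mktemp)
printf 'dos line\r\nunix line\n' > "$tmpfile"
# Append CR only to lines that do not already end with CR (GNU sed).
unified_hex=$(sed -e '/\r$/!s/$/\r/' "$tmpfile" | od -An -tx1 | tr -d ' \n')
echo "$unified_hex"
rm -f "$tmpfile"
```

In the hexadecimal dump, both lines end with "0d0a" and no line ends with a doubled "0d0d0a".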

(!) Some data on the Debian system, such as the wiki page data for the python-moinmoin, use MSDOS style CR-LF as the EOL code. So the above rule is just a general rule.

(!) Most editors (e.g., vim, emacs, gedit, ...) can handle files in MSDOS style EOL transparently.

TAB conversion

You can expand the tab codes in the text to multiple spaces in vim using the ":retab" command.

There are a few popular specialized programs to convert the tab codes:

List of TAB conversion commands from bsdmainutils and coreutils packages.

|| function || bsdmainutils || coreutils ||
|| expand tab to spaces || "col -x" || expand ||
|| unexpand tab from spaces || "col -h" || unexpand ||

The indent(1) from the indent package completely reformats whitespace in C programs.
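For the coreutils commands listed above, a minimal sketch looks like this. The sample strings and the 4-column tab stop are arbitrary choices for illustration.

```shell
# Expand the tab to spaces using tab stops every 4 columns.
expanded=$(printf 'a\tb\n' | expand -t 4)
printf '%s\n' "$expanded"
# Convert 8 leading spaces back into one tab (default 8-column tab stops).
unexpanded=$(printf '        indented\n' | unexpand)
printf '%s\n' "$unexpanded" | od -An -c
```

By default, unexpand only touches leading whitespace, so spaces inside a line are left alone.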

Editors with auto-conversion

Intelligent modern editors such as the vim program are quite smart and cope well with any encoding system and any file format. You should use these editors under the UTF-8 locale in a UTF-8 capable console for the best compatibility.

An old western European Unix text file, "u-file.txt", stored in the latin1 encoding can be edited simply with vim as:

$ vim u-file.txt

This is possible since the auto detection mechanism of the file encoding in vim assumes the UTF-8 encoding first and, if it fails, assumes it to be latin1.

An old Polish Unix text file, "pu-file.txt", stored in the latin2 encoding can be edited with vim as:

$ vim '+e ++enc=latin2 pu-file.txt'

An old Japanese unix text file, "ju-file.txt", stored in the eucJP encoding can be edited with vim as:

$ vim '+e ++enc=eucJP ju-file.txt'

An old Japanese MS-Windows text file, "jw-file.txt", stored in the so called shift-JIS encoding (more precisely: CP932) can be edited with vim as:

$ vim '+e ++enc=CP932 ++ff=dos jw-file.txt'

When a file is opened with the "++enc" and "++ff" options, ":w" in the Vim command line stores it in the original format and overwrites the original file. You can also specify the saving format and the file name in the Vim command line, e.g., ":w ++enc=utf8 new.txt".

Please refer to the mbyte.txt "multi-byte text support" in vim on-line help.

The emacs family of programs can perform the equivalent functions.

Plain text extraction

The following reads a web page into a text file. This is very useful when copying configurations off the Web or applying basic Unix text tools such as grep to the web page.

$ lynx -dump http://www.remote-site.com/help-info.html >textfile

Similarly, you can extract plain text data from other formats using the following tools:

List of tools to extract plain text data.

|| package || popcon || size || keyword || function ||
|| html2text || 4926 || - || html->text || An advanced HTML to text converter. (Better than "lynx -dump") ||
|| w3m || 6313 || - || html->text || An HTML to text converter with the "w3m -dump" command. ||
|| lynx || 4662 || - || html->text || An HTML to text converter with the "lynx -dump" command. ||
|| elinks || 1343 || - || html->text || An HTML to text converter with the "elinks -dump" command. ||
|| links || 1148 || - || html->text || An HTML to text converter with the "links -dump" command. ||
|| links2 || 598 || - || html->text || An HTML to text converter with the "links2 -dump" command. ||
|| antiword || 477 || - || MSWord->text,ps || This converts MSWord files to plain text or ps. ||
|| catdoc || 333 || - || MSWord->text,TeX || This converts MSWord files to plain text or TeX. ||
|| pstotext || 199 || - || ps/pdf->text || Extract text from PostScript and PDF files. ||
|| unhtml || 39 || - || html->text || Remove the markup tags from an HTML file. ||
|| odt2txt || 33 || - || odt->text || The converter from OpenDocument Text to text. ||
|| wpd2sxw || 33 || - || WordPerfect->sxw || WordPerfect to OpenOffice.org/StarOffice writer document converter. ||

Highlighting and formatting plain text data

List of tools to highlight plain text data.

|| package || popcon || size || keyword || function ||
|| vim-runtime || 1849 || - || highlight || Vim can convert source code to HTML with :source $VIMRUNTIME/syntax/2html.vim (vim MACRO) ||
|| cxref || 163 || - || c->html || The converter for the C program to latex and HTML. (C) ||
|| src2tex || 66 || - || highlight || Converts many source codes to TeX. (C) ||
|| source-highlight || 56 || - || highlight || Converts many source codes to HTML, XHTML, LaTeX, Texinfo, ANSI color escape sequences and DocBook files with highlight. (C++) ||
|| highlight || 47 || - || highlight || Converts many source codes to HTML, XHTML, RTF, LaTeX, TeX or XSL-FO files with highlight. (C++) ||
|| grc || 30 || - || text->color || The generic colouriser for everything. (Python) ||
|| txt2html || 88 || - || text->html || Text to HTML converter. (Perl) ||
|| markdown || 74 || - || text->html || The converter from text to (X)HTML. (Perl) ||
|| asciidoc || 67 || - || text->any || A text document formatter to XML. (Python) ||
|| txt2tags || 60 || - || text->any || The document conversion from text to HTML, SGML, LaTeX, man page, MoinMoin, Magic Point and PageMaker. (Python) ||
|| udo || 17 || - || text->any || Universal document - text processing utility. (C) ||
|| stx2any || 16 || - || text->any || The document converter from structured plain text to other formats. (m4) ||
|| rest2web || 16 || - || text->html || The document converter from ReStructured Text to html. (Python) ||
|| aft || 16 || - || text->any || The "free form" document preparation system. (Perl) ||
|| yodl || 16 || - || text->any || A pre-document language and tools to process it. (C) ||
|| sdf || 12 || - || text->any || The simple document parser. (Perl) ||
|| sisu || 11 || - || text->any || The document structuring, publishing and search framework. (Ruby) ||

XML data

[http://en.wikipedia.org/wiki/XML The Extensible Markup Language (XML)] is a markup language for documents containing structured information.

[http://xml.com/ XML.COM] has good introductory information:

Basic hints for XML

XML text looks somewhat like HTML. It enables us to manage multiple formats of output for a document. One easy XML system is docbook-xsl, which is used here.

Each XML file starts with the standard XML declaration:

<?xml version="1.0" encoding="UTF-8"?>

The basic syntax for one XML element is marked up as:

<name attribute="value">content</name>

An XML element with empty content is marked up in the short form as:

<name attribute="value"/>

The attribute="value" part in the above examples is optional.

The comment section in XML is marked up as:

<!-- comment -->

Other than adding markups, XML requires minor conversion of the content using predefined entities for the following characters:

List of predefined entities for XML.

|| predefined entity || character to be converted from ||
|| &quot; || " : quote ||
|| &apos; || ' : apostrophe ||
|| &lt; || < : less-than ||
|| &gt; || > : greater-than ||
|| &amp; || & : ampersand ||

<!> "<" or "&" cannot be used literally in attribute values or element content.

(!) When SGML style user defined entities, e.g. "&some-tag;", are used, the first definition wins over others. The entity definition is expressed as "<!ENTITY some-tag "entity value">".

(!) As long as the XML markup is done consistently with a certain set of tag names (with the data placed either as content or as attribute values), conversion to another XML format is a trivial task using XSLT.
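The conversion of content with the predefined entities can be sketched with sed. The function name "xml_escape" and the sample string are hypothetical; the only real point is that "&" has to be converted first.

```shell
# Replace the five characters that have predefined XML entities.
# "&" is handled first; otherwise the "&" of the entities produced
# by the later rules would be escaped a second time.
xml_escape() {
  printf '%s' "$1" | sed \
    -e 's/&/\&amp;/g' \
    -e 's/</\&lt;/g' \
    -e 's/>/\&gt;/g' \
    -e 's/"/\&quot;/g' \
    -e "s/'/\&apos;/g"
}
escaped=$(xml_escape "a < b && \"c\" > 'd'")
printf '%s\n' "$escaped"
```

The escaped string can then be placed safely as element content or as an attribute value.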

XML processing

There are many tools available to process XML files such as [http://en.wikipedia.org/wiki/Extensible_Stylesheet_Language the Extensible Stylesheet Language (XSL)].

Basically, once you create a well-formed XML file, you can convert it to any format using Extensible Stylesheet Language for Transformation (XSLT).
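As a minimal sketch of such a conversion, the following creates a tiny XML file and a stylesheet that turns it into plain text, then runs xsltproc if it is installed. The element names ("note", "title", "body"), the file names, and the fallback message are all invented for this example and have nothing to do with DocBook.

```shell
workdir=$(mktemp -d)
# A small well-formed XML document (hypothetical element names).
cat > "$workdir/note.xml" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<note lang="en"><title>Hello</title><body>XML &amp; XSLT</body></note>
EOF
# A stylesheet that writes "title: body" as plain text.
cat > "$workdir/note2txt.xsl" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:template match="/note">
    <xsl:value-of select="title"/>
    <xsl:text>: </xsl:text>
    <xsl:value-of select="body"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
EOF
# Run the transformation only when the xsltproc package is installed.
if command -v xsltproc >/dev/null 2>&1; then
  result=$(xsltproc "$workdir/note2txt.xsl" "$workdir/note.xml")
else
  result="xsltproc not installed"
fi
echo "$result"
rm -rf "$workdir"
```

Note that the "&amp;" entity in the source comes out as a plain "&" in the text output.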

Although the Extensible Stylesheet Language for Formatting Object (XSL-FO) is supposed to be the solution for formatting, the FOP program is not in Debian main (yet?). So LaTeX code is usually generated from XML using XSLT, and the LaTeX system is used to create printable files such as DVI, PostScript, and PDF.

List of XML tools.

|| package || popcon || size || keyword || function ||
|| docbook-xml || 17472 || - || xml || This package contains the XML document type definition (DTD) for DocBook. ||
|| xsltproc || 3804 || - || xslt || XSLT command line processor. (XML-> XML, HTML, plain text, etc.) ||
|| docbook-xsl || 422 || - || xml/xslt || This contains XSL stylesheets for processing DocBook XML to various output formats with XSLT. ||
|| xmlto || 245 || - || xml/xslt || XML-to-any converter with XSLT. ||
|| dblatex || 29 || - || xml/xslt || This converts Docbook files to DVI, PostScript, PDF documents with XSLT. ||

Since XML is a subset of [http://en.wikipedia.org/wiki/SGML Standard Generalized Markup Language (SGML)], it can be processed by the extensive tools available for SGML, such as [http://en.wikipedia.org/wiki/Document_Style_Semantics_and_Specification_Language Document Style Semantics and Specification Language (DSSSL)].

List of DSSSL tools.

|| package || popcon || size || keyword || function ||
|| openjade || 585 || - || dsssl || Implementation of the DSSSL language based on James Clark's Jade software. ||
|| jade || 531 || - || dsssl || James Clark's DSSSL language. ||
|| docbook-dsssl || 821 || - || xml/dsssl || This contains DSSSL stylesheets for processing DocBook XML to various output formats with DSSSL. ||
|| docbook-utils || 275 || - || xml/dsssl || The utilities for Docbook files including conversion to other formats (HTML, RTF, PS, man, PDF) with docbook2* commands with DSSSL. ||
|| sgml2x || 23 || - || SGML/dsssl || The converter from SGML and XML using DSSSL stylesheets. ||

The XML data extraction

You can extract HTML or XML data from other formats using the following tools:

List of XML data extraction tools.

|| package || popcon || size || keyword || function ||
|| wv || 589 || - || MSWord->any || The document converter from Microsoft Word to HTML, LaTeX, etc. ||
|| texi2html || 555 || - || texi->html || The converter from Texinfo to HTML. ||
|| man2html || 375 || - || manpage->html || The converter from manpage to HTML. (CGI support) ||
|| tex4ht || 217 || - || tex<->html || The converter between (La)TeX and HTML. ||
|| xlhtml || 202 || - || MSExcel->html || The converter from MSExcel .xls to HTML. ||
|| ppthtml || 182 || - || MSPowerPoint->html || The converter from MSPowerPoint to HTML. ||
|| unrtf || 167 || - || rtf->html || The document converter from RTF to HTML, etc. ||
|| info2www || 127 || - || info->html || The converter from GNU info to HTML. (CGI support) ||
|| ooo2dbk || 35 || - || sxw->xml || The converter from OpenOffice.org SXW documents to DocBook XML. ||
|| wp2x || 19 || - || WordPerfect->any || WordPerfect 5.0 and 5.1 files to TeX, LaTeX, troff, GML and HTML. ||
|| doclifter || 13 || - || troff->xml || The converter from troff to DocBook XML. ||

For non-XML HTML files, you can convert them to XHTML, which is an instance of well-formed XML and can be processed by XML tools.

List of XML pretty print tools.

|| package || popcon || size || keyword || function ||
|| libxml2-utils || 3673 || - || xml<->html<->xhtml || The command line XML tool with "xmllint" command. (syntax check, reformat, lint, ...) ||
|| tidy || 1962 || - || xml<->html<->xhtml || HTML syntax checker and reformatter. ||

Once proper XML is generated, you can use XSLT technology to extract data based on the mark-up context etc.

Printable data

The Ghostscript

The core of printable data manipulation is the Ghostscript PostScript interpreter. CUPS uses Ghostscript as its backend.

The latest upstream Ghostscript from Artifex was re-licensed from AFPL to GPL and, as of the 8.60 release, merged all the latest ESP version changes, such as the CUPS related ones, into a unified release.

List of Ghostscript PostScript interpreters.

|| package || popcon || size || keyword || description ||
|| ghostscript || - || - || ps, pdf || GPL unified version - lenny: recommended ||
|| gs-esp || 11480 || - || ps, pdf || GPL ESP version - etch: recommended for use with CUPS ||

Merge two PS or PDF files

You can merge two PS or PDF files using the gs(1) command of the Ghostscript.

$ gs -q -dNOPAUSE -dBATCH -sDEVICE=pswrite -sOutputFile=bla.ps -f foo1.ps foo2.ps
$ gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile=bla.pdf -f foo1.pdf foo2.pdf

(!) The [http://en.wikipedia.org/wiki/Portable_Document_Format Portable Document Format (PDF)], which is a widely used cross-platform printable data format, is essentially a compressed PS format with a few additional features and extensions.

Printable data utilities

The following packages for the printable data utilities caught my eye:

List of printable data utilities.

|| package || popcon || size || keyword || function ||
|| poppler-utils || 3324 || - || pdf->ps,text,... || PDF utilities. (pdftops, pdfinfo, pdfimages, pdftotext, and pdffonts) ||
|| psutils || 2950 || - || ps->ps || PostScript document conversion tools ||
|| poster || 2656 || - || ps->ps || Create large posters out of PostScript pages. ||
|| xpdf-utils || 2210 || - || pdf->ps,text,... || PDF utilities. (pdftops, pdfinfo, pdfimages, pdftotext, and pdffonts) ||
|| enscript || 2732(2007/12) || - || text->ps, html, rtf || Converts ASCII text to PostScript, HTML, RTF or Pretty-Print. ||
|| a2ps || 905 || - || text->ps || 'Anything to PostScript' converter and pretty-printer. ||
|| pdftk || 449 || - || pdf->pdf || PDF document conversion tool: (pdftk) ||
|| mpage || 350 || - || text,ps->ps || Print multiple pages per sheet. ||
|| html2ps || 317 || - || html->ps || The converter from HTML to PostScript. ||
|| pdfjam || 260 || - || pdf->pdf || PDF document conversion tools: pdf90, pdfjoin, and pdfnup ||
|| gnuhtml2latex || 191 || - || html->latex || The converter from html to latex. ||
|| latex2rtf || 131 || - || latex->rtf || This converts documents from LaTeX to RTF which can be read by MS Word. ||
|| ps2eps || 92 || - || ps->eps || The converter from PostScript to EPS (Encapsulated PostScript). ||
|| e2ps || 42 || - || text->ps || Text to PostScript converter with Japanese encoding support. ||
|| impose+ || 35 || - || ps->ps || PostScript utilities. ||
|| trueprint || - || - || text->ps || Pretty-prints many source codes (C, C++, Java, Pascal, Perl, Pike, Sh, and Verilog) to PostScript. (C) ||

Printing with CUPS

Both the lp and lpr commands offered by [http://en.wikipedia.org/wiki/Common_Unix_Printing_System Common Unix Printing System (CUPS)] provide options for customized printing of the printable data.

For printing 3 copies of a file collated:

$ lp -n 3 -o Collate=True filename

or:

$ lpr -#3 -o Collate=True filename

You can further customize printer operation by using printer options such as "-o number-up=2", "-o page-set=even", "-o page-set=odd", "-o scaling=200", "-o natural-scaling=200", etc., documented at [http://localhost:631/help/options.html Command-Line Printing and Options].

Type setting

The Unix [http://en.wikipedia.org/wiki/Troff troff] originally developed by AT&T can be used for simple type setting. It is usually used to create manpages.

[http://en.wikipedia.org/wiki/TeX TeX], created by Donald Knuth, is a very powerful typesetting tool and is the de facto standard. [http://en.wikipedia.org/wiki/LaTeX LaTeX], originally written by Leslie Lamport, enables high-level access to the power of TeX.

List of type setting tools.

|| package || popcon || size || keyword || function ||
|| texlive-base || 1074 || - || (La)TeX || TeX system for typesetting, previewing and printing. ||
|| groff || 840 || - || troff || GNU troff text-formatting system. ||

roff typesetting

Traditionally, roff is the main Unix text processing system.

See roff(7), groff(7), groff(1), grotty(1), troff(1), groff_mdoc(7), groff_man(7), groff_ms(7), groff_me(7), groff_mm(7), and info groff.

A good tutorial on -me macros exists. If you have groff (1.18 or newer), find /usr/share/doc/groff/meintro.me.gz and do the following:

$ zcat /usr/share/doc/groff/meintro.me.gz | \
     groff -Tascii -me - | less -R

The following will make a completely plain text file:

$ zcat /usr/share/doc/groff/meintro.me.gz | \
    GROFF_NO_SGR=1 groff -Tascii -me - | col -b -x > meintro.txt

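For printing, use PostScript output generated from the -me source (not from the plain text file made above).

$ zcat /usr/share/doc/groff/meintro.me.gz | groff -Tps -me - | lpr
$ zcat /usr/share/doc/groff/meintro.me.gz | groff -Tps -me - | mpage -2 | lpr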

TeX/LaTeX

Preparation:

# aptitude install texlive

References for LaTeX:

  • The teTeX HOWTO: The Linux-teTeX Local Guide (http://www.tldp.org/HOWTO/TeTeX-HOWTO.html)

  • tex(1)

  • latex(1)

  • "The TeXbook", by Donald E. Knuth, (Addison-Wesley)

  • LaTeX - A Document Preparation System, by Leslie Lamport, (Addison-Wesley)

  • The LaTeX Companion, by Goossens, Mittelbach, Samarin, (Addison-Wesley)

This is the most powerful typesetting environment. Many SGML processors use this as their back end text processor. Lyx, provided by the lyx, lyx-xforms, or lyx-qt package, and GNU TeXmacs, provided by the texmacs package, offer a nice WYSIWYG editing environment for LaTeX, while many use Emacs and Vim as the source editor.

There are many online resources available:

When documents become bigger, sometimes TeX may cause errors. You must increase pool size in /etc/texmf/texmf.cnf (or more appropriately edit /etc/texmf/texmf.d/95NonPath and run update-texmf) to fix this.

(!) The TeX source of "The TeXbook" is available at ftp://ftp.dante.de/pub/tex/systems/knuth/tex/texbook.tex . This file contains most of the required macros. I heard that you can process this document with tex after commenting out lines 7 to 10 and adding "\input manmac \proofmodefalse". It's strongly recommended to buy this book (and all other books from Donald E. Knuth) instead of using the online version, but the source is a great example of TeX input!

Pretty print a manual page

The following will print a manual page into a PostScript file/printer.

$ man -Tps some_manpage | lpr
$ man -Tps some_manpage | mpage -2 | lpr

Creating a manual page

Although writing a manpage in plain troff is possible, there are a few helper packages to create the manpage.

List of packages to help creating the manpage.

|| package || popcon || size || keyword || function ||
|| docbook-to-man || 436 || - || SGML->manpage || The converter from DocBook SGML into roff man macros. ||
|| help2man || 104 || - || text->manpage || Automatic manpage generator from --help. ||
|| info2man || 41 || - || info->manpage || The converter from GNU info to POD or man pages. ||
|| txt2man || 35 || - || text->manpage || Converts flat ASCII text to man page format. ||

The mail data conversion

The following packages for the mail data conversion caught my eye:

List of packages to help mail data conversion.

|| package || popcon || size || keyword || function ||
|| sharutils || 5059 || - || mail || shar, unshar, uuencode, uudecode ||
|| mpack || 4177 || - || mail || The encoder and decoder of MIME messages: mpack and munpack. ||
|| tnef || 277 || - || mail || Unpacks MIME attachments of type "application/ms-tnef", which is a Microsoft only format. ||
|| uudeview || 246 || - || mail || The encoder and decoder for the following formats: uuencode, xxencode, BASE64, quoted printable, and BinHex ||
|| mimedecode || 146 || - || mail || This decodes transfer encoded text type mime messages. ||
|| readpst || 33 || - || windows/mail || This converts Outlook PST files to mbox format. ||

{i} The [http://en.wikipedia.org/wiki/Internet_Message_Access_Protocol Internet Message Access Protocol] version 4 (IMAP4) server (see: @{@popdimapeserver@}@) may be used to move mails out from the proprietary mail system if the mail client software can be configured to use IMAP4 server too.

Mail data basics

Mail (SMTP) data should be limited to 7 bit. So binary data and 8 bit text data are encoded into 7 bit format with the [http://en.wikipedia.org/wiki/MIME Multipurpose Internet Mail Extensions (MIME)] and the selection of the charset (see: @{@basicsofencoding@}@).

The standard mail storage format is mbox formatted according to [http://tools.ietf.org/html/rfc2822 RFC822], [http://tools.ietf.org/html/rfc2822 RFC2822]. See man 5 mbox (provided by the mutt package).

For European languages, "Content-Transfer-Encoding: quoted-printable" with the ISO-8859-1 charset is usually used since there are not many 8 bit characters. If the text is in UTF-8, "Content-Transfer-Encoding: quoted-printable" is also used since it is mostly 7 bit data.

For Japanese, traditionally "Content-Type: text/plain; charset=ISO-2022-JP" should be used to keep text in 7 bits. But mails from older Microsoft systems may use Shift-JIS without proper declaration. For Japanese, if the text is in UTF-8, it contains many 8 bit bytes and is encoded into 7 bit data by [http://en.wikipedia.org/wiki/Base64 Base64]. The situation of other Asian languages is similar.
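To see what the Base64 step does to such 8 bit data, here is a minimal sketch using the base64 command from coreutils. The sample is the three UTF-8 bytes of a single hiragana letter; the variable names are mine.

```shell
# UTF-8 bytes of the hiragana letter "a" (U+3042); all three are 8 bit bytes.
encoded=$(printf '\343\201\202' | base64)
echo "$encoded"
# Decoding restores the original bytes, shown here in hexadecimal.
decoded_hex=$(printf '%s' "$encoded" | base64 --decode | od -An -tx1 | tr -d ' \n')
echo "$decoded_hex"
```

The encoded form consists only of 7 bit ASCII characters, so it can pass through mail transport unchanged.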

(!) If your non-Unix mail data is accessible by a non-Debian client software which can talk to the IMAP4 server, you may be able to move them out by running your own IMAP4 server (see: @{@popdimapeserver@}@).

(!) If you use other mail storage formats, moving them to mbox format is a good first step. A versatile client program such as mutt may be handy for this.

You can split mailbox contents to each message using procmail(1) and formail(1).

Each mail message can be unpacked using the munpack(1) command from the mpack package (or other specialized tools) to obtain the MIME encoded contents.

Graphic data tools

The following packages for the graphic data conversion, editing, and organization tools caught my eye:

List of graphic data tools.

|| package || popcon || size || keyword || function ||
|| gimp || 8507 || - || image(bitmap) || The GNU Image Manipulation Program. ||
|| imagemagick || 5479 || - || image(bitmap) || Image manipulation programs. ||
|| graphicsmagick || 244 || - || image(bitmap) || Image manipulation programs. (fork of imagemagick) ||
|| xsane || 4757 || - || image(bitmap) || GTK+-based X11 frontend for SANE (Scanner Access Now Easy). ||
|| netpbm || 2446 || - || image(bitmap) || Graphics conversion tools. ||
|| icoutils || - || - || png<->ico(bitmap) || Converts MS Windows icons and cursors to and from PNG formats ||
|| xpm2wico || - || - || xpm->ico(bitmap) || Converts XPM to MS Windows icon formats ||
|| openoffice.org-draw || - || - || image(vector) || OpenOffice.org office suite - drawing ||
|| inkscape || 1747 || - || image(vector) || The SVG (Scalable Vector Graphics) editor. ||
|| dia-gnome || 890 || - || image(vector) || Diagram editor (Gnome) ||
|| dia || 732 || - || image(vector) || Diagram editor (Gtk) ||
|| xfig || - || - || image(vector) || Facility for Interactive Generation of figures under X11 ||
|| pstoedit || 652 || - || ps/pdf->image(vector) || PostScript and PDF files to editable vector graphics converter. (SVG) ||
|| libwmf-bin || 570 || - || Windows/image(vector) || Windows metafile (vector graphic data) conversion tools. ||
|| fig2sxd || - || - || fig->sxd(vector) || Convert XFig files to OpenOffice.org Draw format ||
|| unpaper || 88 || - || image->image || Post-processing tool for scanned pages for OCR. ||
|| tesseract-ocr || 73 || - || image->text || Free OCR software based on HP's commercial OCR engine. ||
|| tesseract-ocr-eng || - || - || image->text || OCR engine data: tesseract-ocr language files for English text. ||
|| clara || 83 || - || image->text || Free OCR software. ||
|| gocr || 871 || - || image->text || Free OCR software. ||
|| gocr-gtk || 41 || - || image->text || Free OCR software. GTK-GUI. ||
|| ocrad || 501 || - || image->text || Free OCR software. ||
|| gtkam || - || - || image(Exif) || Manipulates digital camera photo files (GNOME) - GUI ||
|| gphoto2 || - || - || image(Exif) || Manipulates digital camera photo files (GNOME) - command line ||
|| kamera || - || - || image(Exif) || Manipulates digital camera photo files (KDE) ||
|| jhead || - || - || image(Exif) || Manipulates the non-image part of Exif compliant JPEG (digital camera photo) files ||
|| exif || - || - || image(Exif) || Command-line utility to show EXIF information in JPEG files ||
|| exiftags || - || - || image(Exif) || Utility to read Exif tags from a digital camera JPEG file ||
|| exiftran || - || - || image(Exif) || Transforms digital camera jpeg images ||
|| exifprobe || - || - || image(Exif) || Reads metadata from digital pictures ||
|| dcraw || - || - || image(Raw)->ppm || Decodes raw digital camera images ||
|| findimagedupes || - || - || image->fingerprint || Finds visually similar or duplicate images ||
|| ale || - || - || image->image || Merges images to increase fidelity or create mosaics ||
|| imageindex || - || - || image(Exif)->html || Generates static HTML galleries from images ||
|| bins || - || - || image(Exif)->html || Generates static HTML photo albums using XML and EXIF tags ||
|| galrey || - || - || image(Exif)->html || Generates browsable HTML photo albums with thumbnails ||
|| stegdetect || - || - || jpeg || Detects and extracts [http://en.wikipedia.org/wiki/Steganography steganography] messages inside JPEG ||
|| outguess || - || - || jpeg,png || Universal Steganographic tool ||

{i} Search more image tools using @{@theaptituderegexformula@}@: "~Gworks-with::image".

Although GUI programs such as gimp are very powerful, command line tools such as imagemagick are quite useful for automating image manipulation with scripts.

The de facto image file format of the digital camera is the [http://en.wikipedia.org/wiki/Exchangeable_image_file_format Exchangeable Image File Format] (EXIF) which is the JPEG image file format with additional metadata tags. It can hold information such as date, time, and camera settings.

[http://en.wikipedia.org/wiki/Lempel-Ziv-Welch The Lempel-Ziv-Welch (LZW) lossless data compression] patent has expired. [http://en.wikipedia.org/wiki/Graphics_Interchange_Format The Graphics Interchange Format (GIF)] utilities which use the LZW compression method are now freely available on the Debian system.

{i} Any digital camera or scanner with removable recording media will work with Linux through [http://en.wikipedia.org/wiki/USB_mass_storage_device_class USB Mass Storage] readers.

Miscellaneous data conversion

There are many other programs for converting data. The following packages caught my eye ("~Guse::converting" in aptitude):

List of miscellaneous data conversion tools.

|| package || popcon || size || keyword || function ||
|| alien || 1775 || - || rpm/tgz->deb || The converter for the foreign package into the Debian package. ||
|| freepwing || 6 || - || EB->EPWING || The converter from "Electric Book" (popular in Japan) to a single JIS X 4081 format (a subset of the EPWING V1). ||

You can also extract data from RPM format with:

$ rpm2cpio file.src.rpm | cpio --extract