Subject: automation cleanup of source LDP tree (informational)
From: "Martin A. Brown" ####@####.####
Date: 27 Jan 2016 01:11:23 +0000
Message-Id: <alpine.LSU.2.11.1601261656560.2025@znpeba.jbaqresebt.arg>

Hello,

>  1. automation: Be able to (re-)process and (re-)publish all of
>     our existing documentation in an automated fashion.

This is a description of work I have already accomplished and 
committed to my own git repository.

Automation cleanup (source):
----------------------------
Many of the documents at git HEAD [0] in our main LDP/howto tree
sport validation errors when processed with toolchains running on
modern Linux releases (i.e., OpenSUSE-13.2 and Ubuntu-14.04.3).  I
have a (local) git repository with hundreds of corrections to
source files in all formats (Linuxdoc, DocBook SGML and DocBook
XML).

I would characterize these corrections as non-editorial--i.e. they 
are technical only, to allow each document to validate and to allow 
the processor to generate outputs.

The only substantive change I have made in the cleanups is to move 
any <graphic/>, <mediaobject/>, or <inlinemediaobject/> images into 
an ./images/ directory, which is copied to the HTML (output) tree.  
Otherwise, images are not visible in the output.  Not desirable.
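
For illustration only, the reference side of that move could be sketched
in Python as below. This is a hypothetical helper, not the script behind
the commits; the function name, the regular expression and the default
directory name "images" are assumptions, and it only rewrites fileref
attributes (as found on <graphic/> and <imagedata/>) without moving any
files on disk.

```python
import posixpath
import re

# Matches fileref="..." or fileref='...' (assumed attribute spelling).
FILEREF = re.compile(r'''(fileref\s*=\s*)(["'])(.*?)\2''')


def point_filerefs_at_images(source, images_dir="images"):
    """Return source with every fileref pointing under images_dir.

    Only the base name of the original path is kept, so running the
    function twice gives the same result as running it once.
    """
    def repl(match):
        prefix, quote, path = match.groups()
        new_path = posixpath.join(images_dir, posixpath.basename(path))
        return f"{prefix}{quote}{new_path}{quote}"

    return FILEREF.sub(repl, source)
```

Keeping only the base name is a simplification; two images with the
same file name in different directories would collide and need manual
attention.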

My cleanup changes (about 200 commits) are at:

  https://github.com/martin-a-brown/LDP

Since I doubt anybody wants to read through the entire git log, 
here's a shorter description of the various classes of changes that 
I have made to the individual documents:

  * adding countless closing tags, such as </sect1>, </sect2>,
    </sect3>, </listitem>, </para>, </varlistentry>
  * switching to entities for reserved characters, e.g. & to &amp;,
    <> to &lt;&gt;, [] to &lsqb;&rsqb;, etc. (particularly where
    people had left email addresses in angle brackets)
  * renaming files containing XML from stem.sgml to stem.xml
  * character set encoding; using entities in ASCII, converting to
    Unicode with Byte Order Mark (BOM) where possible
  * corrections to many DOCTYPE definitions
  * "upgrading" DocBook versions when authors used elements or
    features from a newer DocBook standard (e.g. 3
  * substituting dash for underscore in the id attribute ([open]jade
    refuses _ in id=)
  * committing converted image files (e.g. eps) to the repo for
    documents (processors do not generate them on the fly; did they
    use to?)
  * adding XML/SGML comment closures -->, where accidentally
    omitted; removing stray '--' which was confusing SGML/XML
    processors
  * wrapping large blocks of <programlisting/> code with
    <![CDATA[]]>
  * replacing non-DocBook XML elements with DocBook equivalents,
    e.g. <xlink:href/> becomes <ulink/>; replacing HTML elements
    <a href=""> with <url url=""> in Linuxdoc documents
  * removing extra (and sometimes empty) tags which confused the
    processor

  * and, probably many other small errors that jade or xsltproc
    complained about...
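
Two of the more mechanical classes above (bare ampersands and
underscores in id attributes) lend themselves to a short sketch. The
Python below is illustrative only and is not the tooling used for the
commits; the function names and regular expressions are assumptions,
and the code is deliberately naive: it does not skip CDATA sections or
comments, so its output would still need review.

```python
import re

# An ampersand that does not already begin an entity reference
# (&name;) or a character reference (&#38; or &#x26;).
BARE_AMP = re.compile(
    r"&(?![A-Za-z][A-Za-z0-9.-]*;|#[0-9]+;|#[xX][0-9A-Fa-f]+;)"
)

# Quoted id= and linkend= values; linkend is included so that
# cross-references keep pointing at the renamed ids.
ID_ATTR = re.compile(r'''(\b(?:id|linkend)\s*=\s*)(["'])(.*?)\2''')


def escape_bare_ampersands(text):
    """Replace each bare '&' with '&amp;', leaving entities alone."""
    return BARE_AMP.sub("&amp;", text)


def dash_ids(text):
    """Replace '_' with '-' inside id and linkend attribute values."""
    def repl(match):
        prefix, quote, value = match.groups()
        return f"{prefix}{quote}{value.replace('_', '-')}{quote}"

    return ID_ATTR.sub(repl, text)
```

Other attributes that may carry an id (for example url fragments)
are not handled here and would have to be checked by hand.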

I will observe that the vast majority of these corrections were on
DocBook (both SGML and XML) files.

Several Linuxdoc files also required adding missing tags, correcting 
a few tag names and even fixing a few entities.  I guess that earlier 
SGML processors (or their operating configurations) were more 
forgiving of many of these errors.

This message covers only the cleanup needed in the source tree.

There is separate work for the cleanup of the output tree, lots of 
old documents that maybe should be archived, etc.

-Martin

 [0] https://github.com/martin-a-brown/LDP

-- 
Martin A. Brown
http://linux-ip.net/
