discuss: [tbm@cyrius.com: Contacting mirrors with old HOWTOs?]


Previous by date: 12 Jan 2003 15:05:47 -0000 Submit, dude.resin.csoft.net
Next by date: 12 Jan 2003 15:05:47 -0000 Anybody do Docbook XML on MS Windows?, David Horton
Previous in thread: 12 Jan 2003 15:05:47 -0000 Re: [tbm@cyrius.com: Contacting mirrors with old HOWTOs?], Chris Karakas
Next in thread: 12 Jan 2003 15:05:47 -0000 Re: [tbm@cyrius.com: Contacting mirrors with old HOWTOs?], David Lawyer

Subject: Re: [tbm@cyrius.com: Contacting mirrors with old HOWTOs?]
From: Alexander Bartolich ####@####.####
Date: 12 Jan 2003 15:05:47 -0000
Message-Id: <3E218393.3070405@gmx.at>

Chris Karakas wrote:
 > [..] Somebody please write a script "staleHT" (for stale HowTos)
 > for this.

I guess that means me.

 > I mean, something that would call Google with the aproppriate
 > HowTo name (or some string that identifies it uniquely)

The name of the document will not change with every version.
To identify different versions unambiguously we need a distinct
string for every version. My favourite tool for generating
the string is md5sum. It is standard, free, and available for
practically every platform.
A line of md5sum's output contains both hash and file name.
Together with a smart file name containing version number or
date this should solve the problem of identification.
A silly example:

270358b7773e290c91f35fb7175e2b46  silly-HOWTO-2003-01-12.html.tar.gz
979948da5a25c001b25d533dde6e9abd  silly-HOWTO-2003-01-12.tar.gz
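
A minimal sketch of how such a line could be produced from a
script, in Python. The function name md5sum_line is made up for
this mail; the output format (hash, two spaces, file name) is the
one md5sum prints in text mode.

```python
import hashlib
from pathlib import Path


def md5sum_line(path):
    """Return 'hash  name' for one file, like a line of md5sum output."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large archives need not fit into memory.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    # Only the base name is kept, so the line stays valid on any mirror.
    return f"{digest.hexdigest()}  {Path(path).name}"
```

Calling md5sum_line() on every archive of a release and writing
the lines to one file would give the MD5SUM file discussed below.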

I am not sure how this hash should be included in the document
text, though. Constructing a text containing its own md5sum
hash is non-trivial, to say the least. The alternative is to
define a forward reference like <a href="md5sum"> and
have the key in a separate file that can be added later on.

This will work with google and 'wget -m'. People copying and
extracting .html.tar.gz but not the extra MD5SUM file will not
have the key, however. I see two approaches:

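+ Use 'tar -r' to append the file md5sum to the archive
   ('tar -A' appends whole archives, not single files, and tar
   cannot update a compressed archive, so it has to be
   gunzipped first and gzipped again afterwards).
   This will change the hash of the resulting (second-generation)
   .tar.gz, however. Using the shipped md5sum to verify the
   archive will require removing the appended file beforehand.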

+ Wrap both .tar.gz and md5sum into another .tar.gz
   Compressing the second generation archive looks pointless,
   but gzip provides its own checksum to detect simple bit
   errors.
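
The second approach could be sketched like this in Python, using
the standard tarfile module. The function names and the default
file name MD5SUM are assumptions made for illustration, not an
agreed convention.

```python
import hashlib
import os
import tarfile


def wrap_with_checksum(archive_path, checksum_path, wrapper_path):
    """Store the inner archive and its checksum file in a new .tar.gz."""
    with tarfile.open(wrapper_path, "w:gz") as wrapper:
        for member in (archive_path, checksum_path):
            # Strip directories so the wrapper unpacks into two plain files.
            wrapper.add(member, arcname=os.path.basename(member))


def verify_wrapper(wrapper_path, checksum_name="MD5SUM"):
    """Check each 'hash  name' line against the member of that name."""
    with tarfile.open(wrapper_path, "r:gz") as wrapper:
        listing = wrapper.extractfile(checksum_name).read().decode("ascii")
        for line in listing.splitlines():
            if not line.strip():
                continue
            expected, name = line.split(None, 1)
            data = wrapper.extractfile(name.strip()).read()
            if hashlib.md5(data).hexdigest() != expected:
                return False
    return True
```

The inner .tar.gz stays byte-for-byte unchanged here, so the
shipped hash can be checked without undoing anything first.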

 > and country code,

Filtering FQDNs for geographic location is complex. There is

http://netgeo.caida.org/perl/netgeo.cgi

(accepts only IP addresses, do nslookup by hand!), but that
fails with large web hosting providers who allocate one huge
block of IP addresses but distribute their machines all over
the planet.
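
The "nslookup by hand" step at least is easy to script. A small
Python sketch, assuming the local resolver is good enough; the
function name is made up, and the geographic lookup itself is
deliberately left out.

```python
import socket


def resolve_ipv4(hostname):
    """Return one IPv4 address for hostname, as the local resolver sees it."""
    # An argument that already is an IPv4 address is returned unchanged.
    return socket.gethostbyname(hostname)
```

This does nothing about the hosting problem: the address only
says who owns the block, not where the machine stands.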

 > [...] get their Admin e-Mails from the Whois database

The next _big_ problem. It's exactly the private archives at
geocities or wherever that are likely to be outdated, and Whois
only names the admin of the hosting site, not the owner of the
page. I guess that decoding big.evil.site/~poor-chap into a
user name could actually work, though.
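
Decoding such a URL could look roughly like this in Python. The
function name is hypothetical, and whether the user name found
this way maps to a working mail address is a guess that still
needs checking by a human.

```python
from urllib.parse import urlsplit


def split_user_page(url):
    """Return (host, user) for URLs like host/~user/..., otherwise None."""
    # urlsplit only finds the host when a scheme is present.
    parts = urlsplit(url if "://" in url else "http://" + url)
    first_segment = parts.path.lstrip("/").split("/", 1)[0]
    if parts.hostname and first_segment.startswith("~") and len(first_segment) > 1:
        return parts.hostname, first_segment[1:]
    return None
```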

 > and feed a mail client with the text and the address.

Bugs in this script could earn you the title Spammer ...

 > I would love to incorporate "staleHT de" (for Germany) in
 > my crontab to be executed every 2 months or so ;-)

I guess that finding the person responsible and sending the
actual mail requires human intelligence. But with the hash
system we could at least gather statistics and make reasonable
statements about the size of the problem.
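
The statistics part could be as simple as the following Python
sketch. It assumes some earlier step has already collected, for
each mirror, the hash found there (or None if no hash was found);
that step, the function name and the three labels are all made up
for illustration.

```python
from collections import Counter


def summarize_mirrors(current_hash, found_hashes):
    """Count mirrors as 'current', 'stale' or 'unknown'.

    found_hashes maps a mirror URL to the hash found there, or to
    None when the mirror carries no hash at all.
    """
    counts = Counter()
    for found in found_hashes.values():
        if found is None:
            counts["unknown"] += 1
        elif found == current_hash:
            counts["current"] += 1
        else:
            counts["stale"] += 1
    return counts
```

No mail is sent anywhere in this, so a bug costs nothing worse
than a wrong number.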




  ©The Linux Documentation Project, 2014. Listserver maintained by dr Serge Victor on ibiblio.org servers. See current spam statz.