placeholder

isbnsearch Z39.50 server verification process

[ Next ] [ Back]

Preparation

(This entire process requires an internet connection.)

Summary

  1. cd isbnsearch/devel/scripts
  2. ./gateway.sh
  3. Download the latest DB.txt from Net::Z3950::AsyncZ on CPAN into the htdocs/ directory
  4. bash ./asyncz.sh
  5. Create a TEMPORARY database in MySQL, e.g. 'devel'. Do NOT use your isbnsearch database!
  6. Populate the temporary database using the SQL files provided:
    • mysql devel -p < ../sql/tables.sql
    • mysql devel -p < ../sql/gatewaynew.sql
    • mysql devel -p < ../sql/asyncz.sql
    • mysql devel -p < ../sql/servers.sql
  7. Configure these perl scripts with your database settings.
    1. duplicates.pl
    2. export-servers.pl
    3. serverfail.pl
    4. servers.pl
    5. verify.pl
  8. perl duplicates.pl

The following sections describe how each script operates.

Updating from the LoC Gateway

All scripts are intended to be run from the isbnsearch/devel/scripts directory.

The bash file gateway.sh automates the update from the LoC page. The majority of the output and work is using wget. The gateway page itself is downloaded first. The HTML is processed and compared against a version of the previous htdocs/gateway.html page, reduced to just compare new servers, not name changes. The major omission in the data at this stage is the database table name. The scritps/gateway.sh script now uses wget to connect to each new server using the information from the gateway.html page and requests the HTML query page for each server. This HTML contains a DBNAME hidden input field that contains the database name in the VALUE attribute. The databasename.pl perl script then processes the wget output to generate this SQL output:
insert into gateway set name = "alice.mainlib.brandeis.edu", port = 9929, z3950 = "MBB01";
Finally, gateway.sh moves gateway.html into htdocs/, overwriting the previous release copy. This will be the base of the next comparison.

Because of the process, the new servers should be usable as new listings - servers that fail to initialise Z39.50 via the LoC gateway will fail to generate a database name.


Updating from Net::Z3950::AsyncZ

As with the gateway servers, diff is too simplistic to find the new or updated servers in the AsyncZ list because a lot of changes are just name changes to the description field. AsyncZ has historically had a lot of duplicates, including some lines where the canonical name in one line resolves to the IP address listed in another line. If the port and z3950 names are also identical, this duplicate needs to be ignored by isbnsearch.

http://search.cpan.org/src/TURNERMM/Net-Z3950-AsyncZ-0.10/scripts/DB.txt
This process is not completely automated - the DB.txt file is accessible over the internet but the URL contains the version number of the release and this may need to be scripted to look for the newer releases, so the automation may be part of the next devel release.

For v0.5, isbnsearch-devel provides a single bash script to process the content once the DB.txt file is downloaded to the htdocs/ directory.
bash ./asyncz.sh
The script strips out the variable names and leaves just the vital server data.

There are also formatting errors in scripts/DB.txt of the AsyncZ module where fields are not correctly delimited with ; e.g.:
St.;Louis,;MO (INNOPAC) catalog.wustl.edu 210 INNOPAC
should be:
St. Louis, MO (INNOPAC) ;catalog.wustl.edu;210;INNOPAC

These errors will cause problems for verification and need to be ignored.

asyncz.sh calls the Perl script asynczcheck.pl to iterate through the asyncz.result file and check each canonical name for a valid DNS entry. The port number is checked to be in the range 0-100000. All entries are converted to lower case to catch capitalisation duplication. The IP address is then combined with the port number and database name to form the key to a hash. The canonical name is stored as the value. This prevents duplication of the same server. For records that do not contain a canonical name, the IP address is combined with the port number and database name and the hash is checked to see if this key value already exists. If not, the key is added with the IP address as value.

Layout for gateway and asyncz tables:

CREATE TABLE gateway (
  id int(10) unsigned NOT NULL auto_increment,
  info text,
  name varchar(100) default NULL,
  port varchar(10) default NULL,
  z3950 varchar(100) default NULL,
  error varchar(50) default '?',
  htmlid varchar(50) default NULL,
  interface varchar(50) default NULL,
  PRIMARY KEY  (id)
  ) TYPE=MyISAM;

CREATE TABLE asyncz (
  id int(10) unsigned NOT NULL auto_increment,
  info text,
  name varchar(100) default NULL,
  port varchar(10) default NULL,
  z3950 varchar(100) default NULL,
  error varchar(50) default '?',
  htmlid varchar(50) default NULL,
  interface varchar(50) default NULL,
  PRIMARY KEY  (id)
  ) TYPE=MyISAM;

Create these two tables (a SQL file is provided as sql/tables.sql) in a TEMPORARY MySQL database that can be dropped once the complete verification process is finished.

If these tables already exist from a previous verification, drop the tables and re-create them with the SQL in the package. The old data is neither required nor useful.

Layout for servers table - the devel.servers table is to contain the combined list of servers with all duplicates removed, including where the canonical name appears in the gateway and the corresponding IP in asyncz. This has the side-effect of eliminating most canonical names that do not resolve using DNS. Most servers that previously appeared as DEAD in isbnsearch-devel reports will now simply be ignored until the next run of duplicates.pl at the time of the next release. Previous error states are not transferred to the new table so that all servers are re-tested.

CREATE TABLE servers (
  id int(10) unsigned NOT NULL auto_increment,
  name varchar(100) default NULL,
  port varchar(10) default NULL,
  z3950 varchar(100) default NULL,
  error varchar(50) default '?',
  PRIMARY KEY  (id)
  ) TYPE=MyISAM;

Insert the server details to be verified

Create the servers table (if you used sql/tables.sql it will have been created for you) in the same temporary database.

Import both sql/gatewaynew.sql and sql/asyncz.sql into the relevant tables, remembering to switch to the temporary database or to specify the temporary database name on the command line.

This is an example only, using the temporary database name 'devel'

mysql devel -p < sql/gatewaynew.sql
mysql devel -p < sql/asyncz.sql

Now import the main isbnsearch servers from the isbnsearch package (copied into the devel package as sql/servers.sql - this will allow existing servers to be re-verified.

mysql devel -p < sql/servers.sql


Removing duplicates

duplicates.pl needs to be configured for your database settings before it can run. Use your favourite text editor to add details for your MySQL host, user, password and the name of the temporary database.

Unlike the main isbnsearch package, the table names are not configurable, if you need to change them, be careful to change all references in the duplicates.pl script AND the SQL files.

duplicates.pl combines the two tables into one SQL output of unique records that is then added to the 'servers' table, ready for verification:
perl duplicates.pl

Note: duplicates.pl will delete all entries from the asyncz and gateway tables in the temporary database.

[ Next ] [ Back ]


Valid HTML 4.01!

The copyright licensing notice below applies to this text.

Copyright © 2003-2004 Neil Williams

Permission is granted to copy, distribute, and/or modify this document under the terms of the GNU Free Documentation License, Version 1.1 or any later version published by the Free Software Foundation; with no Invariant Sections, with no Front-Cover Texts, and with no Back-Cover Texts. A copy of this license is included in the file copying.txt

divider bar SourceForge.net Logo