the ht://Dig web site search engine

introduction

At its most basic, a web site is composed of a web server and a few HTML pages. As a site grows, it becomes harder for a customer to find the page she wants, so a search engine is a popular optional extra.

A search engine is an application that includes a data store and programs for managing the data store. The store is full of web site content, with an index containing all the words in the site. A customer gives a word to the search engine and the search engine returns a list of every part of the site that uses the word. The program that collects information from a web site is called a spider.
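The word index described above can be pictured as a map from each word to the pages that use it. The following is only a minimal sketch of that idea, not how ht://Dig stores its data; the page names and text are made up for illustration.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Map each lower-cased word to the set of page URLs that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index


def search(index, word):
    """Return the sorted list of pages that use the word."""
    return sorted(index.get(word.lower(), set()))


# Hypothetical pages standing in for content a spider has collected.
pages = {
    "/index.html": "Welcome to the intranet home page",
    "/projects.html": "Project mandate and project plans",
    "/contact.html": "Contact the intranet team",
}
index = build_index(pages)
print(search(index, "intranet"))  # ['/contact.html', '/index.html']
```

A real engine also stores where in each page the word occurs and how to rank the matches; the sketch leaves both out.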

Google and Yahoo! are vast closed source search engines that search the entire Internet. By contrast, ht://Dig and Lucene are open source search engines for searching a few web sites.

what it is

The htdig application runs early each morning, spidering intranet files. htdig parses all HTML files itself. A helper app parses PDF, Excel, Word and PowerPoint files.

There are three programs making up htdig.

* htdig creates db.docdb
* htmerge creates db.docs.index, db.wordlist.work and db.words.db.
* htsearch uses db.docdb, db.docs.index, and db.words.db.

Use the -a option to create a separate set of databases.

.work files are created by htmerge with the -a flag and are never used by htsearch.

* htdig creates db.docdb and db.docdb.work.
* htmerge creates db.docs.index, db.wordlist.work and db.words.db.
* htsearch uses db.docdb, db.docs.index, and db.words.db.
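The build order in the lists above (htdig first, then htmerge, with htsearch reading the results) can be sketched as a small wrapper. This is an illustration under assumptions: the config file name htdig.conf is a guess, and so is the idea that the programs live in /opt/htdig/bin. Only the -a and -c options are used.

```python
import subprocess

# Assumed locations: the bin and conf directories appear on this page,
# but the config file name is a guess.
BIN = "/opt/htdig/bin"
CONFIG = "/opt/htdig/conf/htdig.conf"


def build_commands(config=CONFIG, alternate=False):
    """Return the argv lists for an index build: htdig first, then htmerge."""
    flags = ["-a"] if alternate else []
    return [
        [f"{BIN}/htdig", *flags, "-c", config],
        [f"{BIN}/htmerge", *flags, "-c", config],
    ]


def run_build(alternate=False):
    """Run each step in order, stopping at the first failure."""
    for argv in build_commands(alternate=alternate):
        subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Only print the commands here; call run_build() to execute them.
    for argv in build_commands(alternate=True):
        print(" ".join(argv))
```

Passing alternate=True adds -a to both steps, so the build writes the .work copies while htsearch keeps answering from the existing files.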


dir structure

/opt/htdig/
           bin
           common
           conf
           db
           logs
           parse
           search

helper applications

The helper apps are used when ht://Dig finds certain Content-Type headers in a server's reply. The test below shows a request from me, pretending to be a web client. The request is sent to a web server and the server answers. The answer contains "Content-Type: application/msword".

nsuser@EWE01:>telnet 192.168.98.128 80
Trying 192.168.98.128...
Connected to 192.168.98.128.
Escape character is '^]'.
HEAD  /intranet/ppc/office/word/projectmandate.doc HTTP/1.1
HOST: www.intranet.company01.co.uk
connection: close

HTTP/1.1 200 OK
Date: Thu, 10 Nov 2005 14:43:11 GMT
Server: Apache/1.3.27 (Unix) mod_jk/1.2.5 DAV/1.0.3 PHP/4.2.3
Last-Modified: Fri, 18 Feb 2005 16:18:03 GMT
ETag: "51ea9-11200-4216153b"
Accept-Ranges: bytes
Content-Length: 70144
Connection: close
Content-Type: application/msword

Connection closed by foreign host.
nsuser@EWE01:>
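The same check can be scripted rather than typed into telnet. The sketch below uses Python's standard http.client module to send a HEAD request, plus a small parser that pulls Content-Type out of a raw reply like the one above. The function names are made up for this page.

```python
import http.client


def head_content_type(host, path, port=80, timeout=10):
    """Send a HEAD request and return the Content-Type header, or None."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("HEAD", path, headers={"Connection": "close"})
        return conn.getresponse().getheader("Content-Type")
    finally:
        conn.close()


def content_type_from_raw(raw_response):
    """Pull Content-Type out of a raw response such as the transcript above."""
    for line in raw_response.splitlines()[1:]:  # skip the status line
        if not line.strip():
            break  # a blank line ends the headers
        name, _, value = line.partition(":")
        if name.strip().lower() == "content-type":
            return value.strip()
    return None


# A shortened copy of the server's answer from the telnet session.
sample = (
    "HTTP/1.1 200 OK\n"
    "Date: Thu, 10 Nov 2005 14:43:11 GMT\n"
    "Content-Length: 70144\n"
    "Content-Type: application/msword\n"
    "\n"
)
print(content_type_from_raw(sample))  # application/msword
```

head_content_type needs a reachable web server, so only the parser is exercised here.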

A helper application deals with files that ht://Dig can't handle, like PDF and MS Word files. The helper is called doc2html.pl and lives in the parse dir. It sometimes fails and leaves behind parsed Excel files called htdex.XXXXX.

doc2html.pl

doc2html.pl parses several file formats into html, which htdig can then use.
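The idea of handing a file to a converter based on its type can be sketched as a lookup table. This is not how doc2html.pl itself is written. Only application/msword appears in the transcript on this page; the other MIME types are the standard registrations and are assumed here, and "excel helper" is a placeholder label.

```python
# Content type -> helper. The table is illustrative, not taken from doc2html.pl.
HELPERS = {
    "application/msword": "catdoc",
    "application/pdf": "pdf2html.pl",
    "application/vnd.ms-excel": "excel helper",  # placeholder label
}


def choose_helper(content_type):
    """Return the helper for a Content-Type value, or None if no helper is needed."""
    # Drop parameters such as "; charset=..." before looking the type up.
    mime = content_type.split(";", 1)[0].strip().lower()
    return HELPERS.get(mime)


print(choose_helper("application/msword"))             # catdoc
print(choose_helper("text/html; charset=ISO-8859-1"))  # None
```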

catdoc

MS Word documents are converted by a utility called catdoc. The directory /usr/local/share/catdoc contains lots of code mappings used by catdoc.

xlhtml

MS Excel spreadsheets are parsed by xlhtml.

The Excel helper "xlhtml" seems to fail on some files. It is now excluded from the PPP intranet search. You can find core files and other debris in its working directory.

pdf2html.pl

PDF files are converted by this perl script. See /opt/htdig/parse/pdf2html.pl

(from http://www.htdig.org/)

Features

Here are some of the major features of ht://Dig. They are in no particular order.

Intranet searching
    ht://Dig has the ability to search through many servers on a network by acting as a WWW browser.
It is free
    The whole system is released under the GNU General Public License.
Robot exclusion is supported
    The Standard for Robot Exclusion is supported by ht://Dig.
Boolean expression searching
    Searches can be arbitrarily complex using boolean expressions.
Configurable search results
    The output of a search can easily be tailored to your needs by means of providing HTML templates.
Fuzzy searching
    Searches can be performed using various configurable algorithms. Currently the following algorithms are supported (in any combination):
      • exact
      • soundex
      • metaphone
      • common word endings (stemming)
      • synonyms
      • accent stripping
      • substring and prefix
Searching of HTML and text files
    Both HTML documents and plain text files can be searched. Searching of other file types will be supported in future versions.
Keywords can be added to HTML documents
    Any number of keywords can be added to HTML documents which will not show up when the document is viewed. This is used to make a document more likely to be found and also to make it appear higher in the list of matches.
Email notification of expired documents
    Special meta information can be added to HTML documents which can be used to notify the maintainer of those documents at a certain time. It is handy to get reminded when to remove the "New" images from a certain page, for example.
A protected server can be indexed
    ht://Dig can be told to use a specific username and password when it retrieves documents. This can be used to index a server or parts of a server that are protected by a username and password.
Searches on subsections of the database
    It is easy to set up a search which only returns documents whose URL matches a certain pattern. This becomes very useful for people who want to make their own data searchable without having to use a separate search engine or database.
Full source code included
    The search engine comes with full source code. The whole system is released under the terms and conditions of the GNU General Public License version 2.0.
The depth of the search can be limited
    Instead of limiting the search to a set of machines, it can also be restricted to documents that are a certain number of "mouse-clicks" away from the start document.
Full support for the ISO-Latin-1 character set
    Both SGML entities like '&agrave;' and ISO-Latin-1 characters can be indexed and searched.
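The depth limit in the feature list above can be pictured as a breadth-first walk that stops following links once a page is the allowed number of clicks from the start. This is a minimal sketch over a made-up link graph, not ht://Dig's own code.

```python
from collections import deque


def pages_within(links, start, max_hops):
    """Return {url: hops} for every page reachable in at most max_hops clicks."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        if hops[url] == max_hops:
            continue  # do not follow links out of pages at the limit
        for target in links.get(url, []):
            if target not in hops:
                hops[target] = hops[url] + 1
                queue.append(target)
    return hops


# Hypothetical link graph: page -> pages it links to.
links = {
    "/": ["/a.html", "/b.html"],
    "/a.html": ["/c.html"],
    "/c.html": ["/d.html"],
}
print(pages_within(links, "/", 2))
# {'/': 0, '/a.html': 1, '/b.html': 1, '/c.html': 2}
```

With a limit of 2, /d.html is three clicks from the start page and is never visited.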

what it isn't

ht://Dig is not a web site directory. A directory is a data store of content that is created by people. The store is more ordered than a search engine store, possibly arranged in a hierarchy, and contains less junk.

where it is

history