ht://Dig is available with most Linux distributions and is intended for a single web site or domain. Unlike WAIS or Pearlsearch, which index a single server, ht://Dig can span several web servers. It is not an internet-wide search engine like Yahoo or Google.
ht://Dig is a free, GPL-licensed, open source index and search engine that one can install on a web server. It first generates a database by "indexing" the web content. HtDig then provides a CGI program that searches the database and generates a web page of search results pointing to the content on the website.
HtDig will index HTML and text file content to generate a search database for key words. It will also email you when there are "expired" documents.
- Red Hat/CentOS: yum install htdig htdig-web
- Ubuntu: sudo apt-get install htdig
- Install from source:
- tar xzf htdig-3.2.0.tar.gz
- cd htdig-3.2.0
- ./configure --prefix=/opt
- make depend
- make
- make install
- Config file: /etc/htdig/htdig.conf
    # Specify where the database files need to go. Needs lots of disk space.
    database_dir: /var/lib/htdig
    # This specifies the URL where the robot (htdig) will start.
    # You can specify multiple URLs here (separate with whitespace).
    start_url: http://www.yourdomain.com
    ...
For multiple domains, create one config file per domain, e.g. /etc/htdig/htdig-domain1.conf:
    database_dir: /var/lib/htdig/domain1
    start_url: http://www.domain1.com
    common_dir: /usr/share/htdig-domain1
    exclude_urls: /cgi-bin/ .cgi images
    ...
- HtDig search index database directory: /var/lib/htdig
If supporting multiple domains, create: mkdir /var/lib/htdig/domain1, etc
- Generate the database: rundig -c /etc/htdig/htdig-domain1.conf
To avoid down time, use the "-a" command line option, which allows users to search the site while you are spidering the content:
    rundig -c /etc/htdig/htdig-domain1.conf -a
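The per-domain setup described above (a database directory and a config file for each domain, followed by a rundig run) can be sketched as a small shell helper. This is a minimal sketch, not part of ht://Dig: the function name `write_domain_conf` and its arguments are hypothetical, and it writes only the attributes already shown above. The `common_dir` value assumes the per-domain template directories named /usr/share/htdig-domainN used later in this document.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with ht://Dig): write a per-domain config.
# Usage: write_domain_conf NAME START_URL CONF_DIR DB_ROOT
write_domain_conf() {
    name=$1
    start_url=$2
    conf_dir=$3
    db_root=$4

    # Each domain gets its own database directory.
    mkdir -p "$db_root/$name" "$conf_dir" || return 1

    # Assumption: templates live in /usr/share/htdig-NAME.
    cat > "$conf_dir/htdig-$name.conf" <<EOF
database_dir: $db_root/$name
start_url: $start_url
common_dir: /usr/share/htdig-$name
exclude_urls: /cgi-bin/ .cgi
EOF
}

# Example (on a real server, as root), followed by an index build:
#   write_domain_conf domain1 http://www.domain1.com /etc/htdig /var/lib/htdig
#   rundig -c /etc/htdig/htdig-domain1.conf
```

Keeping the paths as arguments makes it possible to try the helper against a scratch directory before touching /etc/htdig.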
- Header and Footer pages: (used to display htDig search results)
- Red Hat:
/usr/share/htdig/footer.html /usr/share/htdig/header.html /usr/share/htdig/nomatch.html etc
Virtual/multiple domains: /usr/share/htdig-domain1/, /usr/share/htdig-domain2/, etc.
- mkdir /usr/share/htdig-domain1
- mkdir /usr/share/htdig-domain2
- cp /usr/share/htdig/* /usr/share/htdig-domain1/
- cp /usr/share/htdig/* /usr/share/htdig-domain2/
- chcon -Rt httpd_sys_content_t /usr/share/htdig-domain1/
- Ubuntu:
/etc/htdig/nomatch.html /etc/htdig/footer.html /etc/htdig/header.html etc
- Add HTML form to web page:
    ...
    <form method="post" action="/cgi-bin/htsearch">
    <font size="-1">
    Match:
    <select name="method">
    <option value="and">All</option>
    <option value="or">Any</option>
    <option value="boolean">Boolean</option>
    </select>
    Format:
    <select name="format">
    <option value="builtin-long">Long</option>
    <option value="builtin-short">Short</option>
    </select>
    Sort by:
    <select name="sort">
    <option value="score">Score</option>
    <option value="time">Time</option>
    <option value="title">Title</option>
    <option value="revscore">Reverse Score</option>
    <option value="revtime">Reverse Time</option>
    <option value="revtitle">Reverse Title</option>
    </select>
    </font>
    <input type="hidden" name="config" value="htdig"/>
    <input type="hidden" name="restrict" value=""/>
    <input type="hidden" name="exclude" value=""/>
    <br />
    Search:
    <input type="text" size="30" name="words" value=""/>
    <input type="submit" value="Search"/>
    </form>
    ...
Note: for multiple domains, reference the configuration for that domain:
    <input type="hidden" name="config" value="htdig-domain1"/>
A simpler form that fixes the search options in hidden fields:
    ...
    <form method="post" action="/cgi-bin/htsearch">
    <input type="hidden" name="method" value="and"/>
    <input type="hidden" name="format" value="builtin-long"/>
    <input type="hidden" name="sort" value="score"/>
    <input type="hidden" name="config" value="htdig"/>
    <input type="hidden" name="restrict" value=""/>
    <input type="hidden" name="exclude" value=""/>
    Search:
    <input type="text" size="30" name="words" value=""/>
    <input type="submit" value="Search"/>
    </form>
    ...
Note: for multiple domains, reference the configuration for that domain:
    <input type="hidden" name="config" value="htdig-domain1"/>
- Default Apache web server configuration: /etc/httpd/conf.d/htdig.conf
    Alias /htdig /usr/share/htdig
    Alias /htdig-domain1 /usr/share/htdig-domain1
Restart the Apache web server to pick up the new configuration:
- Red Hat: /etc/init.d/httpd restart
- Ubuntu: /etc/init.d/apache2 restart
- Test in browser: http://www.domain1.com/cgi-bin/htsearch?config=htdig-domain1&words=testword
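The same test can be run from a shell instead of a browser. The sketch below only assembles the test URL shown above; the function name `htsearch_url` is hypothetical, and it assumes a single search word with no spaces or special characters, since no URL encoding is done.

```shell
#!/bin/sh
# Hypothetical helper: print the htsearch test URL for a host, config and word.
# Assumes the word needs no URL encoding (no spaces or special characters).
htsearch_url() {
    host=$1
    config=$2
    words=$3
    printf 'http://%s/cgi-bin/htsearch?config=%s&words=%s\n' \
        "$host" "$config" "$words"
}

# Example, assuming curl is installed and the site is reachable:
#   curl -s "$(htsearch_url www.domain1.com htdig-domain1 testword)" | head
```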
The default page presentation is compiled into the CGI. To invoke the use of the header and footer files, the header and footer directives or the template directives must be turned on in the config file: /etc/htdig/htdig-domain1.conf
    search_results_header: /usr/share/htdig-domain1/header.html
    search_results_footer: /usr/share/htdig-domain1/footer.html
    search_results_wrapper: /usr/share/htdig-domain1/wrapper.html
    nothing_found_file: /usr/share/htdig-domain1/nomatch.html
    syntax_error_file: /usr/share/htdig-domain1/syntax.html
Definitely specify nomatch.html, as a blank page is uninformative.
Custom HTML Files:
File | Description |
---|---|
COMMON_DIR/header.html | The default search results header file. |
COMMON_DIR/footer.html | The default search results footer file. |
COMMON_DIR/wrapper.html | The default search results wrapper file, that contains the header and footer together in one file. |
COMMON_DIR/nomatch.html | Page stating that "No matches" were found for the search terms. |
COMMON_DIR/syntax.html | The default file that explains boolean expression syntax errors to the user. |
Where COMMON_DIR is:
- Red Hat: /usr/share/htdig/
- Ubuntu: /etc/htdig/
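Before pointing a config file at a per-domain template directory, it can help to confirm that all five files from the table are actually there. The following is a minimal sketch; the function name `missing_templates` is hypothetical, and the file list is taken from the table above.

```shell
#!/bin/sh
# Hypothetical helper: print the template files missing from a COMMON_DIR.
# Prints nothing when all five files are present.
missing_templates() {
    common_dir=$1
    for f in header.html footer.html wrapper.html nomatch.html syntax.html; do
        [ -f "$common_dir/$f" ] || echo "$f"
    done
}

# Example:
#   missing_templates /usr/share/htdig-domain1
```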
- Exclude a single content page from the search:
    <meta name="robots" content="noindex, follow">
Place in the "head" section of the page to be overlooked.
- List of words ignored by spider: /usr/share/htdig/bad_words
These are words like "the, and, for, with, that, this", etc.
- Example cron job to re-index each week:
File: /etc/cron.weekly/htdig
    #!/bin/sh
    /usr/bin/rundig -c /etc/htdig/htdig-domainX.conf -a
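When several domains are indexed, the weekly job can loop over the per-domain config files rather than naming each one. This is a sketch under assumptions: the function name `reindex_all` and the `RUNDIG` and `CONF_DIR` override variables are hypothetical, and the config files are assumed to follow the htdig-NAME.conf naming used in this document.

```shell
#!/bin/sh
# Sketch of a weekly re-index covering every per-domain config file.
# RUNDIG and CONF_DIR are hypothetical overrides, handy for dry runs.
RUNDIG=${RUNDIG:-/usr/bin/rundig}
CONF_DIR=${CONF_DIR:-/etc/htdig}

reindex_all() {
    status=0
    for conf in "$CONF_DIR"/htdig-*.conf; do
        # Skip the unexpanded pattern when no file matches.
        [ -f "$conf" ] || continue
        # -a keeps the site searchable while the content is spidered.
        "$RUNDIG" -c "$conf" -a || status=1
    done
    return "$status"
}

# In the cron script, call the function as the last line:
#   reindex_all
```

Setting RUNDIG=echo before calling `reindex_all` prints the commands that would run without touching any database.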
Search results pages produced by HtDig use graphics provided by HtDig. To enable web server access, add the following:
    ...
    Alias /htdig/ "/usr/share/htdig/"
    <Directory "/usr/share/htdig">
        Options Indexes MultiViews FollowSymLinks
        AllowOverride All
        # Apache 2.2 access control syntax:
        Order allow,deny
        Allow from all
        # Apache 2.4 access control syntax:
        Require all granted
    </Directory>
    ...
- htdig: retrieve HTML documents for ht://Dig search engine
- htsearch: CGI program that searches the database and returns the results as a web page
- htdump: write out an ASCII-text version of the document database
- htdigconfig: script to create fuzzy databases for ht://Dig
- htfuzzy: creates the indexes used by the "fuzzy" search algorithms of ht://Dig
- htload: reads in an ASCII-text version of the document database
- htmerge: create document index and word database from files that were created by htdig.
- htnotify: sends email notifications about out-dated web pages discovered by htmerge
- htpurge: remove unused documents from the database
- htstat: returns statistics on the document and word databases
- rundig: sample script to create a search database for ht://Dig