Yolinux.com Tutorial
ht://Dig

htDig - Web Site Search

How to add web page search and web page indexing capability to your web site with ht://Dig. HtDig will provide an on-site web search capability.

HtDig: Description

ht://Dig is available with most Linux distributions and is intended for a single web site or domain. Unlike WAIS or Pearlsearch, which index a single server, ht://Dig can span several web servers. It is not an internet-wide search engine like Yahoo or Google.

ht://Dig is a free, GPL-licensed, open source index and search engine that one can install on a web server. It first generates a database by "indexing" the web content. HtDig provides a CGI program that searches the database and generates a web page of search results pointing to the content on the website.

HtDig will index HTML and text file content to generate a search database for key words. It will also email you when there are "expired" documents.

HtDig: Installation
  • Red Hat/CentOS: yum install htdig htdig-web
  • Ubuntu: sudo apt-get install htdig
  • Install from source:
    • tar xzf htdig-3.2.0.tar.gz
    • cd htdig-3.2.0
    • ./configure --prefix=/opt
    • make depend
    • make
    • make install
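
The package choices above can be sketched as a small shell helper. This is only an illustration: the function names install_cmd_for and detect_pkg_manager are made up for this example, and only the two package managers listed above are covered.

```shell
#!/bin/sh
# Print the install command from the list above for a given package manager.
install_cmd_for() {
    case $1 in
        yum)     echo "yum install htdig htdig-web" ;;
        apt-get) echo "sudo apt-get install htdig" ;;
        *)       echo "no packaged install known; build from source" >&2
                 return 1 ;;
    esac
}

# Report the first of the two package managers found on PATH.
detect_pkg_manager() {
    for pm in yum apt-get; do
        if command -v "$pm" >/dev/null 2>&1; then
            echo "$pm"
            return 0
        fi
    done
    return 1
}
```

One way to use it: pm=$(detect_pkg_manager) && install_cmd_for "$pm" prints the matching command without running it.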

HtDig: Configuration

  • Config file: /etc/htdig/htdig.conf
    # Specify where the database files need to go. Needs lots of disk space.
    database_dir:           /var/lib/htdig
    
    # This specifies the URL where the robot (htdig) will start.
    # You can specify multiple URLs here (separate with whitespace).
    start_url:              http://www.yourdomain.com
    
    ...
    ...
        

    If supporting multiple virtual domains you may want to create /etc/htdig/htdig-domain1.conf, /etc/htdig/htdig-domain2.conf, etc

    File: /etc/htdig/htdig-domain1.conf
    database_dir:   /var/lib/htdig/domain1
    
    start_url:      http://www.domain1.com
    
    common_dir:     /usr/share/htdig-domain1
    
    exclude_urls:   /cgi-bin/ .cgi images
    ...
    ...
        

  • HtDig search index database directory: /var/lib/htdig

    If supporting multiple domains, create: mkdir /var/lib/htdig/domain1, etc

  • Generate the database: rundig -c /etc/htdig/htdig-domain1.conf

    To avoid downtime, add the "-a" command line option, which allows users to search the site while you are spidering the content: rundig -c /etc/htdig/htdig-domain1.conf -a

  • Header and Footer pages: (used to display htDig search results)
    • Red Hat:
      /usr/share/htdig/footer.html
      /usr/share/htdig/header.html
      /usr/share/htdig/nomatch.html
      etc
              
      Virtual/multiple domains: /usr/share/htdig-domain1/, /usr/share/htdig-domain2/, etc
      • mkdir /usr/share/htdig-domain1
      • mkdir /usr/share/htdig-domain2
      • cp /usr/share/htdig/* /usr/share/htdig-domain1/
      • cp /usr/share/htdig/* /usr/share/htdig-domain2/
      • chcon -Rt httpd_sys_content_t /usr/share/htdig-domain1/
    • Ubuntu:
      /etc/htdig/nomatch.html
      /etc/htdig/footer.html
      /etc/htdig/header.html
      etc
              

  • Add HTML form to web page:
    ...
    ...
    
    <form method="post" action="/cgi-bin/htsearch">
    <font size="-1">
    Match: <select name="method">
    <option value="and">All</option>
    <option value="or">Any</option>
    <option value="boolean">Boolean</option>
    </select>
    Format: <select name="format">
    <option value="builtin-long">Long</option>
    <option value="builtin-short">Short</option>
    </select>
    Sort by: <select name="sort">
    <option value="score">Score</option>
    <option value="time">Time</option>
    <option value="title">Title</option>
    <option value="revscore">Reverse Score</option>
    <option value="revtime">Reverse Time</option>
    <option value="revtitle">Reverse Title</option>
    </select>
    </font>
    <input type="hidden" name="config" value="htdig"/>
    <input type="hidden" name="restrict" value=""/>
    <input type="hidden" name="exclude" value=""/>
    <br />
    Search:
    <input type="text" size="30" name="words" value=""/>
    <input type="submit" value="Search"/>
    </form>
    
    ...
    ...
        

    Note for multiple domains reference the configuration for that domain:
    <input type="hidden" name="config" value="htdig-domain1"/>

    For a simple single search box, hard code the previous "options":
    ...
    ...
    
    <form method="post" action="/cgi-bin/htsearch">
    <input type="hidden" name="method" value="and"/>
    <input type="hidden" name="format" value="builtin-long"/>
    <input type="hidden" name="sort" value="score"/>
    <input type="hidden" name="config" value="htdig"/>
    <input type="hidden" name="restrict" value=""/>
    <input type="hidden" name="exclude" value=""/>
    Search:
    <input type="text" size="30" name="words" value=""/>
    <input type="submit" value="Search"/>
    </form>
    
    ...
    ...
        

    Note for multiple domains reference the configuration for that domain:
    <input type="hidden" name="config" value="htdig-domain1"/>

  • Default Apache web server configuration: /etc/httpd/conf.d/htdig.conf
    Alias /htdig /usr/share/htdig
    Alias /htdig-domain1 /usr/share/htdig-domain1
        
    Restart the apache web server to pick up the new configuration:
    • Red Hat: /etc/init.d/httpd restart
    • Ubuntu: /etc/init.d/apache2 restart

  • Test in browser: http://www.domain1.com/cgi-bin/htsearch?config=htdig-domain1&words=testword
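
The per-domain steps above (config file, database directory, copy of the result templates) can be collected into one shell function. The following is a minimal sketch, not a packaged tool: the function name setup_htdig_domain and its optional third argument (a path prefix for trying it out in a scratch directory) are assumptions made for this example, and the template location follows the Red Hat layout listed above.

```shell
#!/bin/sh
# Sketch: create the per-domain config, database directory and template
# directory. The optional third argument is a hypothetical path prefix
# (empty for the real filesystem) so the function can be tried safely.
setup_htdig_domain() {
    name=$1        # e.g. domain1
    url=$2         # e.g. http://www.domain1.com
    root=${3:-}    # optional prefix for a dry run

    conf="$root/etc/htdig/htdig-$name.conf"
    db_dir="/var/lib/htdig/$name"
    common_dir="/usr/share/htdig-$name"

    mkdir -p "$root/etc/htdig" "$root$db_dir" "$root$common_dir" || return 1

    # Copy the stock templates if they are installed (Red Hat layout assumed).
    if [ -d "$root/usr/share/htdig" ]; then
        cp "$root"/usr/share/htdig/* "$root$common_dir/" 2>/dev/null || true
    fi

    # Write the minimal per-domain settings shown earlier.
    cat > "$conf" <<EOF
database_dir:   $db_dir
start_url:      $url
common_dir:     $common_dir
exclude_urls:   /cgi-bin/ .cgi images
EOF
}
```

For example, setup_htdig_domain domain1 http://www.domain1.com /tmp/htdig-test writes everything under /tmp/htdig-test for inspection. The sketch does not run chcon, edit the Apache aliases, or call rundig; those steps still need to be done as described above.
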
Customizing the ht://Dig results page:

The default page presentation is compiled into the CGI. To use your own header and footer files instead, the header and footer directives or the template directives must be set in the config file: /etc/htdig/htdig-domain1.conf

search_results_header: /usr/share/htdig-domain1/header.html
search_results_footer: /usr/share/htdig-domain1/footer.html
search_results_wrapper: /usr/share/htdig-domain1/wrapper.html
nothing_found_file: /usr/share/htdig-domain1/nomatch.html
syntax_error_file: /usr/share/htdig-domain1/syntax.html
    

Definitely specify nomatch.html, as a blank page is uninformative.

Custom HTML Files:

File                     Description
COMMON_DIR/header.html   The default search results header file.
COMMON_DIR/footer.html   The default search results footer file.
COMMON_DIR/wrapper.html  The default search results wrapper file, that contains the header and footer together in one file.
COMMON_DIR/nomatch.html  Page stating that "No matches" were found for the search terms.
COMMON_DIR/syntax.html   The default file that explains boolean expression syntax errors to the user.

Where COMMON_DIR is:
  • Red Hat: /usr/share/htdig/
  • Ubuntu: /etc/htdig/
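
Before pointing the config at a template directory, it can help to confirm that the five files from the table are actually there. A minimal sketch follows; the function name missing_templates is an assumption made for this example.

```shell
#!/bin/sh
# List which of the five custom HTML files are absent from a directory.
# Prints one path per missing file; prints nothing if all are present.
missing_templates() {
    dir=$1
    for f in header.html footer.html wrapper.html nomatch.html syntax.html; do
        [ -f "$dir/$f" ] || printf '%s\n' "$dir/$f"
    done
}
```

For example, missing_templates /usr/share/htdig-domain1 prints nothing when the copy from COMMON_DIR was complete.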

ht://Dig notes:

  • Exclude a single content page from the search:
    <meta name="robots" content="noindex, follow">

    Place in the "head" section of the page to be overlooked.

  • List of words ignored by spider: /usr/share/htdig/bad_words
    These are words like "the, and, for, with, that, this", etc.

  • Example cron job to re-index each week:
    File: /etc/cron.weekly/htdig
    #!/bin/sh
    /usr/bin/rundig -c /etc/htdig/htdig-domainX.conf -a
        
    Also see the YoLinux.com cron sysadmin tutorial
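
When several virtual domains are indexed, the weekly job can loop over every per-domain config instead of naming one file. This is a sketch under assumptions: the function name reindex_all and its two optional arguments (config directory and rundig path, overridable for a dry run) are made up for this example, and it assumes the configs follow the htdig-NAME.conf naming used above.

```shell
#!/bin/sh
# Re-index every per-domain configuration found in a directory.
# Both arguments are optional and exist only to allow a dry run.
reindex_all() {
    conf_dir=${1:-/etc/htdig}
    rundig=${2:-/usr/bin/rundig}
    rc=0
    for conf in "$conf_dir"/htdig-*.conf; do
        [ -f "$conf" ] || continue    # the glob matched nothing
        # -a keeps the site searchable while the content is spidered
        "$rundig" -c "$conf" -a || rc=1
    done
    return "$rc"
}
```

To use it as /etc/cron.weekly/htdig, end the file with a call to reindex_all. A failure in one domain does not stop the others; the function returns non-zero if any run failed.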

Apache Web Server Configuration:

Search results pages produced by HtDig use graphics provided by HtDig. To enable web server access, add the following:

...

    Alias /htdig/ "/usr/share/htdig/"
    <Directory "/usr/share/htdig">
        Options Indexes MultiViews FollowSymLinks
        AllowOverride All
        Require all granted
    </Directory>

...
    
Apache httpd 2.4 configuration snippet

ht://Dig Man Pages:
  • htdig: retrieve HTML documents for ht://Dig search engine
  • htsearch: CGI search program that queries the database and generates the search results page
  • htdump: write out an ASCII-text version of the document database
  • htdigconfig: script to create fuzzy databases for ht://Dig
  • htfuzzy: fuzzy command-line search utility for ht://Dig
  • htload: reads in an ASCII-text version of the document database
  • htmerge: create document index and word database from files that were created by htdig.
  • htnotify: sends email notifications about out-dated web pages discovered by htmerge
  • htpurge: remove unused documents from the database
  • htstat: returns statistics on the document and word databases
  • rundig: sample script to create a search database for ht://Dig

Links: