$Id: Release-Notes-1.3.txt,v 1.14 1995/09/05 21:00:00 duane Exp $

                          TABLE OF CONTENTS

    1. Gatherer
            IP-based filtering
            Username/Passwords
            Post-Summarizing
            Cache directory cleanup
            Limit on retrieval size
            Support for HTML-3.0, Netscape, and HotJava DTDs

    2. Broker
            Brokers.cf
            Glimpse 3.0
            Verity/Topic
            WAIS, Inc.
            Displaying SOIF attributes in results
            Uniqify duplicate objects
            Glimpse inline queries

    3. Cache
            Persistent disk cache
            Common logfile format
            Improved Internet protocol support
            TTL calculation by regexp
            Improved customizability
            Security
            Performance Enhancements
            Portability
            Optional Code

    4. Miscellaneous
            Admin scripts

========================================================================

GATHERER

IP-based filtering
------------------

It is now possible to use an IP network address in a Host filter file.
The IP address is matched using regular expressions, which means that
periods must be escaped.  For example:

        Allow   128\.196\..*
        Deny    .*

Username/Passwords
------------------

It is now possible to gather password-protected documents from HTTP and
FTP servers.  In both cases, a username and password can be specified
as part of the URL.  The format is

        ftp://user:password@host:port/url-path
        http://user:password@host:port/url-path

With this format, the "user:password" part is kept as part of the URL
string throughout Harvest.  This may enable anyone who uses your
Broker(s) to access the password-protected pages.

It is also possible to keep the username and password information
"hidden" by specifying it in the gatherer.cf file instead.  For HTTP,
the format is

        HTTP-Basic-Auth: realm username password

'realm' is the same as the 'AuthName' parameter given in an NCSA
.htaccess file.  In the CERN HTTP configuration, the realm value is
called 'ServerId'.

For FTP, the format in the gatherer.cf file is

        FTP-Auth: hostname[:port] username password
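
For example, a gatherer.cf that authenticates to one protected HTTP
realm and one FTP server might contain entries like the following (the
realm name, hostname, username, and password shown here are
placeholders, not values Harvest requires):

        HTTP-Basic-Auth: StaffOnly harvester MyPassword
        FTP-Auth: ftp.example.com harvester MyPassword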

Post-Summarizing
----------------

It is now possible to "fine-tune" the summary information generated by
the Essence summarizers.  A typical application is to change the
'Time-to-live' attribute based on some knowledge about the objects, so
an administrator could use the post-summarizing feature to give
quickly-changing objects a lower TTL and very stable documents a
higher TTL.

Objects are selected for post-processing if they meet a specified
condition.  A condition consists of three parts: an attribute name, an
operation, and some string data.  For example:

        city == 'New York'

In this case we are checking whether the 'city' attribute is equal to
the string 'New York'.  For exact string matching, the string data
must be enclosed in single quotes.  Regular expressions are also
supported:

        city ~ /New York/

as are negative operators:

        city != 'New York'
        city !~ /New York/

Conditions can be joined with '&&' (logical and) or '||' (logical or)
operators:

        city == 'New York' && state != 'NY'

When all conditions are met for an object, some number of instructions
are executed on it.  There are four types of instructions which can be
specified:

    1. Set an attribute to a specific string.  Example:

            time-to-live = "86400"

    2. Filter an attribute through some program.  The attribute value
       is given as input to the filter; the output of the filter
       becomes the new attribute value.  Example:

            keywords | tr A-Z a-z

    3. Filter multiple attributes through some program.  In this case
       the filter must read and write attributes in the SOIF format.
       Example:

            address,city,state,zip ! cleanup-address.pl

    4. Delete an object.  This is a special case; to do it, simply
       write:

            delete()

The conditions and instructions are combined in a "rules" file.  The
format of this file is somewhat similar to a Makefile: conditions
begin in the first column, and instructions are indented by a
tab-stop.  Example:

type == 'HTML'
        partial-text | cleanup-html-text.pl

URL ~ /users/
        time-to-live = "86400"
        partial-text ! extract-owner.sh

type == 'SOIFStream'
        delete()

The rules file is specified in the gatherer.cf file with the
Post-Summarizing: tag, e.g.:

        Post-Summarizing: lib/myrules

Cache directory cleanup
-----------------------

The gatherer uses a local disk cache of objects it has retrieved.
These objects are stored in the tmp/cache-liburl subdirectory.  Prior
to v1.3 this cache directory was left in place after the gatherer
completed, which caused confusion and problems when users re-ran the
gatherer and expected to see new or changed objects appear.  Now the
default behaviour is to remove the cache-liburl directory after the
gatherer completes successfully.  Users who want to leave this
directory in place will need to add

        Keep-Cache: yes

to their gatherer.cf file.

Limit on retrieval size
-----------------------

The code for retrieving FTP, HTTP, and Gopher objects now stops
transferring after 10M bytes.  This prevents bogus URLs from filling
up local disk space.  The limit can currently only be changed by
modifying the source in src/common/url (look for
"MAX_TRANSFER_SIZE").

Support for HTML-3.0, Netscape, and HotJava DTDs
------------------------------------------------

DTDs for HTML-3.0, Netscape, and HotJava have been added to the
collection in lib/gatherer/sgmls-lib/HTML/.  To take advantage of
these DTDs, your HTML pages should begin with the <!DOCTYPE>
declaration corresponding to the DTD they were written for.

========================================================================

BROKER

Brokers.cf
----------

Prompted by security concerns, there is a change in the way that
BrokerQuery.pl.cgi connects to a broker.  The old method passed the
broker hostname and port number as CGI arguments.  The new method
passes the broker's short name instead; this name is then looked up in
the file $HARVEST_HOME/brokers/Brokers.cf.  The CreateBroker program
will add the correct entry to Brokers.cf.

The old method still works for backwards compatibility.  With the new
method, the broker name must appear in the Brokers.cf file; if it does
not, the user receives an error message.

The Brokers.cf file also enables some interesting possibilities, such
as:

        * quickly relocating brokers to other machines
        * using dual brokers for 24hr/day availability

If you change your broker port number (in admin/broker.conf), then
don't forget to change it here as well.

Glimpse 3.0
-----------

Harvest now uses Glimpse 3.0, which includes a number of bugfixes and
performance improvements:

        * A new data structure considerably speeds up queries on large
          indexes.  Typical queries now take less than one second,
          even for very large indexes.

        * Incremental indexing is now fully supported.

        * The on-disk indexing structures have been improved in
          several ways.  As a result, indexes from previous versions
          are incompatible.  When upgrading to this release, you
          should remove all .glimpse_* files in the broker directory
          before restarting the broker (see the sketch following this
          list).

        * Glimpse can now handle more than 64k objects in the broker.
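
A minimal sketch of that cleanup step, assuming a broker directory of
$HARVEST_HOME/brokers/mybroker (the directory name, and how you stop
and restart the broker, will vary with your installation):

        % cd $HARVEST_HOME/brokers/mybroker
        % rm -f .glimpse_*

Then restart the broker in whatever way you normally start it (for
example, via the RunBrokers.sh script described under "Admin scripts"
below).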

Verity/Topic
------------

This release includes support for using Verity Inc.'s Topic indexing
engine with the broker.  In order to use Topic with Harvest, a license
must be purchased from Verity (see http://www.verity.com/).

At this point, Harvest does not make use of all the features in the
Topic engine.  It does, however, include a number of features that
make the combination attractive:

        * Background indexing: the broker will continue to service
          requests as new objects are added to the database.

        * Matched lines (or highlights): lines containing query terms
          are displayed with the result set.

        * Result set ranking.

        * Flexible query operations such as proximity, stemming, and
          thesaurus lookups.

WAIS, Inc.
----------

This release includes support for using WAIS Inc.'s commercial WAIS
indexing engine with the broker.  To use commercial WAIS with Harvest,
a license must be purchased from WAIS Inc. (see http://www.wais.com/).
The WAIS/Harvest combination offers the following features:

        * Structured queries (not available with Free WAIS).

        * Incremental indexing.

        * Result set ranking.

        * Use of native WAIS operators, e.g. ADJ to find one word
          adjacent to another.

Displaying SOIF attributes in results
-------------------------------------

In v1.2 the Broker allowed specific attributes from matched objects to
be returned in the result set; however, there was no real support for
this in BrokerQuery.pl.cgi.  It is now possible to request SOIF
attributes through HTML FORM facilities.  A simple approach is to
include a select list of attribute names in the query form; in this
manner, the user may control which attributes are displayed.  The
HTML layout of these attributes is controlled by a specification in
$HARVEST_HOME/cgi-bin/lib/BrokerQuery.cf.

Uniqify duplicate objects
-------------------------

Occasionally a broker may end up with duplicate entries for individual
URLs.  This usually happens when the Gatherer changes (its
description, hostname, or port number).  To remedy this situation,
there is a "uniqify" command on the broker interface.  On the
admin.html page it is described as "Delete older objects of duplicate
URLs."  When two objects with the same URL are found, the object with
the older timestamp is removed.

Glimpse inline queries
----------------------

In v1.2, using Glimpse with the broker required the broker to fork a
'glimpse' process for every query.  Now the broker can make the query
directly to the 'glimpseserver'.  If glimpseserver is disabled or not
running for some reason, the broker falls back to the previous
approach and spawns a glimpse process to handle the query.

========================================================================

CACHE

Persistent disk cache
---------------------

Upon startup the cache now "reloads" cached objects from a previous
session.  While this adds some delay at startup, heavily used sites
will benefit, especially where filling the cache with popular objects
is expensive or time-consuming.

To disable the persistent disk cache, add the '-z' flag to cached's
command line.  This emulates the previous behaviour, which is to
remove all previously cached objects at startup.

Common logfile format
---------------------

The cache now supports the httpd common logfile format, which is used
by many HTTP server implementations.  This makes the cache's access
logfile compatible with many of the freely available logfile
analyzers.  Note that the cache does not (yet) log the object size for
requests which result in a 'TCP_MISS'.  There have been many
improvements to the debugging output as well.
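
For reference, each entry in the common logfile format records the
remote host, identity, authenticated user, timestamp, request line,
status code, and transfer size on a single line.  An entry looks
roughly like the following (the host, date, and URL are made-up
values, and the cache's exact output may differ slightly):

        128.138.1.1 - - [05/Sep/1995:14:21:33 -0600] "GET http://www.example.com/ HTTP/1.0" 200 4512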

Improved Internet protocol support
----------------------------------

Numerous improvements and bugfixes have been made to the HTTP, FTP,
and Gopher protocol implementations.  Additionally, a user-contributed
patch for proxying to WAIS servers has been included.

TTL calculation by regexp
-------------------------

It is now possible to have the cache calculate time-to-live values
based on URL regular expressions.  This allows an administrator, for
example, to set large TTLs for images and smaller TTLs for text.
These rules are specified in the cached.conf file on lines beginning
with the tag 'ttl_pattern'.  For example:

        ttl_pattern ^http:// 1440 20% 43200

The second field is a POSIX-style regular expression.  Invalid
expressions are ignored.

The third field is an absolute time-to-live, given in minutes.  This
value is ignored if negative.  A zero value indicates that an object
matching the pattern should not be cached.  NOTE: the absolute TTL is
used only if the percent-of-age factor (described next) is not used.

The fourth field is a percent-of-age factor.  If the object is sent
with valid Last-Modification timestamp information, then the object's
TTL is calculated as

        TTL = (current-time - last-modified) * percent-of-age / 100;

If the percent-of-age field is zero, or a last-modification timestamp
is not present, then the algorithm falls back to the absolute TTL
value.

The fifth field is a maximum, upper bound on the TTL returned by the
percent-of-age method.  It is specified in minutes, with the default
being 30 days.  This is provided in case a buggy server implementation
returns ridiculous last-modification data.  With the example pattern
above, an object last modified ten days ago would receive a TTL of two
days (20% of its age), well under the 43200-minute (30-day) cap.

Improved customizability
------------------------

More options have been added to the cache configuration file:

        * A string-based stoplist to deny caching of objects which
          contain the stoplist string (e.g. "cgi-bin").

        * Support for "quick aborting."  When the client drops a
          connection, the cache will abort the data transfer
          immediately.  This is useful for caches behind SLIP/PPP
          connections.

        * The number of DNS lookup servers is now configurable.  The
          default is three.

        * The trace mail message sent to cs.colorado.edu (containing
          only the IP address and port number of your cache) can now
          be turned off.

Security
--------

IP-based access controls are now supported.  The administrator may
deny access to specific IP networks/hosts, or may allow access only
from specific networks/hosts.  Two access control lists are
maintained: one for clients/browsers using the cache (the "ascii
port") and another for the remote instrumentation interface (cache
manager).

Performance Enhancements
------------------------

Several performance enhancements have been made to the cache:

        * The LRU replacement algorithm is quicker and more efficient.
          In conjunction with the new LRU replacement policy, the
          default low water mark has been changed from 80% to 60%.

        * The in-memory usage (metadata) of cached objects has been
          reduced to 80-100 bytes per object.

        * Retrieval of various statistics from the instrumentation
          interface is much faster.

        * User-configurable garbage collection reduces the number of
          times these more expensive operations are performed.

        * Memory management has been cleaned up, and overall memory
          usage reduced.  Our checks with Purify report no memory
          leaks.

Portability
-----------

The TCL libraries are no longer needed to compile the cache.
User-contributed patches have been incorporated for better support on
BSD, Linux, IRIX, and HP-UX systems.

Optional Code
-------------

The following are recent additions to the code.  They can be
optionally included by setting '-D' flags in the Makefile.

CHECK_LOCAL_NETS

        Define this to optimize retrievals from servers on your local
        network.

        If your cache is configured with a parent, objects from your
        local servers may be pulled through the parent cache.  To
        always retrieve local objects directly, define
        CHECK_LOCAL_NETS and rebuild the source code.  Then add your
        local IP network addresses to the cache configuration file
        with the 'local_ip' directive.  For example:

                local_ip 128.138.0.0
                local_ip 192.54.50.0

LOG_FQDN

        Client IP addresses are logged in the access log file.  To log
        the fully qualified domain name instead, define LOG_FQDN and
        rebuild the code.

        WARNING: This is not implemented efficiently and may adversely
        affect your cache performance.  Before each line is written to
        the access log file, a call to gethostbyaddr(3) is made.  This
        library call may block for an arbitrary amount of time while
        waiting for a reply from a DNS server.  While this function
        blocks, the cache will not be able to process any other
        requests.  You have been warned.

APPEND_DOMAIN

        Define this and use the 'append_domain' configuration
        directive to append a domain name to hostnames that are given
        without any domain information.

USE_WAIS_RELAY

        Define this and use the 'wais_relay' configuration directive
        to allow WAIS queries to be cached and proxied.

========================================================================

MISCELLANEOUS

Admin scripts
-------------

A number of sample scripts are provided to aid in administering your
Harvest installation:

RunGatherers.sh:

        This script can be run from your ``/etc/rc'' scripts to start
        the Harvest gatherer daemons at boot time.  It must be
        customized with the directory names of your gatherers.  It is
        installed in $HARVEST_HOME/lib/gatherer.

RunBrokers.sh:

        This script can be run from your ``/etc/rc'' scripts to start
        the Harvest brokers at boot time.  It must be customized with
        the directory names of your brokers.  It is installed in
        $HARVEST_HOME/lib/broker.

harvest-check.pl:

        This Perl script is designed to be run occasionally as a
        cron(1) job.  It will contact your gatherers and brokers and
        report on any which seem to be unreachable.  The list of
        gatherers and brokers to contact can be specified at the end
        of the script.
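
As a rough sketch, a crontab entry that runs harvest-check.pl once an
hour and mails the report might look like the following (the
installation path and mail recipient are only examples; adjust them to
your site):

        0 * * * * /usr/local/harvest/lib/harvest-check.pl 2>&1 | mail webmaster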