Extracted function comments
Sun Apr 4 18:46:12 2004

=item AdminVersion

=cut

=item Append

=cut

=item Assert

Usage:

    #&Assert( conditional expression );

Assert is a useful debugging tool. Its one argument is a conditional that should be true in every possible case, as long as you've written your code correctly. If the argument turns out to be false at runtime, then Assert will print an error message in very large, bold letters.

Often used to audit function input and output values. Possibly these Assert calls should be stripped or disabled in public releases.

=cut

=item Authenticate

=cut

=item BuildIndex

Usage:

    &BuildIndex();

BuildIndex completely rebuilds the index for a local realm. Because the webpages in local realms are readily accessible, this function tends to process huge data sets quickly. It is self-restartable through a meta-refresh; state information is stored in the $start_pos parameter and working data is stored either in the database or the index_file.working_copy file.

For file-based indexes, all new data is written to index_file.working_copy. When the process is finished, possibly after several browser requests, the original index_file is deleted and index_file.working_copy is renamed over the top of it. Thus, users are able to perform searches on the intact index_file while the BuildIndex process is in progress. In addition, it is possible to safely abandon the BuildIndex process.

For SQL-based indexes, we don't have that concept of a temporary storage area. Instead, each record is updated as the webpage is encountered. At the end of the BuildIndex process, if we get there, we delete all records whose lastindex time is older than "start_time". The only records older than "start_time" are those that were not detected by GetFilesByDirEx, or that were excluded for other reasons.

This is an interactive function; errors and other status messages are shown to the user by printing HTML.
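The file-based strategy described above (write everything to a working copy, then swap it in only at the end) can be sketched roughly as follows. This is a minimal illustration, not the actual BuildIndex code: the helper name rebuild_index_file and its arguments are hypothetical, and the real function also handles restarts across several browser requests.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the documented API: write all records to
# "<index_file>.working_copy" and replace the live index only after the
# whole rebuild has succeeded.
sub rebuild_index_file {
    my ($index_file, $p_records) = @_;
    my $working_copy = "$index_file.working_copy";

    open(my $out, '>', $working_copy)
        or return "unable to write '$working_copy' - $!";
    foreach my $record (@$p_records) {
        print {$out} "$record\n"
            or return "unable to write to '$working_copy' - $!";
    }
    close($out) or return "unable to close '$working_copy' - $!";

    # Until this point searches still read the intact $index_file, and
    # abandoning the rebuild leaves the live index untouched.
    if (-e $index_file) {
        unlink($index_file)
            or return "unable to delete '$index_file' - $!";
    }
    rename($working_copy, $index_file)
        or return "unable to rename '$working_copy' to '$index_file' - $!";

    return '';    # empty string means success, following the $err convention
}
```

A caller would check the return value the same way as elsewhere in this document: a non-empty string is an error message.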
=cut

=item Cancel

=cut

=item Capitalize

Usage:

    my $cap_string = &Capitalize($string);

Capitalizes English-language strings.

=cut

=item CheckEmail

Usage:

    my $err = &CheckEmail( $address );
    if ($err) {
        print "Error: $err.\n";
    }

Checks whether the argument is a valid email address:

    address is not blank
    contains text @ text
    text following @ is a valid hostname (can be resolved)

Based on Ian Dobson's CheckEmail function.

=cut

=item Close

=cut

=item CompressStrip

Processes the HTML text and various subfields like Title and Description.

=cut

=item Crawler_new

Usage:

    my %response = $crawler->webrequest(
        'page' => 'http://www.xav.com/scripts/',
        'limit' => 'http://www.xav.com/',
    );
    if ($response{'err'}) {
        print "Error: $response{'err'}\n";
        exit;
    }
    print "The HTML text of this web page is:\n\n";
    print $response{'text'};

=cut

=item DeleteFromPending

Usage:

    my ($err, $delcount) = &DeleteFromPending( $realm, \@urls );

=cut

=item FD_Rules_new

Initializes the object that manages system settings.

=cut

=item FlockEx

Usage:

    if (&FlockEx( $p_filehandle, 8 )) {
        # okay
    }

Abstraction layer to protect non-flock systems.

=cut

=item FormatDateTime

=cut

=item FormatNumber

Usage:

    my $num_str = &FormatNumber( $expression, $decimal_places, $include_leading_digit, $use_parens_for_negative, $group_digits, $euro_style );

Arguments:

$expression: Required. Expression to be formatted.

$decimal_places: Optional. Numeric value indicating how many places to the right of the decimal are displayed. Note: truncates $expression to $decimal_places, does not round.

$include_leading_digit: Optional. Boolean that indicates whether or not a leading zero is displayed for fractional values.

$use_parens_for_negative: Optional. Boolean that indicates whether or not to place negative values within parentheses. Style is used for outbound formatting only; inbound parsing always uses "-" for negative values (Perl's internal format).

$group_digits: Optional. Boolean that indicates whether or not numbers are grouped using the comma.

$euro_style: Optional. If 1, then "." separates thousands and "," separates decimals, for example "800.234,24" instead of "800,234.24". Style is used for outbound formatting only; inbound parsing always uses "." for the decimal point (Perl's internal format).

Prototyped to match Microsoft's FormatNumber function for vbscript/jscript, with the limitation of not knowing about default settings. Microsoft specification at http://msdn.microsoft.com/scripting/vbscript/doc/vsfctFormatNumber.htm or from http://msdn.microsoft.com/scripting/.
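To make the argument descriptions above concrete, here is a rough sketch of how the truncation, grouping, and euro-style rules could combine. It is an illustrative re-implementation under stated assumptions, not the real FormatNumber: the name format_number_sketch is hypothetical, omitted optional flags are assumed to be false, and exponent notation such as "1e3" is assumed to count as non-numeric.

```perl
use strict;
use warnings;

# Hypothetical re-implementation for illustration only; the real
# FormatNumber may differ in defaults and edge cases.
sub format_number_sketch {
    my ($expression, $decimal_places, $include_leading_digit,
        $use_parens_for_negative, $group_digits, $euro_style) = @_;

    # Anything that is not a plain decimal number is treated as 0.
    $expression = 0
        unless defined($expression)
            && $expression =~ /^-?(?:\d+\.?\d*|\.\d+)$/;
    $decimal_places = 0 unless $decimal_places;

    my $is_negative = ($expression =~ s/^-//) ? 1 : 0;
    my ($whole, $fraction) = split(/\./, $expression, 2);
    $fraction = '' unless defined $fraction;

    # Truncate, never round: keep the first $decimal_places digits, zero-padded.
    $fraction = substr($fraction . ('0' x $decimal_places), 0, $decimal_places);

    $whole =~ s/^0+(?=\d)//;               # drop redundant leading zeros
    $whole = '0' if $whole eq '';
    $is_negative = 0 unless "$whole$fraction" =~ /[1-9]/;    # avoid "-0.00"

    my ($thousands_sep, $decimal_sep) = $euro_style ? ('.', ',') : (',', '.');
    if ($group_digits) {
        # Insert one separator per pass, working from the right.
        1 while $whole =~ s/^(\d+)(\d{3})/$1$thousands_sep$2/;
    }
    $whole = '' if $whole eq '0' && $decimal_places > 0 && !$include_leading_digit;

    my $result = $whole;
    $result .= $decimal_sep . $fraction if $decimal_places > 0;
    if ($is_negative) {
        $result = $use_parens_for_negative ? "($result)" : "-$result";
    }
    return $result;
}
```

With these assumptions, format_number_sketch('800234.249', 2, 1, 0, 1, 1) yields "800.234,24", matching the euro-style example in the argument list, and the third decimal digit is dropped rather than rounded.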
Error handling: if $expression is not numeric, it is treated as 0.

=cut

=item GetCrawlList

Usage:

    my @list = ();
    my $count = 0;
    my $age = $::FORM{'StartTime'};
    if ($::FORM{'DaysPast'}) {
        $age -= (86400 * $::FORM{'DaysPast'});
    }
    my $err = &GetCrawlList( $realm, $age, $max_list_size, \@list, \$count );

Retrieves a @list of all web pages in the '$realm' realm that are older than $age. $count is the size that @list would be if no limits were imposed. @list will actually contain between 0 and $max_list_size elements.

The max_list_size option is available to save memory.

=cut

=item GetFiles_new

Used to enumerate all files and folders in a certain directory. Designed to use very little memory. Files are always returned in alphabetic order, which allows certain optimizations to be made.

Usage:

    my $fr = &fdse_filter_rules_new();
    my $gf = &GetFiles_new();
    $err = $gf->create_file_list(
        'base_dir' => $base_dir,
        'base_url' => $base_url,
        'fr' => \$fr,
        'tempfile' => "$file.temp",
        'no_older_than' => $num_seconds,
    );
    my $count = $gf->{'count'};
    $gf->resume_file_position( $start_pos );
    while (1) {
        my ($lastmodt, $size, $fullfile, $basefile, $url) = $gf->get_next_file();
    }
    $gf->quit(); # kills temp file

no_older_than is the number of seconds for the maximum tolerable age of the cache file. If the file exists and is older than this, then a new file will be created.

=cut

=item LoadRules

Usage:

    $err = &LoadRules();

Wrapper around the FD_Rules object and its own loadrules() method. Adds additional processing. Writes directly to the global %::Rules hash. Writes some derived data to %::const as well.

=cut

=item LockFile_get_read_access

Gets read access to the file. Handles the "create_if_needed" logic. Tries to restore a stale "working_copy" file if no copy of the original file exists.

=cut

=item LockFile_new

This package provides an object-oriented approach to file I/O, with support for file locking and standardized error handling.
Usage:

    my ($err, $obj, $p_rhandle, $p_whandle) = ();
    Err: {
        $obj = &LockFile_new(
            'create_if_needed' => 1,
        );
        ($err, $p_rhandle) = $obj->Read( $file );
        next Err if ($err);
        while ($_ = readline($$p_rhandle)) {
            print $_;
        }
        $err = $obj->Close();
        next Err if ($err);
        last Err;
    }
    continue {
        print "Error: $err.\n";
    }

=cut

=item Merge

=cut

=item ParseRobotFile

Usage:

    my @forbidden_paths = &ParseRobotFile( $RobotText, $my_user_agent );

Accepts the text of a robots.txt file, and the string name of the current HTTP user-agent. Parses through the file and returns an array of all forbidden paths that apply to the current user-agent.

=cut

=item PrintOrderedHash

Usage:

    my $err = &PrintOrderedHash( \%hash, $by_value, $ascii_sort, $ascending, $date_map );

=cut

=item PrintTemplate

Usage:

    &PrintTemplate( $b_return_as_string, 'tips.html', 'german', \%replace_values, \%visited, \%cache );

See "admin_help.html" for extensive documentation on this function, its limitations, its failure scenarios, etc.

=cut

=item RawTranslate

Usage:

    my $lc_ai_string = &RawTranslate($string);

Returns a lowercase, accent-stripped version of its input. Replaces HTML-encoded characters with their ASCII equivalents.

This function is called mainly by &CompressStrip; also by &LoadRules when preparing the code for ignore words.

See http://www.utoronto.ca/webdocs/HTMLdocs/NewHTML/iso_table.html

=cut

=item Read

=cut

=item ReadFile

Usage:

    my ($err, $text) = &ReadFile($file);
    if ($err) {
        print "Error: $err\n";
    }
    else {
        print "File '$file' contains:\n";
        print "$text\n";
    }

Easy-to-call file-reading function.

Calls the super-robust LockFile object under the hood, which is a relatively expensive call. This is done for operations which read data from the file system into memory, and then save data back to the file system. For these operations, we cannot afford to have a single failed read operation cause permanent data loss. An example of a read failure would be "file locked for writing by another process".

=cut

=item ReadFileL

Usage:

    ($err, $text) = &ReadFileL( $filename );

Returns the text of the given file, or an error. Uses direct disk I/O rather than the more expensive LockFile package.

=cut

=item ReadInput

Reads CGI form input, or command-line parameters. Initializes %$p_FORM and assigns values.

Usage:

    &ReadInput();

Abstracts the source of the commands (can be query string, standard input, or command-line parameters). Automatically updates global hash %::FORM.

=cut

=item ReadWrite

=cut

=item Resume

=cut

=item SaveLinksToFileEx

Usage:

    my $err = &SaveLinksToFileEx(
        $p_realm_data,
        $ref_crawler_results,
        $ref_spidered_links,
        $ref_links_new,
        $ref_links_visited_fresh,
        $ref_links_visited_old,
        $ref_links_error,
    );
    if ($err) {
        print "Error: $err.\n";
    }

Saves all links from this crawl session to the pending pages file (search.pending.txt).

File format is:

    URL &ue(realm) number

where number is one of:

    0  => waiting to be indexed
    2  => encountered problems during index
    2+ => epoch time of the index operation

=cut

=item SearchIndexFile

Usage:

    &SearchIndexFile( $index_file, $search_code, \$pages_searched, \@HITS );

Searches the given index file. Uses by-reference return values for the total pages searched and the array of hits.

=cut

=item SearchRunTime

Usage:

    &SearchRunTime( $realm, $DocSearch, \$pages_searched, \@HITS );

=cut

=item SelectAdEx

Usage:

    my @Ads = &SelectAdEx();

Returns the text for up to 4 ads. If keywords are present in $::private{'search_term_patterns'} then the ads will be keyword-based.

=cut

=item SendMailEx

Specification

Lightweight, portable Perl library for sending mail in a reliable fashion. Designed for the occasional message, not for being a massive 24x7 mailer.

Requirements:

    absolutely zero dependencies; no external Perl modules, etc.
    clean: use strict, -w, -W, -T, prototypes ok
    callable as a single standalone function, not a package. use byref hash to optionally preserve state between calls
    must be able to send mail w/ raw sockets for those hosts without command-line sendmail (NT)
    must be able to send mail w/ command-line sendmail for those hosts without socket privileges on port 25 (free webhosts)
    allow caller to specify buffered/unbuffered I/O (sysread vs read, syswrite vs print)
    must be very safe with user data - try really hard not to lose messages (retry, option to save to disk on socket failure, etc.)
    able to send mail multiple ways - sockets, |sendmail, or save-to-file
    must comply with "run 4ever" goal - don't overflow file system with saved messages, etc.
    allow verbose/debug mode which traces all socket traffic
    when possible, should auto-detect necessary SMTP servers - currently uses `nslookup`
    use extracted strings array for error messages.
    allow caller to import a translated set.
    do not write to STDOUT; do your work and return error status; let calling code deal with the user

Internal Structure:

Network Client Cache - %nc_cache - $p_nc_cache

hash (or reference to) with:

    values:
        V:loaded = 1 or undef depending on whether these values have been queried:
            $$p_nc_cache{'V:PF_INET'} = PF_INET();
            $$p_nc_cache{'V:SOCK_STREAM'} = SOCK_STREAM();
            $$p_nc_cache{'V:PROTO'} = scalar getprotobyname('tcp');
    hostnames: (all hostnames converted to lowercase)
        H:foo.bar.com => 4-byte IP address or undef()

Usage:

    my $message = <<"EOM";
    Hi there Bob!

    How has life been treating you?

    Regards,
    Joe
    EOM

    my ($err, $trace) = &SendMailEx(
        'to' => 'user@host.com',
        'to name' => 'Bob User', # *
        'from' => 'me@host.com',
        'from name' => 'Sally User', # *
        'subject' => 'Hi Sally', # *
        'message' => $message,
        'host' => 'mail.foo.com', # *
        'port' => 25, # *
        'saveto' => 'e:/saved_msgs',
        'max_saved_messages' => 1000,
        'handler_order' => '12345',
        'always_save' => 1,
    );
    # * optional field

    if ($err) {
        print "Error: $err.\n";
    }
    else {
        print "Success: sent mail okay.\n";
    }
    print "Here is the trace:\n\n";
    print "\n$trace\n\n";

SendMailEx knows of 2 ways to handle a message:

1. Pipe the message to a process, such as /usr/sbin/sendmail or c:/blat.exe, defined with the 'pipeto' parameter. If using /usr/sbin/sendmail, include the "-t" flag in the pipeto input, i.e.:

    'pipeto' => '/usr/sbin/sendmail -t',

2. Deliver to a known SMTP server, defined using the 'host' parameter.

The options are listed above in the order of speed and reliability. Saving the message to a folder is generally just a failover method to prevent the loss of user data - no message will actually be sent.

By default, SendMailEx will attempt those methods in order. You can override this with the 'handler_order' parameter, which is a string like "12345" or "54321" or "23".

If parameters 'pipeto', 'host', or 'saveto' aren't defined, this process will skip the handling methods which depend on them.

=cut

=item SetDefaults

Usage:

    my $text = &SetDefaults( $html, \%params );

Takes $html, which is an HTML fragment including FORM elements, and sets all default attributes to match %params.

Requires strict format: generally will accept double-quoted attributes, and unquoted attributes which don't contain any embedded space.

In the case of replacing "hidden"-type fields, will only insert new values for hidden form elements that do not already have a value.

This code will insert CHECKED and SELECTED attributes for the appropriate form elements, but will not overwrite existing CHECKED and SELECTED attributes. The recommended way to formulate your input forms is to not use these explicit defaults.

The code will overwrite default VALUE="x" values for INPUT TEXT and INPUT PASSWORD and TEXTAREA.

=cut

=item StandardVersion

The following three functions return the HTML text for printing a single hit.
&StandardVersion() returns the normal text, &AdminVersion() returns the same text as StandardVersion with the addition of "Edit" and "Delete" buttons as well as re-routing all links through the redirector.

Usage:

    my $textoutput = &StandardVersion(%pagedata);

=cut

=item Suspend

Used for ReadWrite activity that spans multiple object lives. Two relevant methods, Suspend and Resume.

Suspend saves the read/write depth of the related files to the $filename.exclusive_lock_request file.

Resume opens the files as would ReadWrite (does the opposite checks - the .elr and .tmp must exist). It seeks to the appropriate places in the files before handing the handles back.

=cut

=item Trim

Usage:

    my $word = &Trim(" word \t\n");

Strips whitespace and line breaks from the beginning and end of the argument.

=cut

=item UpdateIndex

For local realms. Update procedure used to update all records.

Usage:

    ($err, $is_complete) = &UpdateIndex( $p_realm_data );

Algorithm: (Must all be done in a single process... not restartable...)

    Use GetFiles() to create a list of all files and their lastmod times
    Build a hash of $lastmod{url} = time

    loop through all records in the existing index {
        unless lastmod(url) {
            delete record
            next
        }
        if (lastmod(url) == lastmod_index) {
            preserve record
        }
        else {
            (file = url) =~ s!^base_url!base_dir!o;
            record = build_new_record(file)
            update record
        }
        delete lastmod(url)
    }
    foreach (keys %lastmod) {
        (file = url) =~ s!^base_url!base_dir!o;
        record = build_new_record(file)
        insert record
    }

=cut

=item WriteFile

Usage:

    $err = &WriteFile( $file, $text );

This is a wrapper around the LockFile object and its ReadWrite method. Useful for writing small text files where the entire file contents can be stored in memory ($text).

=cut

=item WriteRule

Attempts to save the name-value pair to the Rules hash.

If the $name-$value pair being assigned is already the current setting in %::Rules, then this function will short-circuit and return a success result.
Usage:

    $err = &WriteRule( $name, $value );
    if ($err) {
        print "Error: $err.\n";
    }

=cut

=item _fdr_validate

Usage:

    my $FDR = &FD_Rules_new();
    my ($is_valid, $valid_value) = $FDR->_fdr_validate($name, $value);

Returns a Boolean indicating whether the rule is valid, according to the internal %defaults hash. Note that $name's which are not defined in %defaults will always return as valid, with $valid_value = $value.

For Boolean data types, a $value which is undefined or a null string will return $is_valid = 1 with $valid_value = 0.

Returns $valid_value as either argument $value, or the onboard default.

=cut

=item _handle_folder

Recursively-called function for gathering all the files in a folder which need to be indexed.

=cut

=item _load_filter_rules

=cut

=item add

This method will check for the existence of index files; if they don't exist, it will attempt to create a zero-byte file. If the creation fails, it will not load the realm.

=cut

=item add_filter_rule

Usage:

    $err = $fr->add_filter_rule();

=cut

=item admin_link

Usage:

    my $link = &admin_link(
        'Action' => 'Foo',
        'Name' => 'Value',
    );

Returns an admin URL with the passed name-value parameters. Will URL-encode the names and values.

=cut

=item admin_main

Usage:

    $err = &admin_main();

=cut

=item anonadd_main

Function controlling visitor submissions of URL's.

=cut

=item basetime

=cut

=item build_plural_pattern

Usage:

    $term = &build_plural_pattern( $term );

Returns a Perl regular expression which will match all common plural forms of $term.

If $term is a phrase (i.e., contains embedded spaces), then each word in $term will be converted to the appropriate pattern.
Thanks to http://owl.english.purdue.edu/handouts/grammar/g_spelnoun.html

    my %tests = (
        'dog' => 'dogs?',
        'dogs' => 'dogs?(es)?',
        'potato' => 'potato(es|s)?',
        'potatoes' => 'potato(|e|is|es)',
        'potatos' => 'potatos?(es)?',
        'school' => 'schools?',
        'church' => 'church(es)?',
        'zoo' => 'zoos?',
        'fox' => 'fox(es)?',
        'foxes' => 'fox(|e|is|es)',
        'guess' => 'guess?(es)?',
        "\\ family \\" => "\\ famil(ies|y) \\",
        'family' => 'famil(ies|y)',
        'family of dogs' => 'famil(ies|y) ofs? dogs?(es)?',
        'family of dog' => 'famil(ies|y) ofs? dogs?',
    );
    my ($in, $out);
    while (($in, $out) = each %tests) {
        my $test_out = &build_plural_pattern( $in );
        if ($test_out eq $out) {
            print "test '$in' to '$out' ok\n";
        }
        else {
            die "error - '$in' converted to '$test_out' but expected '$out'";
        }
    }

=cut

=item check_filter_rules

TODO: document the p:, p:m:, and _udav namespaces

Note: all regex passed to this subroutine are already guaranteed valid by the &validate() routine called earlier by the object. Thus no error checking is done on regex.

Usage:

    my $url_to_get = 'http://www.xav.com/';
    my $document_text = '';

    my $fr = &fdse_filter_rules_new();

    my ($is_denied, $requires_approval, $promote_val, $filter_err, $no_update_on_redirect, $b_index_nofollow, $b_follow_noindex) = ();

    # First pass: check the URL alone, before fetching the document.
    ($is_denied, $requires_approval, $promote_val, $filter_err, $no_update_on_redirect, $b_index_nofollow, $b_follow_noindex) = $fr->check_filter_rules( $url_to_get, '', 1);
    if ($is_denied) {
        print "URL '$url_to_get' is denied - $filter_err\n";
        exit;
    }

    # Assumes a get() function is available, such as the one exported by LWP::Simple.
    $document_text = get( $url_to_get );

    # Second pass: check again with the document text.
    ($is_denied, $requires_approval, $promote_val, $filter_err, $no_update_on_redirect, $b_index_nofollow, $b_follow_noindex) = $fr->check_filter_rules( $url_to_get, $document_text, 0);
    if ($is_denied) {
        print "URL '$url_to_get' is denied - $filter_err\n";
        exit;
    }
    if ($requires_approval) {
        # queue
    }
    else {
        # add to index
    }

=cut

=item check_parse_patterns

Usage:

    &check_parse_patterns( $text, \%metadata );

=cut

=item check_regex

Usage:

    $err = &check_regex($pattern);

Checks against ?{} code-executing expressions. Uses an eval wrapper to confirm that the expression is valid.

=cut

=item check_rule

=cut

=item choose_interface_lang

Usage:

    ($err, $options_string, $lang) = &choose_interface_lang(
        $b_is_admin_rq,
        &query_env('HTTP_ACCEPT_LANGUAGE'),
    );
    next Err if ($err);

This subroutine provides the logic for selecting which language to use, based on the various user settings (via the function arguments) and the system settings (via the global %::Rules hash).

Return value is $options_string as a chain of