Website Provider: Outflux.net
URL: http://jnocook.net/web/files.htm




File Organization for Web Pages

(Up Feb 77, last modified 7 Jan 04) (Oct 06, email addresses munged -- beware) This page deals with the organization of HTML files at the server location and on your DOS or Windows machine, and offers some DOS/Win batch file utilities and Unix scripts. Revised recently to reflect current practice.

What Other People Do

What other people do is use the editor supplied with some browser (like "Composer"), or something else which keeps them away from actual tags and stuff. Then the click of a button will start up the modem, make a ppp connection, start up an ftp session, and move the files onto some remote directory. Whamo!

If you use one of the browser-supplied web page editors, it can only be guaranteed that things will look great for *your* browser, on *your* machine, and might crash visually when seen on almost any other screen. I said that before.

I recommend hand coding. Use Qedit in DOS or in a DOS box in Windows, use Notepad or Wordpad in Windows, use "something plain text" on a MAC, use Vi, Emacs, or Pico on a Unix system.

If you must have a Windows HTML editor check out [HTML Writer], which allows you to hand code, and simultaneously view the results with Netscape (or another browser) (it is 16 bit, though) (also, the entities are broken).

To learn how to code by hand, just start. Get the document [HTMLPrimer] at NCSA at UIUC, read it, and in twenty minutes you will know how to write web pages. The advantage you have as an inexperienced beginner is that your tags will be more universal, easier on other browsers, and give more predictable results, simply because you will limit the tags to basics.

The server at the UNIX Box

(added Jan 2004:) I always assumed that the web page server would be at a Unix box, since Unix has been the mainstay of the internet for 30 years. But now there are also MS Servers and even MAC machines. I suggest you avoid these: MS is incredibly buggy, and my only experience with MAC is that it was exceedingly slow.

To continue, the file server is the program which receives requests for web files, and sends them if they exist at predetermined locations within the local file system. The best, fastest, most frequently used, and most secure server is the open source "Apache".

Before proceeding any further, it is important to know how the Unix system -- where your files are publicly located -- organizes your files. Most likely you will have an account in your name. If your account name is "fred", it is perhaps located in the directory "/usr/home/fred" or "/home/fred".

Directories are files which hold the names of other files. On MACs, and more recently on Windows, they are called "folders". I have no idea how things are arranged on MAC or MS Servers. But all 24 Unix systems use nearly the same file tree.

The "path" is the route to the location of a file from the start of things (the root), signified with "/". All the other forward slashes are separators between directories within directories, etc (DOS and Windows use a backward slash).

The last name on the path string is either another directory, or the actual file. The path to your account directory is "/usr/home/fred/". If you have a file there by the name of "statistics" then the fully qualified path-and-name is "/usr/home/fred/statistics".
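
As a small illustration of the path terms above, here is a sketch using Python's standard pathlib module. It only takes the names from the text apart as strings; nothing here touches a real file system.

```python
from pathlib import PurePosixPath, PureWindowsPath

# The fully qualified path-and-name from the text.
full = PurePosixPath("/usr/home/fred/statistics")

# "/" is the root; every later slash is only a separator.
print(full.parts)    # the root, each directory in turn, then the file
print(full.parent)   # the account directory
print(full.name)     # the last name on the path string

# DOS and Windows use a backward slash for the same job.
local = PureWindowsPath(r"c:\web\files") / "index.htm"
print(local)
```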

Within your home directory on the UNIX box, you will have a subdirectory called (most often) "public_html" which will contain all the html and image files, and may contain additional subdirectories such as "images" or "cgi_bin". You can add additional subdirectories. This is more or less standard practice, for UNIX file organization tends to be fairly rigid.

When viewers request web page files from your site, they will type something like

http://wherever.com/~fred


Here the "~" symbol means "home directory of .." ("fred" in this case). Note that the "wherever.com" is not case sensitive, but "fred" is case sensitive, because Unix file (and directory) names are case sensitive.

The HTTP server will expand "/~fred" to "/usr/home/fred/", will automatically add "public_html/index.html" to this chain, and will deliver the file "index.html" to the viewer. The HTTP server thus fills in all the gaps.


Alternately, if the index.html file is not found, the server (if the system administrator has set this up) will deliver the file index.htm. The server has a list of default files which it will look for. If neither is found, the server (most likely) will deliver a directory listing. You don't want that.
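
The lookup described above can be sketched in a few lines of Python. This is only an illustration of the order of events, not how Apache is written; HOME_ROOT, PUBLIC_DIR, and DEFAULT_FILES are assumed values taken from the examples in the text, and a real server reads them from its configuration.

```python
import posixpath

# Assumed values, taken from the examples in the text.
HOME_ROOT = "/usr/home"
PUBLIC_DIR = "public_html"
DEFAULT_FILES = ("index.html", "index.htm")


def expand_user_url(url_path):
    """Map a URL path such as '/~fred/pix/a.gif' to a file system path."""
    if not url_path.startswith("/~"):
        raise ValueError("not a /~user request: " + url_path)
    user, _, rest = url_path[2:].partition("/")
    return posixpath.join(HOME_ROOT, user, PUBLIC_DIR, rest)


def pick_file(fs_path, is_file):
    """Return the file to deliver, or None when only a listing is left.

    `is_file` is a callable (os.path.isfile in real use), so the lookup
    order can be tried out without a real file system.
    """
    if not fs_path.endswith("/") and is_file(fs_path):
        return fs_path
    for name in DEFAULT_FILES:          # tried in the listed order
        candidate = posixpath.join(fs_path, name)
        if is_file(candidate):
            return candidate
    return None
```

With a pretend file system holding only "/usr/home/fred/public_html/index.htm", a request for "/~fred" falls through to index.htm, and a request for a subdirectory without any index file returns None, which is the case where a server may show a directory listing.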

Check to see if the server might deliver a file directory; if this happens, just include a blank "index.html" file in the directory whose contents you do not want seen by the public.
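
If there are many subdirectories, the blank index files can be made in one go. The following Python sketch assumes the server looks for "index.html" and then "index.htm" (check with your administrator); the function name add_blank_index is made up for this example. It leaves alone any directory which already has one of the two files, so that an existing index.htm is not hidden behind a blank index.html.

```python
import os

# Assumed list of default files; ask what your server really looks for.
DEFAULT_FILES = ("index.html", "index.htm")


def add_blank_index(top, name="index.html"):
    """Create an empty index file in each directory under `top` lacking one."""
    created = []
    for dirpath, _dirnames, filenames in os.walk(top):
        if any(default in filenames for default in DEFAULT_FILES):
            continue                      # already has an index file
        path = os.path.join(dirpath, name)
        open(path, "w").close()           # zero-length file
        created.append(path)
    return created
```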

The use of the more-or-less hidden directory public_html is a safety measure which keeps even smart users out of Fred's home directory. Smart administrators will alias your public_html directory to some other name, and may even place it on another machine. If you don't have a public_html directory, you have to make one. Or ask.

Aliased website names

But maybe you registered a domain name, something like "mywebpages.com". In that case someone has set up all the information in the DNS database, and the Apache server will know, when it receives a request for "mywebpages.com", to translate this internally to the location of your public_html directory.


Make sure the administrator has also set it up so that the same results happen if a viewer types "www.mywebpages.com" instead of "mywebpages.com". The "www" is never required, because it is not a domain which has to be looked up, and the record for "www.mywebpages.com" only exists locally. Amazing as this may sound, a college in Chicago has never entered the bare domain name in their DNS record. You cannot get Columbia College at "colum.edu", only at "www.colum.edu".

Directory Structure

The hard way to do things is to have all the files use html extensions, and to use fully qualified URLs, that is, URLs which spell out the protocol, the host name, and the complete path to the file.


This was my habit before changing web site locations six times in one year. It guards against picky or broken browsers which have problems parsing the complete URL from the snippets you supply within the html files, but in the end it was not worth the effort of rewriting files just for the sake of fully qualified URLs.

The Easy Way to do Things

Decide to use only *.htm extensions, and get the website provider's server to actually deliver the index.htm file if the index.html file is not found. You will be able to get away with less coding and shorter URLs.

At your home based DOS, Windows, or Linux machine, set aside a directory for your web pages, let's say "/web" (that might be "\web" in a DOS or Windows file system). Below this create directories for other purposes, maybe for image files, administrative matters, scripts, whatever, and an upload directory. Doing this might result in the following organization (I'll use Windows file system notation).

directory          usage
c:\web             temporary copies of files for uploading go here
c:\web\files       the web site htm/html files go here
c:\web\files\pix   the web site image files go here
c:\web\admin       various scripts, batch files, etc, go here

Simple enough?

All the rest of this document deals alternately with how to implement this with the greatest convenience, or with a few batch files, which are detailed below. But first, a look at what to expect in each of the directories.


c:\web

I keep this higher level directory empty, and only use it for files that are to be uploaded, and also for files which are being received from the web site. Whenever I'm done with ftp, I clear out the directory of these files.

Alternately, use this directory to store the ftp and telnet executables, and the "start.htm" file (see below).

By the way, ftp.exe is included with Windows98. Find it in the C:\Windows directory, and make it a desktop icon (link, shortcut).


c:\web\files

Keep all the htm files in this directory. If you also keep all the image files here it will simplify things somewhat. By the time you end up with a dozen html files and two dozen images, it may be time to move the image files elsewhere.


c:\web\files\pix

How easy it is to look through a few hundred files which mix *.htm, *.gif, and *.jpg depends on the capabilities of the directory browser you use. You may want to keep image files segregated from the html files, because it may just get too difficult to find anything in a directory which lumps all of them together. I haven't had trouble until I hit 1000 files in a directory. Do what you want.

The URLs for an image file would then be written in web files as ..

<IMG SRC="./pix/filename">

.. and at the UNIX system you would use the same names and the same directory structure. You could use any name whatsoever instead of "pix".


c:\web\admin

Used (by me) to store Unix scripts, CGI files, and other stuff. I also have separate directories for incoming images (which need to be "worked on" before they become part of an html file), etc.

Browser Setup for Local Viewing

Any browser can be set up to start up with a local file. Create the file, here called "start.htm", and save it (perhaps in the directory "\web").

The file start.htm could read as follows, where I have also added the ability to jump to the remote web site or start up other files (like a bookmark file, or a listing of search engines)...

<h2>START.HTM at GW</h2>

<a href="file:///c|/web/index.htm">[spaces local]</a>
<a href="http://spaces.org/index.htm">[Spaces remote]</a>

The start.htm file will be read even though it is not a proper html file. Go to the start.htm file with the browser, and then select it to be the default starting point (rather than logging on to some remote location every time you start up the browser). The additional advantage of starting a browser with a local file is that you will not have to establish a PPP phone connection to see your files.

Note that all the slashes above are forward. Some browsers will take forward slashes and translate them correctly for a Windows file system, some will not. Some require only "c:\net\start.htm" -- that is, without the "file" and the three slashes. Some will require the colon of "hard drive C" to be substituted with a pipe symbol (|).
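
The two URL forms mentioned above can be produced from a DOS path with a few lines of Python. This is a minimal sketch: the function name local_file_url is made up, it expects a path which starts with a drive letter, and it does not escape spaces or other special characters.

```python
def local_file_url(dos_path, pipe_for_colon=False):
    """Turn a path like c:\\web\\start.htm into a file:/// URL."""
    drive, rest = dos_path[:2], dos_path[2:]
    if len(drive) != 2 or drive[1] != ":":
        raise ValueError("expected a path starting with a drive letter")
    if pipe_for_colon:
        drive = drive[0] + "|"            # the form with the pipe symbol
    return "file:///" + drive + rest.replace("\\", "/")


print(local_file_url(r"c:\web\start.htm"))
print(local_file_url(r"c:\web\start.htm", pipe_for_colon=True))
```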

If you don't want all the links (I show only two) in a line, precede each anchor with <LI> or <BR> -- the BR tag is more compact.

DOS Batch File Utilities

The following describes a set of utilities which depend on DOS Commands, any of which can be found (perhaps under different names) at [http://uiarchive.cso.uiuc.edu/info/search.html]. Specifically, these include GREP, GSAR, FOLD, NODUPES, AWK and BASENAME (from the Timo utilities at Garbo). Download a Zip file of these (126K), complete with docs. [download]

The following batch files use these utilities. The batch files operate very fast, taking only a second or two to make changes to a hundred files.

In the following listings, I have deleted the paths for the called executables -- presumably your utilities are to be found on the path.

Some batch files, especially a few which operate from a logged directory come to a stop (with a "pause") to ask, "Do you really want to do this?"


Webfix.bat

The following uses Gsar exclusively to make specific changes to all the *.htm files in a directory. To use the batch file, rewrite it for the specifics of the changes you want to make.

The example below rewrites the color codes for a batch of htm files. Other examples below.

@echo off
echo. webfix.bat - from logged directory
:: less-than is :060
:: greater-than is :062
:: space is :032
:: use :: for :
:: EOL is :013:010
:: -o overwrite original file
:: -i no case check
echo. changing colors
 gsar -s#ddFFFF -r#ccFFFF -i -o *.htm
 gsar -s#FFddFF -r#FFccFF -i -o *.htm
 gsar -s#FFFFdd -r#FFFFcc -i -o *.htm
 gsar -s#ddFFdd -r#ccFFcc -i -o *.htm
 gsar -s#ddddFF -r#ccccFF -i -o *.htm
 gsar -s#FFdddd -r#FFcccc -i -o *.htm
 gsar -s#0000FE -r#0000FF -i -o *.htm
echo. ...........done


The above file reads through all the htm files (seven times in this case) to make changes, overwriting the original files.
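
For those without Gsar, the same kind of change can be sketched in Python. This is not a copy of what Gsar does internally, only an approximation of the "-i" (ignore case) and "-o" (overwrite) behaviour described in the comments of the batch file; the function names are made up. Unlike the batch file, it reads each file once and applies all the changes in turn.

```python
import glob
import re


def replace_ignoring_case(text, search, replacement):
    """Replace every occurrence of `search`, ignoring case."""
    pattern = re.compile(re.escape(search), re.IGNORECASE)
    # A function as the replacement keeps any backslashes in it literal.
    return pattern.sub(lambda match: replacement, text)


def fix_files(changes, pattern="*.htm"):
    """Apply (search, replacement) pairs to each matching file, in place."""
    for filename in glob.glob(pattern):
        # latin-1 and newline="" pass every byte and EOL through unchanged.
        with open(filename, encoding="latin-1", newline="") as handle:
            original = handle.read()
        fixed = original
        for search, replacement in changes:
            fixed = replace_ignoring_case(fixed, search, replacement)
        if fixed != original:
            with open(filename, "w", encoding="latin-1", newline="") as handle:
                handle.write(fixed)


# fix_files([("#ddFFFF", "#ccFFFF"), ("#FFddFF", "#FFccFF")])
```

As with the batch file, the originals are overwritten, so try it on copies first.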

Some additional examples..

Note that I almost always include (and reinsert) some markers, like parentheses, quotes, or colon_slash_slash, so that the phrase will only be changed at the desired location, and not in some arbitrary location in the text which might happen to contain the same wording as the search specifies.


Webgrep.bat

To find the existence of a word or phrase in a set of htm files, use Webgrep.bat, listed below. This batch file writes a list of the lines containing the asked-for text, along with the names of the files where these occur, to the file "foo".

@echo off
grep  -i  "%1" *.htm  > foo


If, for example, you are looking for the word "foobar," start this batch file from the directory where you want to do the search, by typing..

webgrep foobar

If you have a listing program or an editor available, you can open up foo for screen display by adding another line to the batch file...

list foo
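
A rough Python equivalent of the search, for machines without grep. It is a sketch only: it looks for the plain text of the phrase rather than a grep pattern, and the function names are made up. The report goes to the file "foo", as in the batch file.

```python
import glob


def grep_lines(lines, phrase, filename):
    """Yield 'filename:line' for each line containing `phrase`, any case."""
    wanted = phrase.lower()
    for line in lines:
        if wanted in line.lower():
            yield filename + ":" + line.rstrip("\r\n")


def webgrep(phrase, pattern="*.htm", report="foo"):
    """Write the matching lines of all matching files to `report`."""
    with open(report, "w", encoding="latin-1") as out:
        for filename in sorted(glob.glob(pattern)):
            with open(filename, encoding="latin-1") as handle:
                for hit in grep_lines(handle, phrase, filename):
                    out.write(hit + "\n")


# webgrep("foobar")
```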


Weblinks.bat

To make a list of all the links which are referenced in a set of html files, use the following batch file, here called Weblinks.bat.

@echo off
echo. weblinks.bat, to extract background, href, src files-names
echo. writes to file "links",  from LOGGED DIRECTORY
:: initialize 1.tmp
echo. > 1.tmp
echo. search for .htm .gif .jpg (takes time)
:: all file names are lower case
:: all filename sources are quoted
awk " $0 ~/\.htm|\.gif|\.jpg/ { print } " *.htm >> 1.tmp
echo. save src hrefs background
:: first change all to upper case
gsar -ssrc= -rSRC= -i -o 1.tmp
gsar -shref= -rHREF= -i -o 1.tmp
gsar -sbackground= -rBACKGROUND= -i -o 1.tmp
awk " $0 ~/SRC|HREF|BACKGROUND/ { print } " 1.tmp > 2.tmp
echo save only quoted text
awk " $0 ~/\"/ { print } " 2.tmp > 3.tmp
echo. subst lf for quotes
gsar -s" -r:013:010  -i -o 3.tmp
echo. subst lf for =
gsar -s= -r:013:010 -i -o 3.tmp
echo. selecting out .htm .gif .jpg
awk " $0 ~/\.htm|\.gif|\.jpg/ { print $1 } " 3.tmp > 4.tmp
echo. deselecting  hash
awk " $0 !~/\#/ { print $0 } " 4.tmp > 5.tmp
echo. sorting (long wait)
sort < 5.tmp > 6.tmp
echo. removing duplicate lines
nodupes 6.tmp > links
echo. remove temp files
del ?.tmp
echo. done with weblinks


What you get here is a list of every htm, gif, and jpg file which is called upon by the set of html files in a directory, presented as a single column, in lower case, and sorted alphabetically. Only the name links (hash links) are ignored. Modify the above file to suit your needs.
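
The same extraction can be sketched in Python with one regular expression instead of a chain of temporary files. Like the batch file, it assumes the link targets are quoted, and it keeps any target which contains ".htm", ".gif", or ".jpg" and no hash mark. The names LINK, WANTED, and local_links are made up for this example.

```python
import re

# Attribute names and extensions follow the batch file above.
LINK = re.compile(r'(?:src|href|background)\s*=\s*"([^"]+)"', re.IGNORECASE)
WANTED = (".htm", ".gif", ".jpg")


def local_links(html):
    """Return the sorted, de-duplicated quoted link targets in `html`."""
    found = set()
    for target in LINK.findall(html):
        if "#" in target:
            continue                      # skip name (hash) links
        if any(ext in target.lower() for ext in WANTED):
            found.add(target)
    return sorted(found)
```

Since the whole file is searched as one string, the length of the lines does not matter here.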

One of the problems with the above batch file is that Awk has line length limits. The batch file will thus screw up if the file(s) to be inspected use very long lines, or are jammed. To clean up a file, and make it easy to read, use webjam.bat and unjam.bat, below, in succession.


Webjam.bat

I wrote this to jam html files before uploading them to the Unix box. Jamming is where you remove all the blank space and the end-of-line markers from a file so that it is, in effect, one endless line long. Browsers don't care; people go bonkers with this.

It saves a little space, saves upload time, saves server delivery time, makes them virtually impossible to read on "view source" from a browser, will get mangled if imported to a text editor, and tends to screw up the browser and print spooler of anyone who attempts to print the file.

I have seen meaner versions which start with a giant blank space, that is, a hundred EOL's, or worse, insert Form Feeds. This makes printers just spit out paper, and makes a viewer think that perhaps there is nothing there, shut down the printer, and give up.

@echo off
echo. WEBJAM.BAT calls other
echo. from logged directory
echo. jams ALL HTM files
echo. unless other extension is specified
set file=*.htm
if "%1"=="" goto skipped
set file=%1
:skipped
echo. do for -%file%- files?
:: stop here; Ctrl-C backs out
pause
echo. ... part 1: replace eol with space
for %%D in (%file%) do call webjam1.bat %%D
echo. ... part 2: replace double spaces
for %%D in (%file%) do call webjam2.bat %%D
echo. ............ done


The two called files (webjam1 and webjam2) follow:

@echo off
:: webjam1.bat, called by webjam.bat
:: replace eol with space
:: if from Linux, replace 10 with space
gsar -s:013:010  -r:032 -i -o %1
gsar -s:010  -r:032 -i -o %1


@echo off
:: webjam2.bat, called by webjam.bat
:: replace double spaces
gsar -s:032:032  -r:032 -i -o %1
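
The two jamming steps can be sketched as one Python function. This is an approximation, not a translation: webjam2.bat makes a single pass over double spaces, while the sketch below squeezes every run of spaces down to one. Note that either version will also flatten the contents of PRE blocks, where line breaks do matter.

```python
import re


def jam(text):
    """Replace every end-of-line with a space, then squeeze runs of spaces."""
    text = text.replace("\r\n", " ").replace("\n", " ")   # DOS, then Unix
    return re.sub(" {2,}", " ", text)


print(jam("<p>one\r\n  two\n<b>three</b>\n"))
```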


If you want to remove all extra blank spaces (browsers just skip over blanks anyway), then go through a set of files repeatedly, as follows..

@echo off
:: blanks.bat
:: replace almost all blank spaces
gsar -s:032:032:032:032 -r:032 -i -o *.htm
gsar -s:032:032:032 -r:032 -i -o *.htm
gsar -s:032:032 -r:032 -i -o *.htm
gsar -s:032:032 -r:032 -i -o *.htm
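
Why several passes are needed can be seen in a short Python sketch. It assumes that one search-and-replace pass works the way Python's str.replace does, left to right and without looking again at what it has just written; the function names are made up.

```python
def one_pass(text, search, replacement):
    """One left-to-right, non-overlapping replacement pass."""
    return text.replace(search, replacement)


def squeeze_blanks(text):
    """Repeat the double-space pass until no double space is left."""
    while "  " in text:
        text = one_pass(text, "  ", " ")
    return text


wide = "a" + " " * 8 + "b"
print(len(one_pass(wide, "  ", " ")))   # a single pass only halves the run
print(squeeze_blanks(wide))
```

A fixed number of passes, as in the batch file, handles runs up to a certain length; the loop keeps going until nothing changes.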



Unjam.bat

And this cleans things up again..

@echo off
echo. UNJAM.BAT from current directory
echo. removes all blanks, eols, adds EOL before every LT
echo. overwrites original
echo. ..........making changes
for %%D in (*.htm) do call unjam1.bat %%D
echo. ..........renaming
del *.htm
ren *.un *.htm
del foo


The called file, "unjam1.bat", is listed below. It will start all opening tags at the beginning of a line, leave closing tags at the end of the line they close, and introduce a blank line before any H, P, FRAME, LI, or CENTER tags. Adjust it to suit your needs. At the end of the script the lines get folded to 77 columns.

@echo off
:: unjam1.bat part of unjam.bat
:: less-than :060; greater-than :062; :: use :: for :
:: blank is :032, ? is :063, * is :042, EOL is :013:010  / is :047
echo. doing %1
:: write a bak file
basename %1
copy %1  %basename%.bak > nul
:: operate on a different file name, so this call doesnt repeat
copy %1  foo  >  nul
:: first add a blank space and eol at close of the file
gsar -s:060:047html:062 -r:060:047html:062:032:013:010 -i -o  foo
:: removes eols, for DOS
gsar -s:013:010 -r:032 -i -o  foo
:: removes eols (dec10) UNIX
gsar -s:010 -r:032 -i -o  foo
:: removes many blanks
gsar -s:032:032 -r:032 -i -o  foo
:: puts DOS eol in before opening tags
gsar -s:060 -r:013:010:060  -i -o  foo
:: removes eols  before terminating tags
gsar -s:013:010:060:047  -r:060:047   -i -o  foo
:: puts DOS eol in before H tags
:: has to do for both cases to retain local CAPS or LC
gsar -s:060h -r:013:010:060h   -o  foo
gsar -s:060H -r:013:010:060H   -o  foo
:: puts DOS eol in before P tags
:: has to do for both cases to retain local CAPS or LC
gsar -s:060p -r:013:010:060p   -o  foo
gsar -s:060P -r:013:010:060P   -o  foo
:: puts DOS eol in before FRAME tags
:: has to do for both cases to retain local CAPS or LC
gsar -s:060F -r:013:010:060F   -o  foo
gsar -s:060f -r:013:010:060f   -o  foo
:: puts DOS eol in before LI tags
:: has to do for both cases to retain local CAPS or LC
gsar -s:060li -r:013:010:060li   -o  foo
gsar -s:060LI -r:013:010:060LI   -o  foo
:: puts DOS eol in before CENTER tags
:: has to do for both cases to retain local CAPS or LC
gsar -s:060c -r:013:010:060c   -o  foo
gsar -s:060C -r:013:010:060C   -o  foo
:: fold text
fold -s -w 77 foo >  %basename%.un


Basename.exe (called three times in the batch file above) is available from Garbo in the Timo DOS utils subdirectory.
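
The unjam steps can also be sketched in Python, using the standard textwrap module in place of Fold. This is an approximation of the batch file above, not a translation: like the batch file it keys on the first letters after the less-than sign, so a blank line also appears before tags such as HTML, HEAD, or FONT. The function name unjam is made up.

```python
import re
import textwrap


def unjam(text, width=77):
    """Put opening tags on lines of their own and fold long lines."""
    text = re.sub(r"\s+", " ", text)          # drop EOLs, squeeze blanks
    text = text.replace("<", "\n<")           # EOL before every tag
    text = text.replace("\n</", "</")         # but not before closing tags
    # Extra EOL before tags starting with P, H, F, LI, or C (either case).
    text = re.sub(r"\n(?=<(?:p|h|f|li|c))", "\n\n", text, flags=re.IGNORECASE)
    lines = []
    for line in text.split("\n"):
        lines.extend(textwrap.wrap(line, width) or [""])
    return "\n".join(lines).lstrip("\n") + "\n"


print(unjam("<html><p>one <b>two</b></p><li>x</html>"))
```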

Some Unix Scripts

Here are a few Unix scripts. Most of these are in current use, almost all of them are run automatically from crontab entries. More on that below, but first a word about editors.

There are three editors available on Unix boxes. The fastest is Vi, but it is just a bitch to learn. Then there is Emacs, which will do just about anything, including logging on to the internet -- it represents a whole way of life. Too much stuff if you ask me. Finally there is Pico, the editor which comes with Pine, the email program.

I suggest Pico. Set up your default editor as follows, in the ".bash_login" or the ".profile" file as

EDITOR=pico
export EDITOR

.. or something similar if you are using another shell besides "bash". (To check what shell you have been assigned, do "finger {your account name}". To find how the shell works, do "man {shell name}" -- without the {} brackets or quotes.)

Link Integrity

Of all the administrative tasks you might get involved in, the most important is to make sure that all the links work and the expected files are in place at your website. You could just log in with a browser to your index file, and then check all the links, and check for all the images, but this can get out of hand by the time you have a thousand files.

The following Unix Perl script, webcheck, was written to shorten that job. The script looks at all the htm and html files (and other files such as shtml and php) at the site, and checks for the existence of all local links, even if located in subdirectories.

Webcheck will not find orphaned files (although it can be modified to do so -- see below), but that doesn't matter as far as your viewers are concerned. It will check the links of orphaned htm files, though. Webcheck will not check external links or name links (hash links). At the end of the task it will send you email with a list of missing files.

In 2000 I wrote a wrapper so that "webcheck" would operate recursively through all subdirectories from the root directory of a website.

The wrapper is called check and goes as follows (below). The assumption for the script is that check and webcheck will be found on the path. A likely placement would be in /usr/local/bin, which you probably have access to. Otherwise put the full path to webcheck in the -exec of find.

# $Id: check,v 3.2 2001/06/14 05:11:36 jno Exp $
# /usr/local/bin/check - recursive use of webcheck,
# this is a wrapper file -- takes time.
# --- invoke from any directory 12/11/00
find . -type d -exec 'webcheck' {}  \;


The Perl script for webcheck, complete, goes like this..

#!/usr/local/bin/perl
# $Id: webcheck,v 3.28 2002/11/30 20:50:54 jno Exp $
# /usr/local/bin/webcheck, Usage: webcheck {directory}
# 	 defaults to $ENV{LOGNAME} with no commandline parameters
# DESCRIPTION: A html linting utility written in Perl, which 
#        checks all internal anchor, img, and background links, 
#        and can be used recursively. See notes below program.

########## NOTE: Make a selection of file extension you wish this
##########       program to check for. Include images.

# "html" is listed before "htm" so that the longer extension is tried first
$extensions = "html|htm|txt|jpg|gif|zip|wav";

$currentdate = (`date`);
$currentdir = $ARGV[0] ;
# set cwd if argv[0] is not set
($ARGV[0]) or die "must specify directory\n";
chdir "$currentdir"  ;
print "   directory: $currentdir\n" ;

########## NOTE: add or delete the "htm", "html", or "shtml" file 
########## extensions as needed in the line below. See notes below.

foreach $filename (`ls *.htm *.html`) {   # 1- each file
  $all = () ;                             # reset the blurch
  undef $/;                               # K: undef eol  
  open (FILE, "<$filename");
  print "$filename";
  $all = join ("",<FILE>);                # all of file slurped up
  close FILE;

@code =();                                # find <..> segms and slurp
  while ($all =~ /<[^>]+>/gim) {push (@code, $&)}

$links=();                                # clear links collection
  foreach $code (@code) 
	{                                 # 2- each segment inspected
                                          # start not-any (" = blank),
                                          # repeat, follow with period, 
                                          # end with known extension.

  while ($code=~ /[^(\"|=| )]+\.($extensions)/gims) 
		{                         # 3-

    if ($& =~ /:\/\//) { next }           # skip http files

########## NOTE: Hash out the following line to speed up WebCheck,
##########       and see notes under "Orphans" below.

    elsif (-e $&) { system (touch, $&)}   # touch if exists

    else { $links .= " -- $&\n" unless -e $&};    # list if not exist
		}                                 #-3
	}                                         #-2

 if ($links) { $missing .= "$filename$links\n" }  # assoc w filename 
}                                                 #-1

#########       Send email locally (to owner) -- see notes
######### NOTE: if sendmail delays delivery, use procmail instead.
#########       Both forms are shown below.
#########       Or run /usr/sbin/sendmail -q as root

 if ($missing) {
open (MAIL, "|/usr/sbin/sendmail -oi -n -t"); 
# open (MAIL, "|/usr/bin/procmail -Y"); 
print MAIL <<EOF;
To:$ENV{LOGNAME}

 "WebCheck" searches all *.htm and *.html files in the logged directory
 for word chunks ending in the following file extensions...
    $extensions
 The current directory and listed paths of a link are inspected for the 
 existence of files. Fully qualified URLs and name anchors are skipped.
 Today's date.. $currentdate
 This check was made from .. $currentdir

 Missing links are listed by source filename below... 

$missing
EOF
close (MAIL);
print "====== ERRORS reported via email ======\n\n" ;
}
else { print "    == no email report ==\n\n" };

#                            SETUP:     
# - make note of perl and sendmail (or procmail) location, and the 
#   To: header, and make corrections as needed. The "To:" is currently 
#   set to $ENV{LOGNAME}. If email notification is to be send elsewhere, 
#   change "To:$ENV{LOGNAME}" to another email address. 
#   Be sure to escape the "@" as "\@"
# - See notes in the body of the program concerning appropriate use 
#   of sendmail or procmail. 
# - Set the file extensions to be looked for at the variable 
#                   $extensions="aaa|bbb|ccc";
#   be sure (1) the right side is enclosed in quotes followed with ;
#   (2) the extensions are separated with the | sign.
#   The variable $extensions may be found at the top of the program.
# - if "*.html" files are not used, delete this from the line of code
#              foreach $filename (`ls *.htm *.html`)
#   "ls" will write a "file not found" message to the screen if there 
#   are no "html" file extensions, yet this is included in the list.
#   Similarly other forms such as "php" can be added in this line.
#                            USAGE:
# WebCheck searches all *.htm and *.html files (or other file extensions
# as specified) in the current directory for "word chunks" ending in
# common file extensions included within HTML tags. The current directory,
# and any directory included as part of a filename, are inspected for the 
# existence of file names derived from these "link-like" word chunks. 
# The user account is notified by email of missing files. A separate
# e-mail will be sent for each directory where missing file names were 
# discovered. The missing files are listed by the name of the htm (or
# html) file where these are called.
# Note that _all_ of the files will be inspected, including orphans.
# Thus if you receive strange messages about some files, suspect that
# they may be files requested as links from abandoned html files. 
# Webcheck can determine orphaned files, that is, files to which no links
# exist, because all inspected files are touched. Orphans thus show up as 
# files with earlier dates ("ls -tl" will list and group by dates). 
# Note that all orphans will not be identified unless WebCheck has been
# executed in each subdirectory. See notes on a recursive wrapper, below. 
# NOTE: touching files is very time consuming, since the process is
# repeated at every instance a file is encountered. To VOID the 
# ability to identify orphaned files, comment out the line..
#                elsif (-e $&) { system (touch, $&)}   
# To have webcheck operate recursively through a file system, execute
# the following (this wrapper file is available as "check")..
# 	find . -type d -exec './webcheck' {}   \;
# (exactly as it appears above) from some starting point in the directory
# system. This assumes webcheck can be found on the path (as for example,
# in /usr/local/bin) or that a copy of webcheck is found in the root
# directory where file checking is started. 
# Webcheck operates verbosely, listing all the filenames which are
# inspected. to operate silently, hash the lines..
# 	print "$currentdir\n" ;
#	print "$filename";
#	print "   === ERRORS reported via email ===\n\n" ;
#	else { print "   === no email report ===\n\n" };
# Note that files linked from orphaned files will show as active files
# until the orphans are removed. See "about orphans" above.
# WebCheck will catch _any_ word-like link file names (with names of any
# size, including path names), including any nonalphabetical characters 
# except the double quote, equal sign, parenthesis, and included blanks.
#                          BUGS AND CAVEATS:
# - WebCheck lists missing links as often as they occur in a html file.
# - All anchors of the form "href=http://... etc" are ignored. 
# - Name anchors links of the form "file.htm#goto" are stripped of 
#   the information after the # mark before testing. 
#                          COPYRIGHT NOTICES:
# Copyright (C) 1998 2001 Cornelius Cook, Counterpoint Networking, Inc.
#        cook (at) outflux (dot) net
# Developmental design: Jno Cook, Aesthetic Investigation, Chicago
#             jno (at) blight (dot) com
# This program is free software; you can redistribute it and/or
# modify it under the terms of the GNU General Public License
# as published by the Free Software Foundation, version 2.
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA  02111-1307, USA.
# http://www.gnu.org/copyleft/gpl.html
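
For comparison, the core of the check can be sketched in Python. This is not a port of webcheck: it does not touch files, does not send email, and only returns what it finds. The names EXTENSIONS, TAG, CHUNK, missing_links, and check_tree are made up for this example, and the walk through the subdirectories stands in for the check wrapper.

```python
import os
import re

EXTENSIONS = "htm|html|txt|jpg|gif|zip|wav"      # same list as in webcheck
TAG = re.compile(r"<[^>]+>")
# A word chunk: no quote, equal sign, blank, parenthesis, or bar,
# ending in one of the known extensions.
CHUNK = re.compile(r'[^"= ()|]+\.(?:' + EXTENSIONS + r")\b", re.IGNORECASE)


def missing_links(html, exists=os.path.exists):
    """Return local link targets named inside tags which do not exist."""
    missing = []
    for tag in TAG.findall(html):
        for chunk in CHUNK.findall(tag):
            if "://" in chunk:
                continue                  # fully qualified URL, skipped
            if not exists(chunk):
                missing.append(chunk)
    return missing


def check_tree(top="."):
    """Check the htm and html files in every directory below `top`."""
    report = {}
    for dirpath, _dirs, files in os.walk(top):
        for name in files:
            if name.lower().endswith((".htm", ".html")):
                path = os.path.join(dirpath, name)
                with open(path, encoding="latin-1") as handle:
                    html = handle.read()
                found = missing_links(
                    html,
                    lambda link: os.path.exists(os.path.join(dirpath, link)))
                if found:
                    report[path] = found
    return report
```

Name links such as "two.htm#top" are tested as "two.htm", since the chunk ends at the extension.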


Get a copy of both files not as source but as [text] and clean them up, move them to your Unix account and proceed as follows:


Crontab

Crontab is a listing of what programs or scripts to run at certain times or on certain dates -- see below. Here is a typical crontab listing. I access this with the command

crontab -e

This places your current crontab file in the editor, where it typically looks as follows. (To find out how crontab works do "man crontab".)

# min hr dom mo dow, *- every
# send access data on sunday nights at 11:20
20 23 * * 0 /home/jno/scripts/send.access
# check error log daily, late at night and send
45 23 * * * /home/jno/errors
# at 1:11 am  update all the log files
11 1 * * * /home/jno/scripts/ai.access
# rewrites the records on first of the month at 6:11a
11 6 1 * * /home/jno/scripts/monthly.update
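
To make the five time fields plain, here is a short Python sketch which splits one crontab entry into named parts. It only splits the line; it does not work out when the entry would run. The field names and the function name are made up for this example.

```python
FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def parse_crontab_line(line):
    """Split one crontab entry into its five time fields and the command."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None                        # blank line or comment
    parts = line.split(None, 5)
    if len(parts) < 6:
        raise ValueError("expected five time fields and a command: " + line)
    entry = dict(zip(FIELDS, parts[:5]))
    entry["command"] = parts[5]
    return entry


print(parse_crontab_line("20 23 * * 0 /home/jno/scripts/send.access"))
```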


I'll go through these one by one. In each instance, be aware of the following..


ai.access

This is the basic script which greps the log files and extracts information. The number of file hits for this month is added to the number stored in the file "ai.old" and the total is written to the file "ai.hits" in the "scripts" directory and to the file "access" in the web site directory (public_html). A list of this month's domains is also written to the file "access".

# runs from crontab at 1:11am
# work from the scripts directory, where trunc, ai.old and ai.hits are kept
cd /home/jno/scripts/
grep  \~jno /var/log/httpd/access_log > /tmp/tmp1.$$
grep -c GET /tmp/tmp1.$$ > /tmp/tmp2.$$
awk -F" " '{ print $4,$1 }' /tmp/tmp1.$$ | ./trunc > /tmp/tmp3.$$
cut -c2-7,22- /tmp/tmp3.$$ | uniq -c > /tmp/tmp4.$$
cat ai.old  /tmp/tmp2.$$ > figure
echo + >> figure
echo p >> figure
dc < figure > ai.hits
date > /tmp/tmp5.$$
echo "Total file hits last year: 127,567 " >> /tmp/tmp5.$$
echo -n "Total file hits this year:  "  >> /tmp/tmp5.$$
cat ai.hits >> /tmp/tmp5.$$
echo "domains this month" >> /tmp/tmp5.$$
cat /tmp/tmp4.$$ >> /tmp/tmp5.$$
cp /tmp/tmp5.$$ /home/jno/public_html/access
rm /tmp/tmp*.$$


Since temporary files can get very large, they are written to the directory "/tmp", which does not have size limitations, and are deleted at the end (rm /tmp/tmp*.$$). The "$$" is the process ID of the script, which keeps the file names unique.
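
The counting itself can be sketched in Python without any temporary files. The sketch is not a translation of the script: the script adds the old and new counts by feeding them to dc followed by "+" and "p", while Python simply adds them, and the hosts are counted per name rather than with "uniq -c". The log lines in SAMPLE_LOG are made up, in the common log format, for illustration only.

```python
from collections import Counter

# Made-up log lines, for illustration only.
SAMPLE_LOG = [
    'one.example.com - - [12/Jan/2004:10:00:01 -0600] '
    '"GET /~jno/index.htm HTTP/1.0" 200 512',
    'one.example.com - - [12/Jan/2004:10:00:02 -0600] '
    '"GET /~jno/pix/a.gif HTTP/1.0" 200 2048',
    'two.example.org - - [12/Jan/2004:10:05:00 -0600] '
    '"GET /~fred/index.html HTTP/1.0" 200 300',
]


def tally(log_lines, account="~jno", hits_before=0):
    """Count the GET requests for `account` and the hosts which made them."""
    hits = 0
    hosts = Counter()
    for line in log_lines:
        if account not in line:
            continue                       # some other account
        hosts[line.split()[0]] += 1        # first field is the host
        if "GET" in line:
            hits += 1
    return hits_before + hits, hosts


total, hosts = tally(SAMPLE_LOG, hits_before=100)
print(total, dict(hosts))
```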

The index file of the web page includes a few lines which read...

<a href="access">[file hits]</a>
File hits to this site are updated around 5 am GMT.


Thus any viewer has access to the file hits and a list of domains which have requested files during the month.

You will also see that the script pipes some data through "trunc", which is listed below. This is not needed; it is just a nicety.


Trunc is a Perl script which reduces IP addresses to the first three numbers, and readable domain names to the last two portions with "*." in front. It is used as a pipe. You can leave it out of the ai.access script if you wish.

#!/usr/local/bin/perl -p
# data: date garbage, a space, 4 ips ..
# first backref includes date, space, and three ips
if (/^(\S+\s\d+\.\d+\.\d+)(\.\d+)$/) {           # 1 if
# keep the first back ref only (assumed: this line is missing on the page)
	$_=$1."\n";
	}                                        # end 1
# should split as above, now on words
# first extract date and space
# second back ref is ((word dot)+ word)
elsif (/^(\S+\s)([\S+\.]+[\S+])$/) {            # 2 elseif
# save the first back ref
# split all of the second on dots (loses them)
	@tmp=split(/\./,$2);
#	$last=$#tmp;
#	$dom=$tmp[$last];
#	$name=$tmp[$last - 1];
#	$_=$first."*.".$name.".".$dom."\n";
	$_=$1."*.".$tmp[$#tmp - 1].".".$tmp[$#tmp]."\n";
}                                                 # end 2
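
If Perl is not at hand, roughly the same reduction can be had from sed. This is a sketch under assumptions, not a drop-in replacement: the function name "trunc_sketch" and the sample lines are invented, "sed -E" must be available, and a host name without any dot is simply passed through.

```shell
# trunc_sketch: shorten "date host" lines as described for trunc
# (hypothetical stand-in written with sed)
trunc_sketch() {
    sed -E \
        -e 's/^([^ ]+ [0-9]+\.[0-9]+\.[0-9]+)\.[0-9]+$/\1/' \
        -e t \
        -e 's/^([^ ]+ )(.*\.)?([^.]+\.[^.]+)$/\1*.\3/'
}
# first rule:  drop the fourth number of an IP address
# "t":         skip the last rule if the first one already matched
# last rule:   keep the last two portions of a name, with "*." in front

# the same cut and uniq steps as in ai.access, on made-up data
printf '%s\n' \
    '[07/Jan/2004:01:10:00 192.168.10.25' \
    '[07/Jan/2004:01:11:00 192.168.10.77' \
    '[07/Jan/2004:01:12:00 dialup-12.chicago.example.net' |
    trunc_sketch | cut -c2-7,22- | uniq -c
```

The two addresses collapse into one line for "07/Jan 192.168.10" with a count of 2, and the dialup host is listed once as "07/Jan *.example.net".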



The script "monthly.update" updates the file "ai.old" on the first of each month, adding in the hits from the previous month's log file.

# a file to grep the previous httpd log file ...
# ... on the first of the month at about 6 am on crontab
cd /home/jno/scripts/
zcat /var/log/httpd/access_log.1 > /tmp/tmp1.$$
grep -c jno /tmp/tmp1.$$ > /home/jno/scripts/ai.prior
cat ai.old ai.prior > figure
echo + >> figure
echo p >> figure
dc < figure > ai.old
# run ai.access again
sh ai.access
rm /tmp/tmp1.$$



The script "send.access" emails you (weekly) how many files have been accessed so far during the year. This is file hits, not how many domains have logged in.

# script "send.access" to fetch and send access record 
# via email on Sundays
# 11/99 jno -- this script kept in /home/jno/scripts
# -- operated from crontab
#	store date
date > today
#	add access data
/home/jno/access >> today
#       mail info
mail -s lucien_access Jno (at) Blight (dot) com   < today

The above script calls "access" which is described below.


This is the script "access". As you can see, all the scripts are kept in a subdirectory "scripts" except this one.

echo -n "... new hits  "
cd /home/jno/scripts
sh ai.access
cat ai.hits


The results are written to the screen. As you can see, it calls the script "ai.access" in the subdirectory "scripts" which was described above.


The script "errors" sends email on the errors encountered by the httpd server every day, late at night. It gives you something to do the next day. The error_log is grepped for the account name, the entries are simplified, the information is sent to you via email, and a file which uses the date as its name is written to your directory.

# get error_log entries containing "jno"
grep jno /var/log/httpd/error_log  > /tmp/jno1.$$
# save only certain fields (" " field separator)
awk -F" " '{ print $2,$3,$4,$8,$13 }' /tmp/jno1.$$ > /tmp/jno2.$$
# save only entries for todays date
date '+%b %d' > datefile
grep "`cat datefile`"  /tmp/jno2.$$ > /tmp/jno3.$$
# delete "/home/jno/public_html"
awk ' sub ( /\/home\/jno\/public_html/, " " ) ' /tmp/jno3.$$ > /tmp/jno4.$$
# remove "]" and write to "date" file
awk ' sub ( /\]/, " " ) ' /tmp/jno4.$$ >  `date +%b%d`
#add the date and mail the shit
date '+%b %d' >> /tmp/jno4.$$
mail -s inspected Jno (at) Blight (dot) com < /tmp/jno4.$$
# take out the trash
rm -f /tmp/jno?.$$
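
To see what the field selection and the two sub steps leave behind, they can be tried on a single line. The sketch assumes the error_log has the Apache 1.3 layout, with the file path as the thirteenth space-separated field. The sample line, the client address, and the function name "simplify" are invented. Unlike the script above, it prints every line whether or not a substitution was made.

```shell
# simplify: keep month, day, time, client and path, then strip
# the public_html prefix and the "]" after the client address
simplify() {
    awk '{ print $2, $3, $4, $8, $13 }' |
    awk '{ sub(/\/home\/jno\/public_html/, " "); sub(/\]/, " "); print }'
}

# one made-up error_log entry
echo '[Wed Jan 07 23:40:12 2004] [error] [client 192.168.10.25] File does not exist: /home/jno/public_html/web/nofile.htm' | simplify
```

What remains is the date, the time, the client address, and the path as seen from the web site ("/web/nofile.htm"), with a few extra spaces where text was removed.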


Posting File Hits

The script below greps the httpd access_log file for file hits and records the total on the index page of the web site. The following only counts file hits (you could also count domain hits by grepping for your account name). Dc is used to tally the running total by adding the file hits from the previous months (hits.old) to the current hits (hits.new). The information is inserted into the index.htm file by rewriting it. It takes three steps, and maybe all of 2 seconds to do on the Unix box.

grep -c GET /www/logs/spaces/access_log > hits.new
# calculating total hits
cat hits.old hits.new > figure
echo + >> figure
echo p >> figure
 dc < figure > hits
# rewriting the index
 sh rewrite
cat hits
 rm hits.new figure
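
Where dc is not installed, the same running total can be had from the shell's own arithmetic. This is a sketch which swaps dc for "$(( ))". The function name "add_hits" and the figures are invented.

```shell
# add_hits: print the sum of the numbers stored in two one-line files
# (hypothetical helper, standing in for the cat/echo/dc lines)
add_hits() {
    old=$(cat "$1")
    new=$(cat "$2")
    echo $(( $old + $new ))
}

# demonstration with made-up figures
echo 1200 > /tmp/hits.old.$$
echo 34 > /tmp/hits.new.$$
add_hits /tmp/hits.old.$$ /tmp/hits.new.$$     # prints 1234
rm -f /tmp/hits.old.$$ /tmp/hits.new.$$
```

In the script above, the four lines which build "figure" and run dc could then be replaced by "add_hits hits.old hits.new > hits".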


"Rewrite" is a script which calls two awk files. These split the index file up into two temporary files (called foo and bar) at the point where -= and =- are found, which is the position where the hits are listed on the index page. Here is "rewrite":

echo ... rewriting the index file
awk -ffirst.awk /www/jno/spaces/index.htm
awk -fsecond.awk /www/jno/spaces/index.htm
cat foo hits bar > /www/jno/spaces/index.htm
rm foo bar


The files first.awk and second.awk are as follows:

# read all of index file up to -=, prints to foo
# should run as "awk -ffirst.awk index.htm"
 /<\!DOCTYPE/, /\-\=/ { print > "foo" }


This assumes that the index.htm file starts with "<!DOCTYPE" - if not, you could awk from "<html>"

# read all of index file from =- to end, prints to bar
# should run as "awk -fsecond.awk index.htm"
 /\=\-/, /<\/html>/ { print > "bar" }


This (again) assumes that the index.htm file ends with "</html>", that is, in lower case. Better check.
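
The whole split-and-rejoin can be tried on a scratch copy before it is let loose on the real index page. This is a sketch, assuming the two markers sit on separate lines as the awk ranges require. The function name "rewrite_index", the scratch directory, the sample page, and the hit count are all made up.

```shell
# rewrite_index: replace whatever sits between the -= and =- lines of
# the page $1 with the contents of the file $2 (hypothetical helper)
rewrite_index() {
    # same ranges as first.awk and second.awk; "[=]-" is the =- marker,
    # bracketed so no awk can mistake "/=" for an operator
    awk -v out="/tmp/foo.$$" '/<!DOCTYPE/, /-=/ { print > out }' "$1"
    awk -v out="/tmp/bar.$$" '/[=]-/, /<\/html>/ { print > out }' "$1"
    cat "/tmp/foo.$$" "$2" "/tmp/bar.$$" > "$1"
    rm -f "/tmp/foo.$$" "/tmp/bar.$$"
}

# a made-up page and hit count in a scratch directory
dir=$(mktemp -d)
cat > "$dir/index.htm" <<'EOF'
<!DOCTYPE html>
<html><body>
<p>File hits: -=
100
=- so far</p>
</body></html>
EOF
echo 250 > "$dir/hits"

rewrite_index "$dir/index.htm" "$dir/hits"
cat "$dir/index.htm"
rm -rf "$dir"
```

The old figure (100) between the markers is dropped and the new one (250) takes its place. Everything before -= and after =- is carried over unchanged.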
