Website Provider: Outflux.net
URL: http://jnocook.net/web/files.htm
[Tags overview] [Page design] (File organization) [Tools]
File Organization for Web Pages
(Up Feb 77, last modified 7 Jan 04) (Oct 06, email addresses munged -- beware) This page deals with the organization of the HTML files at the server location and on your DOS or Windows machine, and offers some DOS/Windows batch-file utilities and Unix scripts. Revised recently to reflect current practice.
- [What other people do]; Unix file organization.
- [Directory structure]: What you need to know.
- [DOS utilities]: Making batch changes, jamming, extraction.
- [Unix utilities]: Link Integrity, log greps, crontab.
What Other People Do
What other people do is to use the editor supplied with some browser (like "Compozer"), or something else which keeps them away from actual tags and stuff. Then the click of a button will start up the modem, make a ppp connection, start up an ftp session, and move the files onto some remote directory. Whamo!
If you use one of the browser-supplied web page editors, it can only be guaranteed that things will look great for *your* browser, on *your* machine, and might crash visually when seen on almost any other screen. I said that before.
I recommend hand coding. Use Qedit in DOS or in a DOS box in Windows, use Notepad or Wordpad in Windows, use "something plain text" on a MAC, use Vi, Emacs, or Pico on a Unix system.
If you must have a Windows HTML editor check out [HTML Writer], which allows you to hand code, and simultaneously view the results with Netscape (or another browser) (it is 16 bit, though) (also, the entities are broken).
To learn how to code by hand, just start. Get the document [HTMLPrimer] at NCSA at UIUC, read it, and in twenty minutes you will know how to write web pages. The advantage you have as an inexperienced beginner is that your tags will be more universal, easier on other browsers, and give more predictable results, simply because you will limit the tags to basics.
The server at the UNIX Box
(added Jan 2004:) I always assumed that the web page server would be at a Unix box, since Unix has been the mainstay of the internet for 30 years. But now there are also MS Servers and even MAC machines. I suggest you avoid these; MS is incredibly buggy, and my only experience with MAC is that it was exceedingly slow.
To continue, the file server is the program which receives requests for web files, and sends them if they exist at predetermined locations within the local file system. The best, fastest, most frequently used, and most secure server is the open source "Apache".
Before proceeding any further, it is important to know how the Unix system -- where your files are publicly located -- organizes your files. Most likely you will have an account in your name, perhaps located in the directory "/usr/home/fred" or "/home/fred" -- if your account name is "fred".
Directories are files which hold the names of other files. On MACs, and more recently on Windows, they are called "folders". I have no idea how things are arranged on MAC or MS Servers. But Unix systems all use nearly the same file tree.
The "path" is the route to the location of a file from the start of things (the root), signified with "/". All the other forward slashes are separators between directories within directories, etc (DOS and Windows use a backward slash).
The last name on the path string is either another directory, or the actual file. The path to your account directory is "/usr/home/fred/". If you have a file there by the name of "statistics" then the fully qualified path-and-name is "/usr/home/fred/statistics".
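If you are not sure what your own home path actually is, a couple of standard Unix commands will show it once you are logged in (a minimal sketch; the "fred" paths are just the example used above):

cd       # with no argument, cd returns you to your home directory
pwd      # prints the full path, something like /usr/home/fred or /home/fred
ls -a    # lists what is there, including "public_html" if it already exists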
Within your home directory on the UNIX box, you will have a subdirectory called (most often) "public_html" which will contain all the html and image files, and may contain additional subdirectories such as "images" or "cgi_bin". You can add additional subdirectories. This is more or less standard practice, for UNIX file organization tends to be fairly rigid.
When viewers request web page files from your site, they will type something like
http://wherever.com/~fred
Here the "~" symbol means "home directory of .." ("fred" in this case). Note that the "wherever.com" is not case sensitive, but "fred" is case sensitive, because Unix file (and directory) names are case sensitive.
The HTTP server will automatically add "/public_html/index.html" to this chain, will expand "/~fred" to "/usr/home/fred/", and will deliver the file "index.html" to the viewer. The HTTP server thus fills all the gaps and delivers the file..
/usr/home/fred/public_html/index.html
Alternately, if the index.html file is not found, the server (if the system administrator has set this up) will deliver the file index.htm. The server has a list of default files which it will look for. If neither is found, the server (most likely) will deliver a directory listing. You don't want that.
Check to see if the server might deliver a file directory; if this happens, just include a blank "index.html" file in the directory whose contents you do not want seen by the public.
The use of the more-or-less hidden directory public_html is a safety measure which keeps even smart users out of Fred's home directory. Smart administrators will alias your public_html directory to some other name, and may even place it on another machine. If you don't have a public_html directory, you have to make one. Or ask.
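If you do end up making the directory yourself, the following is a minimal sketch of the commands involved. The permission numbers here are typical for this setup, but they are my assumption, not part of the original instructions -- when in doubt, ask the administrator.

cd                              # go to your home directory
mkdir public_html               # create the web directory
chmod 755 public_html           # the world can read and enter it, only you can write
chmod 711 .                     # others may pass through your home directory, but not list it
touch public_html/index.html    # an empty index file also blocks directory listings (see above)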
Aliased website names
But maybe you registered a domain name, maybe something like "mywebpages.com". In that case someone has set all the information up at the Unix dns database, and the Apache server will know when it receives a request for "mywebpages.com" to translate this internally to
/usr/home/fred/public_html/index.html
Make sure the administrator has also set it up so that the same results happen if a viewer types "www.mywebpages.com" instead of "mywebpages.com". The "www" is never required, because it is not a domain which has to be looked up, and the record for "www.mywebpages.com" only exists locally. Amazing as this may sound, a college in Chicago has never entered the bare domain name in its dns record: you cannot get Columbia College at "colum.edu", only at "www.colum.edu".
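To check whether both forms of the name have actually been set up, you can ask the dns directly from a Unix prompt. A quick sketch -- "mywebpages.com" is the made-up name from above, and "host" may have to be "nslookup" on older systems:

host mywebpages.com        # should print the IP number of the web server
host www.mywebpages.com    # should print the same number; if this one fails, complain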
Directory Structure
The hard way to do things is to have all the files use html extensions, and to use fully qualified URLs, that is, of the form,
href="http://www.domain/account-and-directory-path/filename.html"
This was my habit before changing web site locations six times in one year. It guards against picky or broken browsers which have problems parsing the complete URL from the snippets you supply within the html files, but in the end it was not worth the effort of rewriting files just for the sake of fully qualified URLs.
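If you inherit a set of files written the hard way, a one-line grep at the Unix box will show which ones still carry fully qualified local URLs and would need rewriting. A sketch only; substitute your real domain for the hypothetical "www.domain" used above:

cd ~/public_html
grep -l "http://www.domain" *.htm    # -l lists just the names of the files that still match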
The Easy Way to do Things
Decide to use only *.htm extensions, and get the website provider's server to actually deliver the index.htm file if the index.html file is not found. You will be able to get away with less coding and shorter URLs.
At your home based DOS, Windows, or Linux machine, set aside a directory for your web pages, let's say "/web" (that might be "\web" in a DOS or Windows file system). Below this create directories for other purposes, maybe for image files, administrative matters, scripts, whatever, and an upload directory. Doing this might result in the following organization (I'll use Windows file system notation).
directory          usage
c:\web             (temporary copies of files for uploading go here)
c:\web\files       (the web site htm/html files go here)
c:\web\files\pix   (the web site image files go here)
c:\web\admin       (various scripts, batch files, etc, go here)
Simple enough?
All the rest of this document deals alternately with how to implement this with the greatest convenience, or with a few batch files, which are detailed below. But first, a look at what to expect in each of the directories.
C:\web
I keep this higher level directory empty, and only use it for files that are to be uploaded, and also for files which are being received from the web site. Whenever I'm done with ftp, I clear out the directory of these files.
Alternately, use this directory to store the ftp and telnet executables, and the "start.htm" file (see below).
By the way, ftp.exe is included with Windows98. Find it in the C:\Windows directory, and make it a desktop icon (link, shortcut).
C:\web\files
Keep all the htm files in this directory. If you also keep all the image files here, it will simplify things somewhat. By the time you end up with a dozen html files, and two dozen images, it may be time to move the image files elsewhere.
C:\web\files\pix
How easy it is to look through a few hundred files which mix *.htm, *.gif, and *.jpg depends on the capabilities of the directory browser you use. You may want to keep image files segregated from the html files, because it may just get too difficult to find anything in a directory which lumps all of them together. I haven't had trouble until hitting 1000 files in a directory. Do what you want.
The URLs for an image file would then be written in web files as ..
<IMG SRC="./pix/filename">
.. and at the UNIX system you would use the same names and the same directory structure. You could use any name whatsoever instead of "pix."
c:\web\admin
Used (by me) to store Unix scripts, CGI files, and other stuff. I also have separate directories for incoming images (which need to be "worked on" before they become part of an html file), etc.
Browser Setup for Local Viewing
Any browser can be set up to start up with a local file. Create the file and save it (perhaps in the directory "\web").
The file start.htm could read as follows, where I have also added the ability to jump to the remote web site or start up other files (like a bookmark file, or a listing of search engines)...
<h2>START.HTM at GW</h2>
<a href="file:///c|/web/index.htm">[spaces local]</a>
<a href="http://spaces.org/index.htm">[Spaces remote]</a>
The start.htm file will be read even though it is not a proper html file. Go to the start.htm file with the browser, and then select it to be the default starting point (rather than logging on to some remote location every time you start up the browser). The additional advantage of starting a browser with a local file is that you will not have to establish a PPP phone connection to see your files.
Note that all the slashes above are forward. Some browsers will take forward slashes and translate them correctly for a Windows file system, some will not. Some require only "c:\net\start.htm" -- that is, without the "file" and the three slashes. Some will require the colon of "hard drive C" to be substituted with a pipe symbol (|).
If you don't want all the links (I show only two) in a line, precede each anchor with <LI> or <BR> -- the BR tag is more compact.
DOS Batch File Utilities
The following describes a set of utilities which depend on DOS Commands, any of which can be found (perhaps under different names) at [http://uiarchive.cso.uiuc.edu/info/search.html]. Specifically, these include GREP, GSAR, FOLD, NODUPES, AWK and BASENAME (from the Timo utilities at Garbo). Download a Zip file of these (126K), complete with docs.
The following batch files use these utilities. The batch files operate very fast, taking only a second or two to make changes to a hundred files.
In the following listings, I have deleted the paths for the called executables -- presumably your utilities are to be found on the path.
Some batch files, especially a few which operate from a logged directory, come to a stop (with a "pause") to ask, "Do you really want to do this?"
Webfix.bat
The following uses Gsar exclusively to make specific changes to all the *.htm files in a directory. To use the batch file, rewrite it for the specifics of the changes you want to make.
The example below rewrites the color codes for a batch of htm files. Other examples below.
@echo off
echo. webfix.bat - from logged directory
pause
:: less-than is :060
:: greater-than is :062
:: space is :032
:: use :: for :
:: EOL is :013:010
:: -o overwrite original file
:: -i no case check
echo. changing colors
gsar -s#ddFFFF -r#ccFFFF -i -o *.htm
gsar -s#FFddFF -r#FFccFF -i -o *.htm
gsar -s#FFFFdd -r#FFFFcc -i -o *.htm
gsar -s#ddFFdd -r#ccFFcc -i -o *.htm
gsar -s#ddddFF -r#ccccFF -i -o *.htm
gsar -s#FFdddd -r#FFcccc -i -o *.htm
gsar -s#0000FE -r#0000FF -i -o *.htm
echo. ...........done
The above file reads through all the htm files (seven times in this case) to make changes, overwriting the original files.
Some additional examples..
- change "[index]" to "[home]" in image ALT tags
gsar -s[Index] -r[Home] -i -o *.htm
- change a file reference from "other.htm" to "chicago.htm"
gsar -sother.htm -rchicago.htm -i -o *.htm
- changing the wording "(you are here)" to "(this page)"
gsar -s(you:032are:032here) -r(this:032page) -i -o *.htm
- removing every instance of the phrase "we have moved to"
gsar -sWe:032have:032moved:032to -r -i -o *.htm
Note that I almost always include (and reinsert) some markers, like parentheses, quotes, or colon_slash_slash, so that the phrase will only be changed at the desired location, and not in some arbitrary location in the text which might happen to contain the same wording as the search specifies.
Webgrep.bat
To find the existence of a word or phrase in a set of htm files, use Webgrep.bat listed below. This batch file writes a list of lines containing the asked-for text, along with the name of the files where these occur, and writes it as file "foo."
@echo off
grep -i "%1" *.htm > foo
If, for example, you are looking for the word "foobar," start this batch file from the directory where you want to do the search, by typing..
webgrep foobar
If you have a listing program or an editor available, you can open up foo for screen display by adding another line to the batch file...
list foo
Weblinks.bat
To make a list of all the links which are referenced in a set of html files, use the following batch file, here called Weblinks.bat.
@echo off
echo. weblinks.bat, to extract background, href, src files-names
echo. writes to file "links", from LOGGED DIRECTORY
echo.
:: initialize 1.tmp
echo. > 1.tmp
echo. search for .htm .gif .jpg (takes time)
:: all file names are lower case
:: all filename sources are quoted
awk " $0 ~/\.htm|\.gif|\.jpg/ { print } " *.htm >> 1.tmp
echo. save src hrefs background
:: first change all to upper case
gsar -ssrc= -rSRC= -i -o 1.tmp
gsar -shref= -rHREF= -i -o 1.tmp
gsar -sbackground= -rBACKGROUND= -i -o 1.tmp
awk " $0 ~/SRC|HREF|BACKGROUND/ { print } " 1.tmp > 2.tmp
echo save only quoted text
awk " $0 ~/\"/ { print } " 2.tmp > 3.tmp
echo. subst lf for quotes
gsar -s" -r:013:010 -i -o 3.tmp
echo. subst lf for =
gsar -s= -r:013:010 -i -o 3.tmp
echo. selecting out .htm .gif .jpg
awk " $0 ~/\.htm|\.gif|\.jpg/ { print $1 } " 3.tmp > 4.tmp
echo. deselecting hash
awk " $0 !~/\#/ { print $0 } " 4.tmp > 5.tmp
echo. sorting (long wait)
sort < 5.tmp > 6.tmp
echo. removing duplicate lines
nodupes 6.tmp > links
echo. remove temp files
del ?.tmp
echo. done with weblinks
echo.
What you get here is a list of every htm, gif, and jpg file which is called upon by the set of html files in a directory, presented as a single column, in lower case, and sorted alphabetically. Only the name links (hash links) are ignored. Modify the above file to suit your needs.
One of the problems with the above batch file is that Awk has line length limits. The batch file will thus screw up if the file(s) to be inspected use very long lines, or are jammed. To clean up a file, and make it easy to read, use webjam.bat and unjam.bat, below, in succession.
Webjam.bat
Wrote this to jam html files before uploading them to the Unix box. Jamming is where you remove all the blank space and the end-of-line markers from a file so that it is, in effect, one endless line long. Browsers don't care; people go bonkers with this.
It saves a little space, saves upload time, saves server delivery time, makes them virtually impossible to read on "view source" from a browser, will get mangled if imported to a text editor, and tends to screw up the browser and print spooler of anyone who attempts to print the file.
I have seen meaner versions which start with a giant blank space, that is, a hundred EOL's, or worse, insert Form Feeds. This makes printers just spit out paper, and makes viewers think that perhaps there is nothing there, so they shut down the printer and give up.
@echo off
echo. WEBJAM.BAT calls other
echo. from logged directory
echo. jams ALL HTM files
echo. unless other extension is specified
echo.
pause
set file=*.htm
if "%1"=="" goto skipped
set file=%1
echo. do for -%file%- files?
echo.
pause
:skipped
echo. ... part 1: replace eol with space
for %%D in (%file%) do call webjam1.bat %%D
echo. ... part 2: replace double spaces
for %%D in (%file%) do call webjam2.bat %%D
echo. ............ done
The two called files (webjam1 and webjam2) follow:
@echo off
:: webjam1.bat, called by webjam.bat
:: replace eol with space
:: if from Linux, replace 10 with space
gsar -s:013:010 -r:032 -i -o %1
gsar -s:010 -r:032 -i -o %1
@echo off
:: webjam2.bat, called by webjam.bat
:: replace double spaces
gsar -s:032:032 -r:032 -i -o %1
If you want to remove all extra blank spaces (browsers just skip over blanks anyway), then go through a set of files recursively, as follows..
@echo off
:: blanks.bat
:: replace almost all blank spaces
gsar -s:032:032:032:032 -r:032 -i -o *.htm
gsar -s:032:032:032 -r:032 -i -o *.htm
gsar -s:032:032 -r:032 -i -o *.htm
gsar -s:032:032 -r:032 -i -o *.htm
Unjam.bat
And this cleans things up again..
@echo off
echo. UNJAM.BAT from current directory
echo. removes all blanks, eols, adds EOL before every LT
echo. overwrites original
echo. DOES HTM FILES ONLY
pause
echo. ..........making changes
for %%D in (*.htm) do call unjam1.bat %%D
echo. ..........renaming
del *.htm
ren *.un *.htm
del foo
The called file "unjam1.bat" is below. It will start all opening tags at the beginning of a line, place closing tags on separate lines, and introduce a blank line before any P, HR, FRAME, LI, or CENTER tags. Adjust it to suit your needs. At the end of the script the lines get folded to 77 spaces.
@echo off
:: unjam1.bat part of unjam.bat
:: less-than :060; greater-than :062;
:: use :: for :
:: blank is :032, ? is :063, * is :042, EOL is :013:010 / is :047
echo. doing %1
:: write a bak file
basename %1
copy %1 %basename%.bak > nul
:: operate on a different file name, so this call doesn't repeat
copy %1 foo > nul
:: first add a blank space and eol at close of the file
gsar -s:060:047html:062 -r:060:047html:062:032:013:010 -i -o foo
:: removes eols, for DOS
gsar -s:013:010 -r:032 -i -o foo
:: removes eols (dec13) UNIX
gsar -s:013 -r:032 -i -o foo
:: removes many blanks
gsar -s:032:032 -r:032 -i -o foo
:: puts DOS eol in before opening tags
gsar -s:060 -r:013:010:060 -i -o foo
:: removes eols before terminating tags
gsar -s:013:010:060:047 -r:060:047 -i -o foo
:: puts DOS eol in before H tags
:: has to do for both cases to retain local CAPS or LC
gsar -s:060h -r:013:010:060h -o foo
gsar -s:060H -r:013:010:060H -o foo
:: puts DOS eol in before P tags
:: has to do for both cases to retain local CAPS or LC
gsar -s:060p -r:013:010:060p -o foo
gsar -s:060P -r:013:010:060P -o foo
:: puts DOS eol in before FRAME tags
:: has to do for both cases to retain local CAPS or LC
gsar -s:060F -r:013:010:060F -o foo
gsar -s:060f -r:013:010:060f -o foo
:: puts DOS eol in before LI tags
:: has to do for both cases to retain local CAPS or LC
gsar -s:060li -r:013:010:060li -o foo
gsar -s:060LI -r:013:010:060LI -o foo
:: puts DOS eol in before CENTER tags
:: has to do for both cases to retain local CAPS or LC
gsar -s:060c -r:013:010:060c -o foo
gsar -s:060C -r:013:010:060C -o foo
:: fold text
fold -s -w 77 foo > %basename%.un
Basename.exe (used in the batch file above) is available from Garbo in the Timo DOS utils subdirectory.
some Unix scripts
Here are a few Unix scripts. Most of these are in current use, almost all of them are run automatically from crontab entries. More on that below, but first a word about editors.
There are three editors available on Unix boxes. The fastest is Vi, but it is just a bitch to learn. Then there is Emacs, which will do just about anything, including logging on to the internet -- it represents a whole way of life. Too much stuff if you ask me. Finally there is Pico, the editor which comes with Pine, the email program.
I suggest Pico. Set up your default editor as follows, in the ".bash_login" or the ".profile" file as
EDITOR=pico
export EDITOR
.. or something similar if you are using another shell besides "bash". (To check what shell you have been assigned, do "finger {your account name}". To find how the shell works, do "man {shell name}" -- without the {} brackets or quotes.)
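As another quick check (a sketch, not from the original instructions), the assigned login shell is recorded in /etc/passwd on most systems, and the running shell sits in the SHELL variable:

echo $SHELL                 # the shell you are running right now
grep "^fred:" /etc/passwd   # the last field of this line is the assigned login shell ("fred" is the example account)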
Link Integrity
Of all the administrative tasks you might get involved in, the most important is to make sure that all the links work and the expected files are in place at your website. You could just log in with a browser to your index file, and then check all the links, and check for all the images, but this can get out of hand by the time you have a thousand files.
The following Unix Perl script, webcheck, was written to shorten that job. The script looks at all the htm and html files (and other files such as shtml and php) at the site, and checks for the existence of all local links, even if located in subdirectories.
Webcheck will not find orphaned files (although it can be modified to do so -- see below), but that doesn't matter as far as your viewers are concerned. It will check the links of orphaned htm files, though. Webcheck will not check external links or name links (hash links). At the end of the task it will send you email with a list of missing files.
In 2000 I wrote a wrapper so that "webcheck" would operate recursively through all subdirectories from the root directory of a website.
The wrapper is called check and goes as follows (below). The assumption for the script is that check and webcheck will be found on the path. A likely placement would be in /usr/local/bin, which you probably have access to. Otherwise append the full path to the exec of find.
#!/bin/bash
# $Id: check,v 3.2 2001/06/14 05:11:36 jno Exp $
# /usr/local/bin/check - recursive use of webcheck,
# this is a wrapper file -- takes time.
# --- invoke from any directory 12/11/00
find . -type d -exec 'webcheck' {} \;
The Perl script for webcheck, complete, goes like this..
#!/usr/bin/perl
# $Id: webcheck,v 3.28 2002/11/30 20:50:54 jno Exp $
# /usr/local/bin/webcheck, Usage: webcheck {directory}
# defaults to $ENV{LOGNAME} with no commandline parameters
# DESCRIPTION: A html linting utility written in Perl, which
# checks all internal anchor, img, and background links,
# and can be used recursively. See notes below program.

########## NOTE: Make a selection of file extension you wish this
########## program to check for. Include images.
$extensions = "htm|html|txt|jpg|gif|zip|wav";

$currentdate = (`date`);
$currentdir = $ARGV[0] ;
# set cwd if argv[0] is not set
($ARGV[0]) or die "must specify directory\n";
chdir "$currentdir" ;
print " directory: $currentdir\n" ;

########## NOTE: add or delete the "htm", "html", or "shtml" file
########## extensions as needed in the line below. See notes below.
foreach $filename (`ls *.htm *.html`)            # 1- each file
{
  $all = () ;                                    # reset the blurch
  undef $/;                                      # K: undef eol
  open (FILE, "<$filename");
  print "$filename";
  $all = join ("",<FILE>);                       # all of file slurped up
  close FILE;
  @code =();                                     # find <..> segms and slurp
  while ($all =~ /<[^>]+>/gim) {push (@code, $&)}
  $links=();                                     # clear links collection
  foreach $code (@code) {                        # 2- each segment inspected
    # start not-any (" = blank),
    # repeat, follow with period,
    # end with known extension.
    while ($code=~ /[^(\"|=| )]+\.($extensions)/gims) {   # 3-
      if ($& =~ /:\/\//) { next }                # skip http files
########## NOTE: Hash out the following line to speed up WebCheck,
########## and see notes under "Orphans" below.
      elsif (-e $&) { system (touch, $&)}           # touch if exists
      else { $links .= " -- $&\n" unless -e $&};    # list if not exist
    } #-3
  } #-2
  if ($links) { $missing .= "$filename$links\n" }   # assoc w filename
} #-1

######### Send email locally (to owner) -- see notes
######### NOTE: if sendmail delays delivery, use procmail instead.
######### Both forms are shown below.
######### Or run /usr/sbin/sendmail -q as root
if ($missing) {
  open (MAIL, "|/usr/sbin/sendmail -oi -n -t");
  # open (MAIL, "|/usr/bin/procmail -Y");
  print MAIL <<EOF;
To:$ENV{LOGNAME}
From:WEBCHECK
Subject:$currentdir

"WebCheck" searches all *.htm and *.html files in the logged directory
for word chunks ending in the following file extensions...

$extensions

The current directory and listed paths of a link are inspected for the
existence of files. Fully qualified URLs and name anchors are skipped.

Today's date.. $currentdate
This check was made from .. $currentdir

Missing links are listed by source filename below...

\n$missing

(end)
EOF
  print "====== ERRORS reported via email ======\n\n" ;
}
else { print " == no email report ==\n\n" };
#
# SETUP:
#
# - make note of perl and sendmail (or procmail) location, and the
# To: header, and make corrections as needed. The "To:" is currently
# set to $ENV{LOGNAME}. If email notification is to be sent elsewhere,
# change "To:$ENV{LOGNAME}" to another email address.
# Be sure to escape the "@" as "\@"
#
# - See notes in the body of the program concerning appropriate use
# of sendmail or procmail.
#
# - Set the file extensions to be looked for at the variable
#
#   $extensions="aaa|bbb|ccc";
#
# be sure (1) the right side is enclosed in quotes followed with ;
# (2) the extensions are separated with the | sign.
# The variable $extensions may be found at the top of the program.
#
# - if "*.html" files are not used, delete this from the line of code
#
#   foreach $filename (`ls *.htm *.html`)
#
# "ls" will write a "file not found" message to the screen if there
# are no "html" file extensions, yet this is included in the list.
# Similarly other forms such as "php" can be added in this line.
#
# USAGE:
#
# WebCheck searches all *.htm and *.html files (or other file extensions
# as specified) in the current directory for "word chunks" ending in
# common file extensions included within HTML tags. The current directory,
# and any directory included as part of a filename, are inspected for the
# existence of file names derived from these "link-like" word chunks.
#
# The user account is notified by email of missing files. A separate
# e-mail will be sent for each directory where missing file names were
# discovered. The missing files are listed by the name of the htm (or
# html) file where these are called.
#
# Note that _all_ of the files will be inspected, including orphans.
# Thus if you receive strange messages about some files, suspect that
# they may be files requested as links from abandoned html files.
#
# Webcheck can determine orphaned files, that is, files to which no links
# exist, because all inspected files are touched. Orphans thus show up as
# files with earlier dates ("ls -tl" will list and group by dates).
#
# Note that all orphans will not be identified unless WebCheck has been
# executed in each subdirectory. See notes on a recursive wrapper, below.
#
# NOTE: touching files is very time consuming, since the process is
# repeated at every instance a file is encountered. To VOID the
# ability to identify orphaned files, comment out the line..
#
#   elsif (-e $&) { system (touch, $&)}
#
# To have webcheck operate recursively through a file system, execute
# the following (this wrapper file is available as "check")..
#
#   find . -type d -exec './webcheck' {} \;
#
# (exactly as it appears above) from some starting point in the directory
# system. This assumes webcheck can be found on the path (as for example,
# in /usr/local/bin) or that a copy of webcheck is found in the root
# directory where file checking is started.
#
# Webcheck operates verbosely, listing all the filenames which are
# inspected. To operate silently, hash the lines..
#
#   print "$currentdir\n" ;
#   print "$filename";
#   print " === ERRORS reported via email ===\n\n" ;
#   else { print " === no email report ===\n\n" };
#
# Note that files linked from orphaned files will show as active files
# until the orphans are removed. See "about orphans" above.
#
# WebCheck will catch _any_ word-like link file names (with names of any
# size, including path names), including any nonalphabetical characters
# except the double quote, equal sign, parenthesis, and included blanks.
#
# BUGS AND CAVEATS:
#
# - WebCheck lists missing links as often as they occur in a html file.
# - All anchors of the form "href=http://... etc" are ignored.
# - Name anchor links of the form "file.htm#goto" are stripped of
#   the information after the # mark before testing.
#
# COPYRIGHT NOTICES:
#
# Copyright (C) 1998 2001 Kees Cook, Counterpoint Networking, Inc.
# cook (at) outflux (dot) net
# Developmental design: Jno Cook, Aesthetic Investigation, Chicago
# jno (at) blight (dot) com
#
# This program is free software; you can redistribute it and/or
# modify it under the terms of the GNU General Public License
# as published by the Free Software Foundation, version 2.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
# http://www.gnu.org/copyleft/gpl.html
#
Get a copy of both files not as source but as [text] and clean them up, move them to your Unix account, and proceed as follows:
- Use the command "which perl" to find the location of Perl.
- Use the command "which sendmail" to find the location of sendmail.
- Open webcheck with an editor (vi, emacs, pico), and make the following changes...
- adjust the first line to point to the location of Perl, as, for example..
#!/usr/local/bin/perl
- adjust the line starting with "$extensions ..." to reflect the file extensions you want to look for. For example, you may want to add "png" as a file extension, as for example..
$extensions = "htm|html|gif|jpg|jpeg|txt|zip|wav|png";
- adjust the line starting with "foreach $filename.." to remove either "*.htm" or "*.html" if you will never use one of these file extensions, so that it might, for example, read..
foreach $filename (`ls *.htm`)
Those are "backticks", by the way, found below the ~ key.
- adjust the line starting with "open (MAIL ..." to reflect the location of sendmail, as for example..
open (MAIL, "|/usr/bin/sendmail -oi -n -t");
- adjust the line "To:$ENV{LOGNAME}" if you want email delivered elsewhere than to your home directory. See more detailed instructions in the body of the webcheck file, for example, to deliver the mail to me, make it read..
To:Jno (at) Blight (dot) com
- save the file and exit from the editor.
- Use the commands "chmod u+x webcheck" and "chmod u+x check" to make "webcheck" and "check" executable.
- Move "check" and "webcheck" to a directory on the path, like /usr/local/bin.
- Go to the root directory of your web files, and execute the script for recursive validation with the command "check ." and wait for email notification of errors.
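Put together, the whole installation amounts to a short session at the shell prompt. A sketch under the assumptions above (write access to /usr/local/bin, and web files under ~/public_html or its aliased equivalent):

chmod u+x webcheck check            # make both scripts executable
mv webcheck check /usr/local/bin    # put them on the path
cd ~/public_html                    # go to the root of the web files
check                               # the wrapper walks every subdirectory from "." (here)
# ... then wait for the email report of missing links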
Crontab
Crontab is a listing of what programs or scripts to run at certain times or on certain dates -- see below. Here is a typical crontab listing. I access this with the command
crontab -e
This places your current crontab file in the editor. It typically looks as follows. (To find out how crontab works, do "man crontab".)
# min hr dom mo dow, *- every
#
# send access data on sunday nights at 11:20
20 23 * * 0 /home/jno/scripts/send.access
#
# check error log daily, late at night and send
45 23 * * * /home/jno/errors
#
# at 1:11 am update all the log files
11 1 * * * /home/jno/scripts/ai.access
#
# rewrites the records on first of the month at 6:11a
11 6 1 * * /home/jno/scripts/monthly.update
I'll go through these one by one. In each instance, be aware of the following..
- The first line of most scripts starts with "#!/bin/bash" -- that is the shell in use on a Linux machine. Be sure to find out what shell is in use or available on your Unix box. You can always use "#!/bin/sh" which is a generic shell. And check the location by doing "which sh". (A few commands for checking this point and the next ones are sketched just after this list.)
- You need to know in what directory the httpd log files are kept, and what they are called. Ask. And you need permission to access that directory and need read-access for the files.
- You need to know how often the log files are rotated, and if older log files are compressed.
- You need to look at the log files and determine what format they use, so that "cut" and "awk" can be adjusted in the following scripts.
- You need permission to write crontab files.
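Here are a few commands that answer most of the points above, sketched under the assumption that the logs live in /var/log/httpd as in the scripts below (your system may keep them elsewhere, under other names):

which sh                            # location of the generic shell
which perl                          # location of perl
ls -l /var/log/httpd/               # are the log files here, and do you have read access?
head -2 /var/log/httpd/access_log   # look at the log format before adjusting "cut" and "awk"
crontab -l                          # lists your crontab; an error here usually means no permission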
ai.access
This is the basic script which greps the log files and extracts information. The number of file hits for this month is added to the number stored in the file "ai.old" and the total is written to the file "ai.hits" in the "scripts" directory and to the file "access" in the web site directory (public_html). A list of this month's domains is also written to the file "access".
#!/bin/bash
# runs from crontab at 1:11am
grep \~jno /var/log/httpd/access_log > /tmp/tmp1.$$
grep -c GET /tmp/tmp1.$$ > /tmp/tmp2.$$
awk -F" " '{ print $4,$1 }' /tmp/tmp1.$$ | ./trunc > /tmp/tmp3.$$
cut -c2-7,22- /tmp/tmp3.$$ | uniq -c > /tmp/tmp4.$$
cat ai.old /tmp/tmp2.$$ > figure
echo + >> figure
echo p >> figure
dc < figure > ai.hits
date > /tmp/tmp5.$$
echo "Total file hits last year: 127,567 " >> /tmp/tmp5.$$
echo -n "Total file hits this year: " >> /tmp/tmp5.$$
cat ai.hits >> /tmp/tmp5.$$
echo "domains this month" >> /tmp/tmp5.$$
cat /tmp/tmp4.$$ >> /tmp/tmp5.$$
cp /tmp/tmp5.$$ /home/jno/public_html/access
rm /tmp/tmp*.$$
Since temporary files can get very large, they are written to the directory "/tmp" which does not have size limitations, and then deleted at the end (rm /tmp/tmp*.$$) -- "$$" is the process ID of the script and unique.
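The cat/echo/dc lines above are just reverse-polish arithmetic: the "figure" file ends up holding two numbers, a "+" and a "p" (print), and dc adds them. A minimal sketch with made-up numbers:

echo 120 > figure    # the old running total (what ai.old holds)
echo 35 >> figure    # this month's count (what grep -c produced)
echo + >> figure     # dc: add the two numbers on the stack
echo p >> figure     # dc: print the result
dc < figure          # prints 155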
The index file of the web page includes a few lines which read...
<P> <a href="access">[file hits]</a> <br>
File hits to this site are updated around 5 am GMT.
Thus any viewer has access to the file hits and a list of domains which have requested files during the month.
You will also see that the script pipes some data through "trunc" which is listed below. This is not needed, it is just a nicety.
trunc
Trunc is a Perl script which reduces domain names to the first three ip numbers or the last three portions of the readable domain name. It is used as a pipe. You can leave it out of the ai.access script if you wish.
#!/usr/local/bin/perl -p
# data: date garbage, a space, 4 ips ..
# first backref includes date, space, and three ips
if (/^(\S+\s\d+\.\d+\.\d+)(\.\d+)$/) {   # 1 if
  $_=$1.".*\n";
} # end 1
# should split as above, now on words
# first extract date and space
# second back ref is ((word dot)+ word)
elsif (/^(\S+\s)([\S+\.]+[\S+])$/) {     # 2 elseif
  # save the first back ref
  $first=$1;
  # split all of the second on dots (loses them)
  @tmp=split(/\./,$2);
  # $last=$#tmp;
  # $dom=@tmp[$last];
  # $name=@tmp[$last - 1];
  # $_=$first."*.".$name.".".$dom."\n";
  $_=$1."*.".@tmp[$#tmp - 1].".".@tmp[$#tmp]."\n";
} # end 2
monthly.update
The script "monthly.update" updates the file "ai.old" at the end of the month.
#!/bin/sh
# a file to grep the previous httpd log file ...
# ...on the first of the month at about 6 am on crontab,
cd /home/jno/scripts/
zcat /var/log/httpd/access_log.1 > /tmp/tmp1.$$
grep -c jno /tmp/tmp1.$$ > /home/jno/scripts/ai.prior
cat ai.old ai.prior > figure
echo + >> figure
echo p >> figure
dc < figure > ai.old
#run ai.access again,
sh ai.access
rm /tmp/tmp1.$$
send.access
The script "send.access" sends you email (weekly) reporting how many files have been accessed so far during the year. This is file hits, not how many domains have logged in. The script calls "access", which is described below.
#!/bin/bash
# script "send.access" to fetch and send access record
# via email on Sundays
# 11/99 jno -- this script kept in /home/jno/scripts
# -- operated from crontab
# store date
date > today
# add access data
/home/jno/access >> today
# mail info
mail -s lucien_access Jno (at) Blight (dot) com < today
access
As you can see, all the scripts are kept in a subdirectory "scripts" except this one.
echo -n "... new hits "
cd /home/jno/scripts
sh ai.access
cat ai.hits
The results are written to the screen. As you can see, it calls the script "ai.access" in the subdirectory "scripts" which was described above.
errors
The script "errors" sends email, every day late at night, on the errors encountered by the httpd server. Gives you something to do the next day. The error_log is grepped for the account name, the entries are simplified, the information is sent to you via email, and a file named for the date is written to your directory.
#!/bin/bash
# get error_log entries containing "jno"
grep jno /var/log/httpd/error_log > /tmp/jno1.$$
# save only certain fields (" " field separator)
awk -F" " '{ print $2,$3,$4,$8,$13 }' /tmp/jno1.$$ > /tmp/jno2.$$
# save only entries for todays date
date '+%b %d' > datefile
grep "`cat datefile`" /tmp/jno2.$$ > /tmp/jno3.$$
# delete "/home/jno/public_html"
awk ' sub ( /\/home\/jno\/public_html/, " " ) ' /tmp/jno3.$$ > \
  /tmp/jno4.$$
# remove "]" and write to "date" file
awk ' sub ( /\]/, " " ) ' /tmp/jno4.$$ > `date +%b%d`
# add the date and mail the shit
date '+%b %d' >> /tmp/jno4.$$
mail -s inspected Jno (at) Blight (dot) com < /tmp/jno4.$$
# take out the trash
rm -f /tmp/jno?.$$
posting file hits
The script below greps the httpd_log file for file hits and records it to the index page of the web site. The following only counts file hits (you could count domain hits also by grepping for your account name). Dc is used to tally the running total by adding the file hits from the previous months (hits.old) to the current hits (hits.new). The information is inserted into the index.htm file by rewriting it. It takes three steps, and maybe all of 2 seconds to do on the Unix box.
#!/bin/sh
grep -c GET /www/logs/spaces/access_log > hits.new
# calculating total hits
cat hits.old hits.new > figure
echo + >> figure
echo p >> figure
dc < figure > hits
# rewriting the index
sh rewrite
cat hits
rm hits.new figure
"Rewrite" is a script which calls two Unix awk files, which split the index file up into two temporary files (called foo and bar) at the point where -= and =- are found, which is the position where the hits are listed on the index page. Here is "rewrite:"
echo ... rewriting the index file
awk -ffirst.awk /www/jno/spaces/index.htm
awk -fsecond.awk /www/jno/spaces/index.htm
cat foo hits bar > /www/jno/spaces/index.htm
rm foo bar
first.awk and second.awk as follows:
#read all of index file up to -=, prints to foo
# should run as "awk -ffirst.awk index.htm"
/<\!DOCTYPE/, /\-\=/ { print > "foo" }
This assumes that the index.htm file starts with "<!DOCTYPE" -- if not, you could awk from "<html>".
#read all of index file from =- to end, prints to bar
# should run as "awk -fsecond.awk index.htm"
/\=\-/, /<\/html>/ { print > "bar" }
This (again) assumes that the index.htm file ends with "</html>", that is, in lower case. Better check.
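Both assumptions are quick to verify with grep before trusting the rewrite. A sketch only; run it against your own index file (the path here is the one used in the scripts above):

grep -c '<!DOCTYPE' /www/jno/spaces/index.htm   # prints 1 if a DOCTYPE line is present
grep -c '</html>' /www/jno/spaces/index.htm     # prints 1 if a lower-case closing html tag is present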
[Tags overview] [Page design] (File organization) [Tools]