HTML Conversions
Converting an html file to plain text
The following shell script will convert any html file to plain text and keep the links as a list at the end of the text file. It assumes that the *.htm or *.html files to be converted are in the directory where the script (h2t) is executed.
#!/bin/sh
# h2t, convert all htm and html files of a directory to text
for file in `ls *.htm`
do
    new=`basename $file htm`
    lynx -dump $file > ${new}txt
done
#####
for file in `ls *.html`
do
    new=`basename $file html`
    lynx -dump $file > ${new}txt
done

I did not send the error messages from ls to /dev/null, so if a file is not found, you will get a screen message "file not found". To have all the internal links referenced by a list at the end of the text file, you will need to set Lynx up correctly.
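The ls error message mentioned above can be avoided by letting the shell expand the file patterns directly. The following is only a sketch of that idea, not a replacement for h2t: the function names txt_name and h2t_all are made up for illustration, and it assumes lynx is installed and that the files sit in the current directory.

```shell
#!/bin/sh
# Sketch of an h2t variant: same "lynx -dump" conversion, but the shell
# expands the patterns itself instead of parsing ls output.

# txt_name (hypothetical helper): page.htm or page.html -> page.txt
txt_name() {
    case $1 in
        *.html) printf '%s\n' "${1%.html}.txt" ;;
        *.htm)  printf '%s\n' "${1%.htm}.txt" ;;
        *)      printf '%s\n' "$1.txt" ;;
    esac
}

# h2t_all (hypothetical name): convert every .htm/.html file in the
# current directory to a .txt file of the same base name.
h2t_all() {
    for file in *.htm *.html
    do
        # An unmatched pattern stays literal ("*.htm"); skip it quietly.
        [ -f "$file" ] || continue
        lynx -dump "$file" > "$(txt_name "$file")"
    done
}
```

Calling h2t_all on the last line of the script runs the conversion. Quoting "$file" also lets file names that contain spaces pass through intact.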
Converting a plain text file to html
This is a shell script built from a chain of sed commands. If the text is reasonably well formatted, a fully usable html file will result. As set up below, the script (t2h) will do the following:
- substitute the word " at " for any "@" signs
- replace any control characters with a space
- reduce lines containing only tabs and blanks to empty lines
- remove duplicate blank lines (leaving one between paragraphs)
- place /UL and P tags on the remaining blank lines (paragraphing)
- remove the line break after the /UL and P tags (so they end up at the start of the next line)
- indent any paragraph (not line) starting with a quote mark (and remove leading tabs and spaces)
- introduce a BR tag on any line starting with a hyphen (and remove leading tabs and spaces)
- convert http://URL to a link
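Two of the steps in this list, the blank-line squeeze and the URL linking, can be tried on their own before running a whole file through the script. The sketch below wraps each in a small filter; the function names squeeze_blank and link_urls are hypothetical, and the URL pattern is written as plain [[:graph:]]*, which already covers the dot and slash characters.

```shell
#!/bin/sh
# Sketch: two of the listed steps as stand-alone filters (hypothetical names).

# squeeze_blank: collapse a run of empty lines into one empty line.
# N appends the next line; if both lines are empty, D drops the first
# and restarts the cycle with what is left.
squeeze_blank() {
    sed -e '/^$/{
N
/^\n$/D
}'
}

# link_urls: wrap every http:// URL in an anchor tag, showing the URL
# itself in square brackets as the link text.
link_urls() {
    sed -e 's/http:\/\/[[:graph:]]*/<A HREF="&">[&]<\/A> /g'
}
```

For example, `printf 'one\n\n\n\ntwo\n' | squeeze_blank` leaves a single empty line between "one" and "two". Note that link_urls treats every non-blank character after http:// as part of the URL, so punctuation that directly follows a URL ends up inside the link.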
#!/bin/sh
# t2h {$1} html-ize a text file and save as foo.htm
NL="
"
cat "$1" \
| sed -e 's/@/ at /g' \
| sed -e 's/[[:cntrl:]]/ /g'\
| sed -e 's/^[[:space:]]*$//g' \
| sed -e '/^$/{'"$NL"'N'"$NL"'/^\n$/D'"$NL"'}' \
| sed -e 's/^$/<\/UL><P>/g' \
| sed -e '/<P>$/{'"$NL"'N'"$NL"'s/\n//'"$NL"'}'\
| sed -e 's/<P>[[:space:]]*"/<P><UL>"/' \
| sed -e 's/^[[:space:]]*-/<BR> -/g' \
| sed -e 's/http:\/\/[[:graph:]\.\/]*/<A HREF="&">[&]<\/A> /g'\
> foo.htm

Obviously the HEAD section of the html file and the enclosing BODY and HTML tags are not written (these tags are also not required under HTML-4). They could be added with an additional line to the script, like:
cat header foo.htm tail > bar.htm

Additionally, you could include other HTML tags within the original text, as long as you do not use something which the sed script would alter.
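The header and tail files used by the cat command are not shown. As a rough sketch of what they might contain, the following writes a minimal HTML 4.01 opening and closing and then joins them around a page. The function names make_wrappers and wrap_page, the choice of the Transitional doctype, and the title text are all assumptions made for illustration.

```shell
#!/bin/sh
# Sketch: write hypothetical "header" and "tail" files, then wrap a page.

# make_wrappers: create both files in the current directory.
make_wrappers() {
    cat > header <<'EOF'
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<HTML>
<HEAD>
<TITLE>Converted text</TITLE>
</HEAD>
<BODY>
EOF
    cat > tail <<'EOF'
</BODY>
</HTML>
EOF
}

# wrap_page IN OUT: the same join as "cat header foo.htm tail > bar.htm".
wrap_page() {
    cat header "$1" tail > "$2"
}
```

After make_wrappers has been run once, `wrap_page foo.htm bar.htm` produces the complete page. The placeholder title would need to be edited for each document.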
Website Provider: Outflux.net, www.Outflux.net
URL:http://jnocook.net/geek/htm.htm