Thursday, November 5, 2009

Using wget to check if files exist

I had a set of XML files, each containing a URL inside a <url></url> element, and I needed a way to check whether each URL was still valid.
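The files looked something like this (a hypothetical layout - only the <url> element matters to the script, the surrounding element names are made up):

<?xml version="1.0"?>
<entry>
  <url>http://example.com/some/page.html</url>
</entry>

I came up with the following bash script: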


#!/bin/bash
for e in `find . -type f -name '*.xml'`; do

  # strip the junk before and after to get a clean url
  url=`grep '<url>' "$e" | sed 's/.*<url>//' | sed 's/<\/url>.*//'`

  # I put wget output to a file - stdout and stderr to a temp file
  wget -nv --spider "$url" > tmp.file 2>&1

  # if it was good, then a grep count should return 0
  # bad files get added to a new delete script
  # good files go to a good list - this way we can count the
  # results with wc -l

  if [ "`grep -c 404 tmp.file`" != "0" ]; then
    echo "rm -f $e" >> delete.lst
  else
    echo "$e" >> good.lst
  fi
done
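
To use it, run the script from the directory holding the XML files, then count the results; delete.lst is itself a runnable script. A sketch of the workflow, assuming the script above is saved as checkurls.sh (a name I made up):

sh checkurls.sh
wc -l good.lst
sh delete.lst

If you'd rather not grep the output for 404, wget's own exit status works too - it returns non-zero when the server sends an error response, so a variant of the loop body could be:

if wget -nv --spider "$url" > /dev/null 2>&1; then
  echo "$e" >> good.lst
else
  echo "rm -f $e" >> delete.lst
fi

One difference worth noting: the 404-grep version only catches 404s, while the exit-status version treats any wget failure (DNS errors, timeouts, other HTTP error codes) as bad.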

Russ
