I had a set of XML files, each containing a URL between <url></url> tags, and I needed a way to check whether each URL was valid. I came up with the following bash script:
#!/bin/bash
for e in $(find . -type f); do
  # strip the junk before and after to get a clean url
  url=$(grep url "$e" | sed 's/.*<url>//' | sed 's/<\/url>.*//')
  # send wget output (stdout and stderr) to a temp file
  wget -nv --spider "$url" > tmp.file 2>&1
  # if the url was good, the grep count will be 0
  # bad files get added to a delete script
  # good files go to a good list - this way we can count the
  # results with wc -l
  if [ "$(grep -c 404 tmp.file)" != "0" ]; then
    echo "rm -f $e" >> delete.lst
  else
    echo "$e" >> good.lst
  fi
done
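A possible variation (not part of the original script): grepping the wget log for "404" can false-positive if "404" happens to appear in the URL itself, and it won't catch other failures such as 500s or DNS errors. A minimal sketch using curl to check the HTTP status code directly is below, keeping the same good.lst / delete.lst filenames; the 200-only check and the timeout value are assumptions.

#!/bin/bash
# sketch: treat anything other than a final 200 as bad
for e in $(find . -type f); do
  url=$(grep url "$e" | sed 's/.*<url>//' | sed 's/<\/url>.*//')
  # -I sends a HEAD request, -L follows redirects,
  # -w prints just the final status code, --max-time avoids hangs
  status=$(curl -s -I -L -o /dev/null --max-time 10 -w "%{http_code}" "$url")
  if [ "$status" != "200" ]; then
    echo "rm -f $e" >> delete.lst
  else
    echo "$e" >> good.lst
  fi
done

Either way, you can count the results with `wc -l good.lst delete.lst` and, once you've reviewed it, run the generated delete script with `sh delete.lst`.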
Russ