Tuesday, August 30, 2011

wget is my friend

sometimes you need to yank a site and edit it, and really don't feel like doing it live. one way to get the content is cp'ing the directory and doing whatever you want to the copy. other times wget is a quick and dirty option, if you're particularly lazy or if your content is all over the place. here are the command lines i typically use for wget:
wget --random-wait -r -p -e robots=off -U mozilla http://your.site.org <- yanks *
wget --random-wait -r --no-parent -p http://your.site.org/content/dir <- grabs a dir

some useful switches, per the man page:
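-p ; grab everything needed to display the page, including images and stylesheets.
-e robots=off ; do not obey the site's robots.txt.
-U mozilla ; user-agent string to send (browser identity).
--random-wait ; vary the pause between requests (0.5 to 1.5 times the --wait value), which makes a server black list less likely. it scales --wait, so pair it with --wait=seconds.
--limit-rate=15k ; throttle the download rate.
-b ; go to the background right after startup.
-o logfile ; write the log to logfile (as opposed to scrolling on screen).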
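
a minimal sketch of how those switches might be combined into one reusable shell function. the function name mirror_site, the DRY_RUN variable, the 2 second --wait and the wget.log default are all made up for illustration; adjust to taste.

```shell
#!/bin/sh
# mirror_site URL [LOGFILE]
# hypothetical helper: grabs a directory politely, in the background, logging to a file.
# set DRY_RUN=1 to print the wget command instead of running it.
mirror_site() {
    url=$1
    log=${2:-wget.log}   # assumed default log name

    # --wait=2 is an assumed base delay; --random-wait varies it between requests
    set -- wget --wait=2 --random-wait --limit-rate=15k \
        -r -p --no-parent -b -o "$log" "$url"

    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$@"        # show what would be run
    else
        "$@"             # actually run wget
    fi
}
```

running it as DRY_RUN=1 mirror_site http://your.site.org/content/dir prints the full command line, so you can eyeball the switches before letting it loose on a real server.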
