Saturday 13 May 2017

All the Wget Commands You Should Know

How do I download an entire website for offline viewing? How do I save all the MP3s from a website to a folder on my computer? How do I download files that are behind a login page? How do I build a mini-version of Google?

Wget is a free utility, available for Mac, Windows and Linux (included), that can help you accomplish all this and more. What sets it apart from most download managers is that wget can follow the HTML links on a web page and recursively download the files. It is the same tool that a soldier used to download thousands of secret documents from the US Army's intranet that were later published on the WikiLeaks website.


SPIDER WEBSITES WITH WGET – 20 PRACTICAL EXAMPLES

Wget is extremely powerful but, as with most command-line programs, the plethora of options it supports can intimidate new users. What follows is a collection of wget commands for common tasks, from downloading single files to mirroring entire websites. It helps to read through the wget manual, but for busy souls these commands are ready to execute.

1. Download a single file from the Internet
wget http://example.com/file.iso

2. Download a file but save it locally under a different name
wget --output-document=filename.html example.com

3. Download a file and save it in a specific folder
wget --directory-prefix=folder/subfolder example.com

4. Resume an interrupted download previously started by wget itself
wget --continue example.com/big.file.iso

5. Download a file but only if the version on server is newer than your local copy
wget --continue --timestamping wordpress.org/latest.zip

6. Download multiple URLs with wget. Put the list of URLs in another text file on separate lines and pass it to wget.
wget --input-file=list-of-file-urls.txt
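
As a rough sketch of the list-file approach, the snippet below writes a few placeholder URLs (hypothetical, not taken from any real site) to list-of-file-urls.txt, one per line, and counts them before the file is handed to wget. The download line is left commented out so the snippet can be run safely as-is.

```shell
#!/bin/sh
# Write one URL per line. These URLs are placeholders for illustration only.
cat > list-of-file-urls.txt <<'EOF'
http://example.com/file1.iso
http://example.com/file2.iso
http://example.com/file3.iso
EOF

# Count the non-empty lines so you know how many downloads to expect.
url_count=$(grep -c . list-of-file-urls.txt)
echo "$url_count URLs queued"

# Uncomment to start the downloads:
# wget --input-file=list-of-file-urls.txt
```

Keeping the list in a file also makes it easy to rerun the same batch later, for instance together with the continue option from example 4.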

7. Download a list of sequentially numbered files from a server
wget http://example.com/images/{1..20}.jpg
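
Note that the {1..20} range is expanded by the shell (bash or zsh), not by wget itself. In a shell without brace expansion, one possible workaround is to generate the URLs with a loop and feed them to wget as a list. This is only a sketch; the file name numbered-urls.txt is an arbitrary choice for illustration.

```shell
#!/bin/sh
# Build the numbered URLs with a loop; works in shells without brace expansion.
: > numbered-urls.txt
i=1
while [ "$i" -le 20 ]; do
  printf 'http://example.com/images/%d.jpg\n' "$i" >> numbered-urls.txt
  i=$((i + 1))
done

# Strip the padding some wc implementations add around the number.
url_count=$(wc -l < numbered-urls.txt | tr -d ' ')
echo "Generated $url_count URLs"

# Uncomment to download them:
# wget --input-file=numbered-urls.txt
```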

8. Download a web page with all assets – like stylesheets and inline images – that are required to properly display the web page offline.
wget --page-requisites --span-hosts --convert-links --adjust-extension http://example.com/dir/file

MIRROR WEBSITES WITH WGET


9. Download an entire website including all the linked pages and files
wget --execute robots=off --recursive --no-parent --continue --no-clobber http://example.com/

10. Download all the MP3 files from a sub directory
wget --level=1 --recursive --no-parent --accept mp3,MP3 http://example.com/mp3/

11. Download all images from a website in a common folder
wget --directory-prefix=files/pictures --no-directories --recursive --no-clobber --accept jpg,gif,png,jpeg http://example.com/images/

12. Download the PDF documents from a website through recursion but stay within specific domains.
wget --mirror --span-hosts --domains=abc.com,files.abc.com,docs.abc.com --accept=pdf http://abc.com/

13. Download all files from a website but exclude a few directories.
wget --recursive --no-clobber --no-parent --exclude-directories /forums,/support http://example.com


WGET FOR DOWNLOADING RESTRICTED CONTENT


Wget can be used for downloading content from sites that are behind a login screen or ones that check for the HTTP referer and the User Agent strings of the bot to prevent screen scraping.

14. Download files from websites that check the User Agent and the HTTP Referer
wget --referer=http://google.com --user-agent="Mozilla/5.0 Firefox/4.0.1" http://nytimes.com

15. Download files from password-protected sites
wget --http-user=way2trick --http-password=hello123 http://example.com/secret/file.zip


16. Fetch pages that are behind a login page. You need to replace user and password with the actual form fields while the URL should point to the Form Submit (action) page.
wget --cookies=on --save-cookies cookies.txt --keep-session-cookies --post-data 'user=way2trick&password=123' http://example.com/login.php
wget --cookies=on --load-cookies cookies.txt --keep-session-cookies http://example.com/paywall


RETRIEVE FILE DETAILS WITH WGET


17. Find the size of a file without downloading it (look for Content-Length in the response; the size is in bytes)
wget --spider --server-response http://example.com/file.iso
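
To avoid scanning the headers by eye, the size can be filtered out of the response. The following is a minimal sketch, assuming a helper function named content_length (a made-up name, not part of wget) and using invented header lines for the demonstration rather than real server output.

```shell
#!/bin/sh
# Print the value of the last Content-Length header found on standard input.
# The last one is used because a redirect produces several responses.
content_length() {
  tr -d '\r' | awk 'tolower($1) == "content-length:" { size = $2 }
                    END { if (size != "") print size }'
}

# Demonstration with made-up header lines (not real server output).
sample_size=$(printf '  HTTP/1.1 200 OK\n  Content-Type: application/octet-stream\n  Content-Length: 1048576\n' | content_length)
echo "Size in bytes: $sample_size"

# Real usage (wget writes the headers to standard error, hence 2>&1):
# wget --spider --server-response http://example.com/file.iso 2>&1 | content_length
```

If the server does not send a Content-Length header, the function prints nothing.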

18. Download a file and display the content on screen without saving it locally.
wget --output-document=- --quiet google.com/humans.txt



19. Find the last modified date of a web page (check the Last-Modified header in the HTTP response).
wget --server-response --spider http://www.way2trick.blogspot.in/

20. Check the links on your website to ensure that they are working. The spider option will not save the pages locally.
wget --output-file=logfile.txt --recursive --spider http://example.com


WGET – HOW TO BE NICE TO THE SERVER?

The wget tool is essentially a spider that scrapes / leeches web pages but some web hosts may block these spiders with the robots.txt files. Also, wget will not follow links on web pages that use the rel=nofollow attribute.

You can, however, force wget to ignore the robots.txt and the nofollow directives by adding the switch --execute robots=off to all your wget commands. If a web host is blocking wget requests by looking at the User Agent string, you can always fake that with the --user-agent=Mozilla switch.

The wget command will put additional strain on the site’s server because it will continuously traverse the links and download files. A good scraper would therefore limit the retrieval rate and also include a wait period between consecutive fetch requests to reduce the server load.

wget --limit-rate=20k --wait=60 --random-wait --mirror example.com

In the above example, we have limited the download rate to 20 KB/s, and wget will wait anywhere between 30 and 90 seconds before retrieving the next resource, because the random-wait option varies each delay between 0.5 and 1.5 times the wait value.
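
The delay range follows directly from the wait value, and the same arithmetic gives a feel for how long a polite crawl will take. The sketch below works it out for a wait of 60 seconds; the figure of 100 pages is a hypothetical crawl size chosen only for illustration.

```shell
#!/bin/sh
# Random wait varies each delay between 0.5x and 1.5x the wait value.
wait_seconds=60
min_wait=$((wait_seconds / 2))
max_wait=$((wait_seconds * 3 / 2))
echo "Delay per request: ${min_wait}s to ${max_wait}s"

# Rough time spent waiting during a hypothetical crawl of 100 pages,
# using the average delay (equal to the wait value itself).
pages=100
total_minutes=$((pages * wait_seconds / 60))
echo "About $total_minutes minutes of waiting for $pages pages"
```

A lower wait value shortens the crawl but puts correspondingly more load on the server.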
