Wget Mirror Websites

		GNU/Linux Desktop Survival Guide by Graham Williams

20200526 A popular use case for wget is to make a complete copy of a website, perhaps for local perusal or for archiving. For example, we might back up a conference website for archival and historical purposes:

$ wget --mirror --convert-links --adjust-extension --page-requisites \
  --no-parent https://ausdm18.ausdm.org/

This will create a directory called ausdm18.ausdm.org in the current working directory. Opening this directory in a browser, using a URL like file:///home/kayon/ausdm18.ausdm.org, lets us browse the local copy of the website.
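
As a rough sketch of the step above, the following shell functions wrap the mirror command and print the file:// URL of the resulting local copy. The function names mirror_site and local_url are hypothetical, introduced only for illustration, and the sketch assumes wget names the directory after the host part of the URL, as in the example above.

```shell
#!/bin/sh
# Sketch only: mirror_site and local_url are hypothetical helper names.

# Mirror a website into a directory named after its host.
mirror_site() {
    # --mirror            recursion with timestamping and unlimited depth
    # --convert-links     rewrite links in the pages to point at local copies
    # --adjust-extension  add a suffix such as .html where it is missing
    # --page-requisites   also fetch images, stylesheets and similar files
    # --no-parent         never ascend above the starting directory
    wget --mirror --convert-links --adjust-extension --page-requisites \
         --no-parent "$1"
}

# Print the file:// URL of the local copy made in the current directory.
# Assumes the URL has the form scheme://host/path with no port number.
local_url() {
    host="${1#*://}"     # drop the scheme, e.g. https://
    host="${host%%/*}"   # drop everything after the host name
    printf 'file://%s/%s/\n' "$PWD" "$host"
}
```

For example, running mirror_site https://ausdm18.ausdm.org/ followed by local_url https://ausdm18.ausdm.org/ from /home/kayon prints file:///home/kayon/ausdm18.ausdm.org/.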

Another use case is to download all of the available Debian packages (.deb files) that start with r from a particular package mirror, here an Ubuntu archive:

$ wget --mirror --no-parent --accept '.deb' --no-directories \
  http://archive.ubuntu.com/ubuntu/ubuntu/pool/main/r/

Useful command line options include -r (--recursive), which tells wget to recurse through the links found at the given URL. The --mirror option implies --recursive along with some other options, including an unlimited recursion depth (see the manual page for details). The -l (--level) option specifies how many levels to descend into the website, so that -l 1 (--level=1) recurses only a single level. The -A .deb (--accept) option restricts the download to just those files that have a deb extension. The extensions can be given as a comma-separated list. The -nd (--no-directories) option asks wget not to create any directories locally, so all files are downloaded into the current directory.
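
The options described above can be combined into a single command. The following is a minimal sketch, assuming the .deb files are linked directly from the page at the given URL (otherwise a larger --level is needed). The function name fetch_debs is hypothetical, and --no-parent is added here as a precaution so that wget does not climb into the parent directory.

```shell
#!/bin/sh
# Sketch only: fetch_debs is a hypothetical helper name.

# Download the .deb files linked from one listing page into the
# current directory, without recreating the remote directory tree.
fetch_debs() {
    # --recursive       follow the links found on the page
    # --level=1         but only one level deep
    # --no-parent       never ascend above the starting directory
    # --accept          keep only files with the listed suffixes
    # --no-directories  save everything into the current directory
    wget --recursive --level=1 --no-parent --accept '.deb' \
         --no-directories "$1"
}
```

It would be called as fetch_debs followed by the URL of the listing page. To accept several suffixes, the --accept value can be a comma-separated list such as '.deb,.udeb'.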

For a website that no longer exists, the Wayback Machine of the Internet Archive is useful. To copy a website from there, install the Wayback Machine downloader (wayback_machine_downloader) and then run:

$ wayback_machine_downloader http://ausdm17.azurewebsites.net/

Unlike wget with --convert-links, the downloader does not rewrite absolute links to be internally consistent. That will need to be done by hand.
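
Part of that manual work can be scripted. The following is only a sketch under stated assumptions: it uses GNU sed, it handles just the HTML files at the top level of the download directory (where stripping the site prefix leaves a valid relative path), and the function name localise_links is hypothetical.

```shell
#!/bin/sh
# Sketch only: localise_links is a hypothetical helper name.
# Requires GNU sed for in-place editing with -i.

# Strip an absolute site prefix from href and src attributes so the
# links point at files next to the page instead of the original site.
localise_links() {
    dir="$1"    # directory holding the downloaded pages
    site="$2"   # prefix to strip, e.g. http://ausdm17.azurewebsites.net/
    for page in "$dir"/*.html; do
        [ -f "$page" ] || continue
        # A comma is used as the sed delimiter because the URL contains
        # slashes. Dots in the URL are treated as regex wildcards, which
        # is harmless for this purpose.
        sed -i -e "s,href=\"$site,href=\",g" \
               -e "s,src=\"$site,src=\",g" "$page"
    done
}
```

It takes the directory that the downloader wrote to and the original site URL, including the trailing slash. Pages in subdirectories would need a relative prefix such as ../ and are not handled by this sketch.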

Copyright © 1995-2020 Togaware Pty Ltd. Creative Commons ShareAlike V4.