41.9 Wget UserAgent Browser Identification


Some sites check whether a request identifies itself as coming from a browser and, if it does not, return an error response (typically 403 Forbidden). This is to prevent automated programs from consuming the site’s bandwidth. By overriding this check we place a burden on the website’s owner. The owner may also employ other mechanisms to identify and block robots, and may even decide to block your IP address temporarily or permanently. So do consider this before deciding to override the website owner’s choices.

Programs, including the command line wget, may not report a UserAgent to the website to which they are connecting, or they may accurately report what they are (wget, for example).

For wget, the reported UserAgent can be changed with -U or --user-agent to avoid the 403 error:

$ wget -U "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)" https://example.com/paper.pdf
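If the same UserAgent is needed for several downloads, the option can be wrapped in a small shell function. The following is only a minimal sketch: the function name fetch_with_ua and the variable WGET_UA are hypothetical names chosen for illustration, the default string is simply the one used above, and the 2 second --wait is an assumed courtesy setting rather than a recommendation from any site.

```shell
# Hypothetical helper: run wget with a chosen UserAgent string.
# WGET_UA is an illustrative variable name; if it is unset, the
# example string from this section is used instead.
fetch_with_ua() {
    # --user-agent is the long form of -U.
    # --wait=2 pauses 2 seconds between retrievals when several URLs
    # are given, to reduce the load placed on the site (assumed value).
    wget --user-agent="${WGET_UA:-Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)}" \
         --wait=2 "$@"
}
```

Once the function is defined in the current shell, it can be called with one or more URLs:

$ fetch_with_ua https://example.com/paper.pdf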

Copyright © 1995-2022 Graham.Williams@togaware.com Creative Commons Attribution-ShareAlike 4.0