
wget

Wget is a utility designed for retrieving binary documents across the Web, through the use of HTTP (Hyper Text Transfer Protocol) and FTP (File Transfer Protocol), and saving them to disk. Wget is non-interactive, which means it can work in the background while the user is not logged in, unlike most web browsers (thus you may start the program and log off, letting it do its work). Analyzing server responses, it distinguishes between correctly and incorrectly retrieved documents, and retries retrieving them as many times as necessary, or until a user-specified limit is reached. REST is used in FTP on hosts that support it. Proxy servers are supported to speed up the retrieval and lighten network load.

Wget supports a full-featured recursion mechanism, through which you can retrieve large parts of the web, creating local copies of remote directory hierarchies. The maximum level of recursion and other parameters can be specified. Infinite recursion loops are always avoided by hashing the retrieved data. All of this works for both HTTP and FTP.

The retrieval is conveniently traced by printing dots, each dot representing one kilobyte of data received. Built-in features offer mechanisms to tune which links you wish to follow (cf. -L, -D and -H).

URL CONVENTIONS
Most of the URL conventions described in RFC1738 are supported. Two alternative syntaxes are also supported, which means you can use three forms of address to specify a file:

Normal URL (recommended form):
    http://host[:port]/path
    http://fly.cc.fer.hr/
    ftp://ftp.xemacs.org/pub/xemacs/xemacs-19.14.tar.gz
    ftp://username:password@host/dir/file

FTP only (ncftp-like):
    hostname:/dir/file

HTTP only (netscape-like):
    hostname(:port)/dir/file

You may encode your username and/or password in the URL using the form:
    ftp://user:password@host/dir/file

If you do not understand these syntaxes, just use the plain ordinary syntax with which you would call lynx or netscape. Note that the alternative forms are deprecated, and may cease being supported in the future.
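
Since the two alternative forms are deprecated, a script may prefer to rewrite them into the recommended form before calling wget. The following is a minimal sketch of that idea, not something wget provides: normalize_url is a hypothetical helper, and it assumes the only inputs are the three forms listed above (http and ftp only).

```shell
#!/bin/sh
# normalize_url ADDRESS
# Hypothetical helper (not part of wget): rewrites the deprecated address
# forms into the recommended URL form. Only the three forms described in
# URL CONVENTIONS are handled.
normalize_url() {
    case "$1" in
        http://*|ftp://*)
            # Already a normal URL: pass through unchanged.
            printf '%s\n' "$1" ;;
        *:/*)
            # ncftp-like "hostname:/dir/file" becomes an FTP URL.
            printf 'ftp://%s/%s\n' "${1%%:/*}" "${1#*:/}" ;;
        *)
            # netscape-like "hostname[:port]/dir/file" becomes an HTTP URL.
            printf 'http://%s\n' "$1" ;;
    esac
}

normalize_url 'ftp.xemacs.org:/pub/xemacs/xemacs-19.14.tar.gz'
# prints ftp://ftp.xemacs.org/pub/xemacs/xemacs-19.14.tar.gz
normalize_url 'fly.cc.fer.hr:8080/index.html'
# prints http://fly.cc.fer.hr:8080/index.html
```

The check for ":/" is what separates the FTP-only form from the HTTP-only form with a port, because a port number never begins with a slash.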

OPTIONS
There are quite a few command-line options for wget. Note that you do not have to know or to use them unless you wish to change the default behaviour of the program. For simple operations you need no options at all. It is also a good idea to put frequently used command-line options in .wgetrc, where they can be stored in a more readable form.

This is the complete list of options with descriptions, sorted in descending order of importance:

-h --help
    Print a help screen. You will also get help if you do not supply command-line arguments.

-V --version
    Display version of wget.

-v --verbose
    Verbose output, with all the available data. The default output consists only of saving updates and error messages. If the output is stdout, verbose is default.

-q --quiet
    Quiet mode, with no output at all.

-d --debug
    Debug output. This works only if wget was compiled with -DDEBUG. Note that even when the program is compiled with debug output, nothing is printed unless you specify -d.

-i filename --input-file=filename
    Read URLs from filename, in which case no URLs need to be on the command line. If there are URLs both on the command line and in a filename, those on the command line are the first to be retrieved. The filename need not be an HTML document (but no harm if it is); it is enough if the URLs are just listed sequentially. However, if you specify --force-html, the document will be regarded as HTML. In that case you may have problems with relative links, which you can solve either by adding <base href="url"> to the document or by specifying --base=url on the command line.

-o logfile --output-file=logfile
    Log messages to logfile, instead of the default stdout. Verbose output is the default for log files. If you do not wish it, use -nv (non-verbose).

-a logfile --append-output=logfile
    Append to logfile. Same as -o, but appends to the log file (creating a new one if it does not exist) instead of overwriting the old log file.

-t num --tries=num
    Set number of retries to num. Specify 0 for infinite retrying.

--follow-ftp
    Follow FTP links from HTML documents.

-c --continue-ftp
    Continue retrieval of FTP documents from where it was left off. If you specify "wget -c ftp://sunsite.doc.ic.ac.uk/ls-lR.Z", and there is already a file named ls-lR.Z in the current directory, wget will continue retrieval from the offset equal to the length of the existing file. Note that you do not need to specify this option if all you want is for wget to continue retrieving where it left off when the connection is lost; wget does this by default. You need this option when you want to continue retrieval of a file already halfway retrieved, saved by other FTP software, or left by wget being killed.

-g on/off --glob=on/off
    Turn FTP globbing on or off. By default, globbing will be turned on if the URL contains globbing characters (an asterisk, for example). Globbing means you may use special characters (wildcards) to retrieve more files from the same directory at once, like wget ftp://gnjilux.cc.fer.hr/*.msg.
    Globbing currently works only on UNIX FTP servers.

-e command --execute=command
    Execute command, as if it were a part of the .wgetrc file. A command invoked this way will take precedence over the same command in .wgetrc, if there is one.

-N --timestamping
    Use the so-called time-stamps to determine whether to retrieve a file. If the last-modification date of the remote file is equal to, or older than, that of the local file, and the sizes of the files are equal, the remote file will not be retrieved. This option is useful for weekly mirroring of HTTP or FTP sites, since it will not permit downloading of the same file twice.

-F --force-html
    When input is read from a file, force it to be HTML. This enables you to retrieve relative links from existing HTML files on your local disk, by adding <base href="url"> to the HTML, or using --base.

-B base href --base=base href
    Use base href as base reference, as if it were in the file, in the form <base href="base href">. Note that the base in the file will take precedence over the one on the command line.

-r --recursive
    Recursive web-suck. According to the protocol of the URL, this can mean two things. Recursive retrieval of an HTTP URL means that Wget will download the URL you want, parse it as an HTML document (if an HTML document it is), and retrieve the files this document is referring to, down to a certain depth (default 5; change it with -l). Wget will create a hierarchy of directories locally, corresponding to the one found on the HTTP server.

    This option is ideal for presentations, where slow connections should be bypassed. The results will be especially good if relative links were used, since the pages will then work in the new location without change.

    When using this option with an FTP URL, it will retrieve all the data from the given directory and subdirectories, similar to HTTP recursive retrieval.

    You should be warned that invoking this option may cause grave overloading of your connection. The load can be minimized by lowering the maximal recursion level (see -l) and/or by lowering the number of retries (see -t).

-m --mirror
    Turn on mirroring options. This will set recursion and time-stamping, combining -r and -N.

-l depth --level=depth
    Set recursion depth level to the specified level. Default is 5. After the given recursion level is reached, the sucking will proceed from the parent. Thus specifying -r -l1 should equal a recursion-less retrieve from file.

    Setting the level to zero makes recursion depth (theoretically) unlimited. Note that the number of retrieved documents will increase exponentially with the depth level.

-H --span-hosts
    Enable spanning across hosts when doing recursive retrieving. See -r and -D. Refer to FOLLOWING LINKS for a more detailed description.

-L --relative
    Follow only relative links. Useful for retrieving a specific homepage without any distractions, not even those from the same host. Refer to FOLLOWING LINKS for a more detailed description.

-D domain-list --domains=domain-list
    Set domains to be accepted and DNS looked-up, where domain-list is a comma-separated list. Note that it does not turn on -H. This speeds things up, even if only one host is spanned. Refer to FOLLOWING LINKS for a more detailed description.

-A acclist / -R rejlist --accept=acclist / --reject=rejlist
    Comma-separated list of extensions to accept/reject. For example, if you wish to download only GIFs and JPEGs, you will use -A gif,jpg,jpeg. If you wish to download everything except cumbersome MPEGs and .AU files, you will use -R mpg,mpeg,au.

-X list --exclude-directories list
    Comma-separated list of directories to exclude from FTP fetching.

-P prefix --directory-prefix=prefix
    Set directory prefix ("." by default) to prefix. The directory prefix is the directory where all other files and subdirectories will be saved.

-T value --timeout=value
    Set the read timeout to the specified value. Whenever a read is issued, the file descriptor is checked for a possible timeout, which could otherwise leave a pending connection (uninterrupted read). The default timeout is 900 seconds (fifteen minutes).

-Y on/off --proxy=on/off
    Turn proxy on or off. The proxy is on by default if the appropriate environment variable is defined.

-Q quota[KM] --quota=quota[KM]
    Specify download quota, in bytes (default), kilobytes or megabytes. This is more useful in the rc file; see STARTUP FILE.

-O filename --output-document=filename
    The documents will not be written to the appropriate files; instead, all of them will be appended to the single file named by this option. The number of tries will be automatically set to 1. If this filename is `-', the documents will be written to stdout, and --quiet will be turned on. Use this option with caution, since it turns off all the diagnostics Wget can otherwise give about various errors.

-S --server-response
    Print the headers sent by the HTTP server and/or responses sent by the FTP server.

-s --save-headers
    Save the headers sent by the HTTP server to the file, before the actual contents.

--header=additional-header
    Define an additional header. You can define more than one additional header. Do not try to terminate the header with CR or LF.

--http-user --http-passwd
    Use these two options to set the username and password Wget will send to HTTP servers. Wget supports only the basic WWW authentication scheme.

-nc
    Do not clobber existing files when saving to a directory hierarchy within recursive retrieval of several files. This option is extremely useful when you wish to continue where you left off with retrieval. If the files are .html or (yuck) .htm, they will be loaded from the disk and parsed as if they had been retrieved from the Web.

-nv
    Non-verbose: turn off verbose without being completely quiet (use -q for that), which means that error messages and basic information still get printed.

-nd
    Do not create a hierarchy of directories when retrieving recursively. With this option turned on, all files will get saved to the current directory, without clobbering (if a name shows up more than once, the filenames will get extensions .n).

-x
    The opposite of -nd: force creation of a hierarchy of directories even if it would not have been done otherwise.

-nh
    Disable time-consuming DNS lookup of almost all hosts. Refer to FOLLOWING LINKS for a more detailed description.

-nH
    Disable host-prefixed directories. By default, http://fly.cc.fer.hr/ will produce a directory named fly.cc.fer.hr in which everything else will go. This option disables such behaviour.

--no-parent
    Do not ascend to parent directory.

-k --convert-links
    Convert the non-relative links to relative ones locally.
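
The time-stamping rule described under -N can be written out as a small predicate. This is only an illustrative sketch of the rule as worded in this document, not wget's actual code: needs_retrieval is a hypothetical function name, and it assumes modification times are given as seconds since the epoch and sizes in bytes.

```shell
#!/bin/sh
# needs_retrieval REMOTE_MTIME LOCAL_MTIME REMOTE_SIZE LOCAL_SIZE
# Hypothetical helper illustrating the -N rule. Returns 0 (true) when the
# remote file should be downloaded and 1 when it can be skipped.
needs_retrieval() {
    remote_mtime=$1
    local_mtime=$2
    remote_size=$3
    local_size=$4
    # Skip only when the remote copy is not newer AND the sizes are equal.
    if [ "$remote_mtime" -le "$local_mtime" ] &&
       [ "$remote_size" -eq "$local_size" ]; then
        return 1
    fi
    return 0
}

if needs_retrieval 300 200 1024 1024; then
    echo "remote file is newer: retrieve"
fi
if ! needs_retrieval 100 200 1024 1024; then
    echo "local copy is current: skip"
fi
```

Note that a size mismatch forces a download even when the remote file is older, which is what lets a truncated local copy be replaced.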

FOLLOWING LINKS
Recursive retrieving has a mechanism that allows you to specify which links wget will follow.

Only relative links
    When only relative links are followed (option -L), recursive retrieving will never span hosts. gethostbyname will never get called, and the process will be very fast, with minimal strain on the network. This will suit your needs most of the time, especially when mirroring the output of *2html converters, which generally produce only relative links.

Host checking
    The drawback of following only relative links is that humans often tend to mix them with absolute links to the very same host, and the very same page. In this mode (which is the default), all URLs that refer to the same host will be retrieved.

    The problem with this option is the aliases of hosts and domains: wget cannot know from the names alone that regoc.srce.hr and www.srce.hr are the same host, or that fly.cc.fer.hr is the same as fly.cc.etf.hr. Whenever an absolute link is encountered, gethostbyname is called to check whether we are really on the same host. Although the results of gethostbyname are hashed, so that it will never get called twice for the same host, it still presents a nuisance, for example in large indexes of different hosts, when each of them has to be looked up.

    You can use -nh to prevent such complex checking, and then wget will just compare the hostnames. Things will run much faster, but also much less reliably.

Domain acceptance
    With the -D option you may specify domains that will be followed. The nice thing about this option is that hosts that are not from those domains will not get DNS-looked up. Thus you may specify -Dmit.edu, just to make sure that nothing outside .mit.edu gets looked up. This is very important and useful. It also means that -D does not imply -H (it must be explicitly specified). Feel free to use this option, since it will speed things up greatly, with almost all the reliability of host checking of all hosts.

    Of course, domain acceptance can also be used to limit the retrieval to particular domains while freely spanning hosts within those domains, but then you must explicitly specify -H.

All hosts
    When -H is specified without -D, all hosts are spanned. It is useful to set the recursion level to a small value in those cases. This option is rarely useful.

FTP
    The rules for FTP are somewhat specific, since they have to be. To have FTP links followed from HTML documents, you must specify -f (follow_ftp). If you do specify it, FTP links will be able to span hosts even if span_hosts is not set. Option relative_only (-L) has no effect on FTP. However, domain acceptance (-D) and suffix rules (-A/-R) still apply.
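
Domain acceptance can be pictured as a cheap string test that runs before any DNS lookup. The sketch below is an assumption about how such a test might look, not wget's implementation: domain_accepted is a hypothetical function, and it assumes a host is accepted when it equals a listed domain or ends with a dot followed by that domain.

```shell
#!/bin/sh
# domain_accepted HOST DOMAIN_LIST
# Hypothetical helper: DOMAIN_LIST is comma-separated, as given to -D.
# Returns 0 if HOST is one of the domains or a host inside one of them.
# The exact matching rule is an assumption made for this illustration.
domain_accepted() {
    host=$1
    old_ifs=$IFS
    IFS=,                      # split the list on commas
    for domain in $2; do
        case "$host" in
            "$domain"|*."$domain")
                IFS=$old_ifs
                return 0 ;;
        esac
    done
    IFS=$old_ifs
    return 1
}

if domain_accepted www.mit.edu mit.edu; then
    echo "www.mit.edu: accepted, may be looked up"
fi
if ! domain_accepted www.yahoo.com mit.edu,srce.hr; then
    echo "www.yahoo.com: rejected without a DNS lookup"
fi
```

Because the test works on the name alone, hosts outside the list are discarded before gethostbyname would be needed, which is where the speed-up described above comes from.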

STARTUP FILE
Wget supports the use of an initialization file, .wgetrc. First a system-wide init file will be looked for (/usr/local/lib/wgetrc by default) and loaded. Then the user's file will be searched for in two places: the environment variable WGETRC (which is presumed to hold the full pathname) and $HOME/.wgetrc. Note that the settings in the user's startup file may override the system settings, which includes the quota settings (he he).

The syntax of each line of the startup file is simple:

    variable = value

Valid values are different for different variables. The complete set of commands is listed below, the word after the equals sign denoting the value the command takes: on/off for on or off (which can also be 1 or 0), string for any string, or N for a positive integer. For example, you may specify "use_proxy = off" to disable use of proxy servers by default. You may use inf for an infinite value (the role of 0 on the command line), where appropriate.

The commands are case-insensitive and underscore-insensitive, thus DIr_Prefix is the same as dirprefix. Empty lines, lines consisting of spaces, or lines beginning with '#' are skipped. Most of the commands have an equivalent command-line option, except some more obscure or rarely used ones. A sample init file is provided in the distribution, named sample.wgetrc.

accept/reject = string
    Same as -A/-R.

add_hostdir = on/off
    Enable/disable host-prefixed directories. -nH disables it.

always_rest = on/off
    Enable/disable continuation of the retrieval, the same as -c.

base = string
    Set base for relative URLs, the same as -B.

convert_links = on/off
    Convert non-relative links locally. The same as -k.

debug = on/off
    Debug mode, same as -d.

dir_mode = N
    Set permission modes of created subdirectories (default is 755).

dir_prefix = string
    Top of directory tree, the same as -P.

dirstruct = on/off
    Turn dirstruct on or off, the same as -x or -nd, respectively.

domains = string
    Same as -D.

follow_ftp = on/off
    Follow FTP links from HTML documents, the same as -f.

force_html = on/off
    If set to on, force the input filename to be regarded as an HTML document, the same as -F.

ftp_proxy = string
    Use the string as FTP proxy, instead of the one specified in the environment.

glob = on/off
    Turn globbing on/off, the same as -g.

header = string
    Define an additional header, like --header.

http_passwd = string
    Set HTTP password.

http_proxy = string
    Use the string as HTTP proxy, instead of the one specified in the environment.

http_user = string
    Set HTTP user.

input = string
    Read the URLs from filename, like -i.

kill_longer = on/off
    Consider data longer than specified in the content-length header as invalid (and retry getting it). The default behaviour is to save as much data as there is, provided it is at least as long as the value in content-length.

logfile = string
    Set logfile, the same as -o.

login = string
    Your user name on the remote machine, for FTP. Defaults to "anonymous".

mirror = on/off
    Turn mirroring on/off. The same as -m.

noclobber = on/off
    Same as -nc.

no_parent = on/off
    Same as --no-parent.

no_proxy = string
    Use the string as the comma-separated list of domains to avoid in proxy loading, instead of the one specified in the environment.

num_tries = N
    Set number of retries per URL, the same as -t.

output_document = string
    Set the output filename, the same as -O.

passwd = string
    Your password on the remote machine, for FTP. Defaults to username@hostname.domainname.

quiet = on/off
    Quiet mode, the same as -q.

quota = quota
    Specify the download quota, which is useful to put in /usr/local/lib/wgetrc. When a download quota is specified, wget will stop retrieving after the download sum has become greater than the quota. The quota can be specified in bytes (default), kbytes ('k' appended) or mbytes ('m' appended). Thus "quota = 5m" will set the quota to 5 mbytes. Note that the user's startup file overrides system settings.

reclevel = N
    Recursion level, the same as -l.

recursive = on/off
    Recursive on/off, the same as -r.

relative_only = on/off
    Follow only relative links (the same as -L). Refer to section FOLLOWING LINKS for a more detailed description.

robots = on/off
    Use (or not) the robots.txt file.

server_response = on/off
    Choose whether or not to print the HTTP and FTP server responses, the same as -S.

simple_host_check = on/off
    Same as -nh.

span_hosts = on/off
    Same as -H.

timeout = N
    Set timeout value, the same as -T.

timestamping = on/off
    Turn timestamping on/off. The same as -N.

use_proxy = on/off
    Turn proxy support on/off. The same as -Y.

verbose = on/off
    Turn verbose on/off, the same as -v/-nv.
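
To try startup-file settings without touching $HOME/.wgetrc, one option is to write them to a scratch file and point the WGETRC variable at it. The sketch below does only that and does not run wget. The command names are taken from the list above as this document gives them (other wget versions may spell them differently), and the values and the temporary file location are arbitrary choices for illustration.

```shell
#!/bin/sh
# Create a scratch startup file; mktemp picks the location.
rcfile=$(mktemp) || exit 1

# Quoted 'EOF' keeps the shell from expanding anything in the body.
cat > "$rcfile" <<'EOF'
# Lines beginning with '#' are skipped by wget.
num_tries = 3
reclevel = 2
use_proxy = off
quota = 5m
dir_prefix = downloads
EOF

# WGETRC is presumed to hold the full pathname of the user's startup file.
WGETRC=$rcfile
export WGETRC

echo "startup file written to $WGETRC"
```

Any wget started from this shell afterwards would read the scratch file in place of $HOME/.wgetrc; remove the file and unset WGETRC when done.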

SIGNALS
Wget will catch the SIGHUP (hangup signal) and ignore it. If the output was on stdout, it will be redirected to a file named wget-log_. This is also convenient when you wish to redirect the output of Wget interactively.

    $ wget http://www.ifi.uio.no/~larsi/gnus.tar.gz &
    $ kill -HUP %% # to redirect the output

Wget will not try to handle any signals other than SIGHUP. Thus you may interrupt Wget using ^C or SIGTERM.

EXAMPLES
Get URL http://fly.cc.fer.hr/:
    wget http://fly.cc.fer.hr/

Force non-verbose output:
    wget -nv http://fly.cc.fer.hr/

Unlimit number of retries:
    wget -t0 http://www.yahoo.com/

Create a mirror image of fly's web (with the same directory structure the original has), up to six recursion levels, with only one try per document, saving the verbose output to log file 'log':
    wget -r -l6 -t1 -o log http://fly.cc.fer.hr/

Retrieve from yahoo host only (depth 50):
    wget -r -l50 http://www.yahoo.com/

ENVIRONMENT
http_proxy, ftp_proxy, no_proxy, WGETRC, HOME

FILES
/usr/local/lib/wgetrc, $HOME/.wgetrc

UNRESTRICTIONS
Wget is free; anyone may redistribute copies of Wget to anyone under the terms stated in the General Public License, a copy of which accompanies each copy of Wget.

SEE ALSO
lynx(1), ftp(1)

AUTHOR
Hrvoje Niksic is the author of Wget. Thanks to the beta testers and all the other people who helped with useful suggestions.
