Preferences

To customize the download, click the Preferences button in the SiteSucker window. This displays the Preferences dialog, which allows you to exercise more control over what is actually retrieved by SiteSucker.

Options

The Options tab provides the following preferences:

Log Errors

Check this box to log any errors that might occur during the download. This information is written to the SiteSucker Log file, which is stored in the Download Folder.

Log Warnings

Check this box to log any warnings that might occur during the download. This information is written to the SiteSucker Log file, which is stored in the Download Folder. SiteSucker will log warnings for any pages that use JavaScript or for any files that could not be downloaded because they were disallowed by robots.txt or the Robots META tag.

Log Download History

Check this box to log the URL of every file downloaded. This information is written to the SiteSucker Log file, which is stored in the Download Folder.

Check All Links

Check this box to have SiteSucker check all links in all downloaded HTML files — including links to files that you are not downloading — and log any errors that occur. (The Log Errors option is checked automatically when this option is selected.) With this option turned on, SiteSucker will report many errors that you normally wouldn't see. This preference is intended as a debugging tool for Web designers who want to see if their own sites have any bad links.

Localize HTML

Check this box to "localize" downloaded files so that you will get the best results when browsing them offline. This feature modifies the downloaded HTML documents by replacing every link to a file on a Web server with the corresponding link to the local file.

If there isn't a local file associated with a link, this preference will ensure that the link points to the file on the Web server. For example, if you are only downloading a single HTML document (i.e., the Limit to Level preference is set to 1), then SiteSucker will convert any relative links on the page to absolute links for the original site. This is done to eliminate missing images when the downloaded page is viewed in a Web browser.
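
The link rewriting described above can be sketched roughly as follows. This is only an illustration of the idea, not SiteSucker's actual code: the function name resolve_link and the downloaded mapping are hypothetical, and the file names are made up.

```python
from urllib.parse import urljoin

def resolve_link(page_url, href, local_files):
    """Return the local path if the target was downloaded, else an absolute URL.

    local_files is a hypothetical mapping of absolute URL -> local file path.
    """
    absolute = urljoin(page_url, href)          # relative link -> absolute URL
    return local_files.get(absolute, absolute)  # prefer the local copy if any

page = "http://www.xyz.com/something/anotherthing/main.html"
downloaded = {"http://www.xyz.com/something/anotherthing/logo.gif": "logo.gif"}

print(resolve_link(page, "logo.gif", downloaded))       # local copy exists
print(resolve_link(page, "../other.html", downloaded))  # falls back to the Web server
```

A link with no local counterpart ends up as an absolute URL, which is why images still load when only a single page has been downloaded.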

Drag Triggers Download

Check this box to have SiteSucker automatically start the download after you drag a URL into the SiteSucker window. (For more information on SiteSucker's support for drag-and-drop, see Web URL.)

Ignore Robot Exclusions

Check this box to have SiteSucker ignore robots.txt exclusions and the Robots META tag.

Warning: Ignoring robot exclusions is not recommended. Robot exclusions are usually put in place for a good reason and should be obeyed.

By default, SiteSucker honors robots.txt exclusions and the Robots META tag. The robots.txt file allows the Web site administrator to define which parts of a site are off-limits to specific robots, like SiteSucker. For example, Web administrators can disallow access to CGI, private, and temporary directories because they do not want pages in those areas downloaded. In addition to server-wide robot control using robots.txt, Web page creators can also use the Robots META tag to specify that the links on a page should not be followed by robots.
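
To show how a robots.txt exclusion is evaluated in general, here is a small sketch using Python's standard urllib.robotparser module. It is not how SiteSucker is implemented; the rules, the agent name "ExampleBot", and the URLs are invented for the example.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt that closes off two directories to every robot.
rules = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

def allowed(url, agent="ExampleBot"):
    """True if the parsed robots.txt rules permit this agent to fetch url."""
    return parser.can_fetch(agent, url)

print(allowed("http://www.xyz.com/index.html"))      # outside the disallowed areas
print(allowed("http://www.xyz.com/cgi-bin/search"))  # inside /cgi-bin/
```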

Ambiguous URLs Are Files

Check this box to have SiteSucker treat ambiguous URLs as files. If a URL does not end with a '/' and the last path component does not have a file extension, SiteSucker considers it to be ambiguous. When this option is off, SiteSucker adds a '/' to the end of ambiguous URLs.
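
The ambiguity rule above can be expressed as a short check. This is a sketch under simplifying assumptions (URLs without query strings or fragments); the function names are hypothetical.

```python
import posixpath
from urllib.parse import urlsplit

def is_ambiguous(url):
    """A URL is ambiguous if it does not end with '/' and its last
    path component has no file extension."""
    path = urlsplit(url).path
    if path.endswith("/"):
        return False
    last_component = posixpath.basename(path)
    return posixpath.splitext(last_component)[1] == ""

def normalize(url, ambiguous_urls_are_files):
    """With the option off, ambiguous URLs get a trailing '/'."""
    if is_ambiguous(url) and not ambiguous_urls_are_files:
        return url + "/"
    return url

print(normalize("http://www.xyz.com/something/anotherthing", False))
print(normalize("http://www.xyz.com/something/anotherthing", True))
```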

Identity

Use this control to customize the way SiteSucker identifies itself when making a request. Some sites are very particular about which browsers they will allow. You can use this feature to "fool" the site into thinking that you are using an approved browser. To change SiteSucker's identity, simply click on this control and select one of the Web browsers listed. (If you choose "None", SiteSucker will not include any identifying information when making requests.) If the browser that you want isn't listed, you can add it by selecting the "Customize..." menu item. This will display the Customize Identities dialog.

To add a new identity, select "New Identity" in the pop-up menu, enter the Web browser name and the appropriate user-agent string for the browser, and click the Add button. To modify one of the existing identities, select it in the pop-up menu, change the Web browser name or the user-agent string for the browser, and click the Replace button. To delete one of the existing identities, select it in the pop-up menu and click the Delete button. To restore the original identities, click the Defaults button.
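
An identity amounts to a browser name paired with a user-agent string that is sent in the request's User-Agent header. The sketch below illustrates that pairing only; the identity table and the user-agent string in it are hypothetical placeholders, not values shipped with SiteSucker.

```python
# Hypothetical identity table: browser name -> user-agent string.
IDENTITIES = {
    "None": None,
    "Example Browser": "ExampleBrowser/1.0 (hypothetical)",
}

def request_headers(identity, identities=IDENTITIES):
    """Return the identifying headers to send for the chosen identity.

    "None" yields no User-Agent header at all.
    """
    user_agent = identities.get(identity)
    return {"User-Agent": user_agent} if user_agent else {}

print(request_headers("Example Browser"))
print(request_headers("None"))
```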

Replace Files

Use this control to specify when SiteSucker should replace existing files. You can choose Never, Always, or With Newer.

If "Never" is selected, SiteSucker will never replace your local files and will only download those files that haven't already been downloaded.

If "Always" is selected, SiteSucker will always delete your local files and replace them with files downloaded from the Internet.

If "With Newer" is selected, SiteSucker will only replace existing files if a newer copy is found on the Internet. Specifically, when SiteSucker analyzes a site, it will download header information for each file to check its "last modified" date, but it won't download the file itself unless the file on the server is newer than your local copy.
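
The three choices can be summarized as a small decision function. This is a sketch, not SiteSucker's implementation: it assumes the server supplies a Last-Modified header in the usual HTTP date format and that the local file's modification time is available as a timezone-aware datetime.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def should_download(policy, local_mtime, last_modified_header):
    """Decide whether to fetch a file.

    local_mtime is None when no local copy exists yet.
    """
    if local_mtime is None:
        return True                      # nothing to replace, always fetch
    if policy == "Never":
        return False
    if policy == "Always":
        return True
    if policy == "With Newer":
        remote = parsedate_to_datetime(last_modified_header)
        return remote > local_mtime      # fetch only if the server copy is newer
    raise ValueError(f"unknown policy: {policy}")

header = "Wed, 21 Oct 2015 07:28:00 GMT"
older_local = datetime(2015, 10, 20, tzinfo=timezone.utc)
newer_local = datetime(2015, 10, 22, tzinfo=timezone.utc)

print(should_download("With Newer", older_local, header))
print(should_download("With Newer", newer_local, header))
```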

Site Login Dialog

Use this control to specify when SiteSucker should display the Site Login dialog. For details on how to use this preference, see Downloading Password-protected Sites.

Download Folder

Use this control to select the local Download Folder where files will be saved. By default, the Download Folder is the folder that contains the SiteSucker application. To change the Download Folder, click on this control and select the "Set Download Folder..." menu item. This will display a dialog box. Select a folder in the dialog box and click the Choose button. If you select the "Ask Before Downloading" menu item, SiteSucker will let you choose the Download Folder when you start a download.

Limits

The Limits tab provides the following preferences:

Limit to Level

Use this control to limit the number of levels to recursively download. For example, if you limit downloads to level 2, SiteSucker will only download the initial file and any files linked from that file.
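
One way to picture the level limit is a breadth-first walk in which the initial file is level 1. The sketch below uses a made-up in-memory link table instead of real downloads; the function name crawl and the file names are hypothetical.

```python
from collections import deque

def crawl(start, links, max_level=None):
    """Return files in download order, stopping at max_level.

    links maps a file to the files it links to; max_level=None means no limit.
    """
    seen = {start}
    queue = deque([(start, 1)])          # the initial file is level 1
    order = []
    while queue:
        url, level = queue.popleft()
        order.append(url)
        if max_level is not None and level >= max_level:
            continue                     # do not follow links past the limit
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, level + 1))
    return order

links = {"main.html": ["a.html", "b.html"], "a.html": ["c.html"]}
print(crawl("main.html", links, max_level=1))
print(crawl("main.html", links, max_level=2))
```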

Limit to Directory

Use this control to limit downloaded files to those at a specific site, those within a specific directory, or those containing a specific path. You can choose No Limit, Web URL Host, Web URL Directory, or Paths Preferences.

Suppose you clicked the Download button after entering the following address in the Web URL text box:

http://www.xyz.com/something/anotherthing/main.html

If you had chosen the No Limit option, SiteSucker would try to download this file, every file at www.xyz.com that it links to, every site that the www.xyz.com site links to, every site that these other sites link to, and so on. This could result in a HUGE download if allowed to continue forever.

If you had chosen the Web URL Host option, only those files and directories on the www.xyz.com site would be downloaded.

If you had chosen the Web URL Directory option, only those files and directories within the www.xyz.com/something/anotherthing directory would be downloaded.

If you had chosen the Paths Preferences option, SiteSucker would download the main.html page and any files and directories which it references that have paths that are included in the Paths preferences.
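
The first three options can be sketched as a simple test on each candidate URL. This is an illustration only, with a hypothetical function name; the Paths Preferences option is left out because it depends on the Paths preferences rather than on the Web URL alone.

```python
from urllib.parse import urlsplit

def within_limit(start_url, candidate, limit):
    """True if candidate may be downloaded under the given limit option."""
    start, cand = urlsplit(start_url), urlsplit(candidate)
    if limit == "No Limit":
        return True
    if limit == "Web URL Host":
        return cand.netloc == start.netloc
    if limit == "Web URL Directory":
        # Directory of the starting file, e.g. /something/anotherthing/
        directory = start.path.rsplit("/", 1)[0] + "/"
        return cand.netloc == start.netloc and cand.path.startswith(directory)
    raise ValueError(f"unsupported limit in this sketch: {limit}")

start = "http://www.xyz.com/something/anotherthing/main.html"
print(within_limit(start, "http://www.xyz.com/other/page.html", "Web URL Host"))
print(within_limit(start, "http://www.xyz.com/other/page.html", "Web URL Directory"))
```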

Minimum File Size

Use this control to specify the smallest data file that SiteSucker will download. You can use this feature to keep SiteSucker from downloading Web links, banners, thumbnails, and other small files. This setting does not affect HTML files, which are always downloaded regardless of their size.

Download Attempts

Use this control to select the number of times SiteSucker should try to download a file.

Download Timeout

Use this control to select the length of time that SiteSucker should wait for a response from the server.

Download Delay

Use this control to specify the length of time that SiteSucker should delay before it downloads a file. This feature can allow you to download sites while using very little bandwidth and can help avoid anti-mining safeguards employed by some sites. The delay can be set to None or to a fixed range of values (e.g., 20 - 40 sec). If you select None, SiteSucker downloads the site as quickly as possible. If you select a delay range, SiteSucker will add a random delay (within the selected range) before it downloads a file. Furthermore, if a delay is specified, SiteSucker will only use a single active connection to download files since the whole purpose of using multiple connections is to reduce delays.
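
The delay behavior can be sketched as follows. The function names and the default connection count of 4 are assumptions made for the example; the source does not state how many connections SiteSucker uses when no delay is set.

```python
import random

def pick_delay(delay_range):
    """Seconds to wait before the next file; None stands for the None setting."""
    if delay_range is None:
        return 0.0
    low, high = delay_range
    return random.uniform(low, high)     # random delay within the selected range

def connection_count(delay_range, default_connections=4):
    """With any delay selected, only a single active connection is used."""
    return 1 if delay_range is not None else default_connections

print(pick_delay((20, 40)))         # some value between 20 and 40 seconds
print(connection_count((20, 40)))   # a single connection
```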

File Types

The File Types tab provides the following preferences:

Download all file types

Select this control to have SiteSucker download all files regardless of file type.

Only download files with these extensions

Select this control if you want SiteSucker to only download files having the file extensions that you've specified. Enter the file extensions separated by spaces in the text box below the control (for example, "jpg htm html"). In general, you should always include extensions for HTML files, since SiteSucker needs the hypertext links in order to find the other files that you want. For an exception to this rule, see the note below.

Never download files with these extensions

Select this control if you want SiteSucker to never download files having the file extensions that you've specified. Enter the file extensions separated by spaces in the text box below the control (for example, "zip mov mpg").

Note: SiteSucker does not enforce file type restrictions for the first file downloaded. Among other things, this allows you to download a single HTML file along with all the images on that page even though you've set this preference to only download images.
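
The three File Types choices, together with the exemption for the first file, can be sketched like this. The mode names "all", "only", and "never" are hypothetical labels for the three controls, and the extension lists use the space-separated form shown above.

```python
import posixpath
from urllib.parse import urlsplit

def passes_filter(url, mode, extensions="", is_first_file=False):
    """True if the file type settings permit downloading url."""
    if is_first_file or mode == "all":
        return True                      # the first file is never restricted
    ext = posixpath.splitext(urlsplit(url).path)[1].lstrip(".").lower()
    listed = ext in extensions.lower().split()
    if mode == "only":
        return listed
    if mode == "never":
        return not listed
    raise ValueError(f"unknown mode: {mode}")

print(passes_filter("http://www.xyz.com/movie.mov", "never", "zip mov mpg"))
print(passes_filter("http://www.xyz.com/main.html", "only", "jpg", is_first_file=True))
```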

Paths

The Paths tab lets you specify which paths should be included in or excluded from the download.

Note: SiteSucker allows you to use wildcard characters in the path preferences that you specify. (To differentiate a wildcard character from a character in the path itself, each wildcard character must be escaped, meaning that it must have a '\' character before it.) The "\?" wildcard matches any single character and the "\*" wildcard matches any string.

For example, assuming that the Limit to Level preference is set to No Limit and the Limit to Directory preference is set to Web URL Directory, if you enter

http://www.theregister.com/2004/

in the "Paths to include" text box and

http://www.theregister.com/2004/02/
http://\*/cgi-bin/\*

in the "Paths to exclude" text box and you download

http://www.theregister.com/personal/mac/

SiteSucker will download http://www.theregister.com/personal/mac/index.html and any files that it references which are within the http://www.theregister.com/personal/mac/ or http://www.theregister.com/2004/ directories. It will not download files inside the http://www.theregister.com/2004/02/ directory, nor any file that has "/cgi-bin/" in its path.
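
The example above can be reproduced with a small matcher that turns the escaped wildcards into a regular expression. This is a sketch under one stated assumption: a path preference matches a URL when the URL begins with it. The source does not spell out the exact matching rule, and the function names are hypothetical.

```python
import re

def pattern_to_regex(pattern):
    """Translate "\\?" (any single character) and "\\*" (any string) into a regex;
    every other character is matched literally."""
    parts = []
    i = 0
    while i < len(pattern):
        pair = pattern[i:i + 2]
        if pair == "\\?":
            parts.append(".")
            i += 2
        elif pair == "\\*":
            parts.append(".*")
            i += 2
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))

def matches_any(url, patterns):
    # Assumption: a pattern matches when the URL starts with it.
    return any(pattern_to_regex(p).match(url) for p in patterns)

def allowed(url, start_directory, include, exclude):
    """Inside the Web URL directory or an included path, and not excluded."""
    in_scope = url.startswith(start_directory) or matches_any(url, include)
    return in_scope and not matches_any(url, exclude)

start_dir = "http://www.theregister.com/personal/mac/"
include = ["http://www.theregister.com/2004/"]
exclude = ["http://www.theregister.com/2004/02/", r"http://\*/cgi-bin/\*"]

print(allowed("http://www.theregister.com/2004/01/story.html", start_dir, include, exclude))
print(allowed("http://www.theregister.com/2004/02/story.html", start_dir, include, exclude))
```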