Robot control file parser

org.CL4.Robots is a free open-source .NET library for use by Internet robots. Currently, it can parse robot control files (robots.txt) and determine which Web resources a robot is allowed to access. The implementation is a slightly more permissive version of Martijn Koster's 1996 RFC Draft Memo on Web Robots Control, with optional support for the blank Disallow lines of the current 1994 document (A Standard for Robot Exclusion), so it is essentially backwards-compatible. It also supports Googlebot's non-standard wildcard paths, such as /*.doc.
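To make the wildcard paths mentioned above concrete, here is a minimal C# sketch of how a pattern such as /*.doc could be matched against a request path. It is not the library's implementation: the WildcardSketch class and its PathMatches helper are hypothetical names, and the sketch assumes that a rule path is matched as a prefix in which * stands for any character sequence.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical illustration; not part of org.CL4.Robots.
public static class WildcardSketch
{
    // Returns true if requestPath begins with text matching rulePath,
    // where each '*' in rulePath stands for any (possibly empty) character sequence.
    public static bool PathMatches(string rulePath, string requestPath)
    {
        // Escape regex metacharacters, then turn the escaped asterisks into ".*".
        string pattern = "^" + Regex.Escape(rulePath).Replace(@"\*", ".*");
        return Regex.IsMatch(requestPath, pattern, RegexOptions.Singleline);
    }

    public static void Main()
    {
        Console.WriteLine(PathMatches("/*.doc", "/files/report.doc"));      // True
        Console.WriteLine(PathMatches("/*.doc", "/files/report.pdf"));      // False
        Console.WriteLine(PathMatches("/private/", "/private/notes.html")); // True
    }
}
```

Because the match is anchored only at the start, /*.doc in this sketch also covers paths such as /files/report.docx. Whether the library behaves the same way is something to confirm in its source code.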
This library is not only free but public domain, so you can use, modify and distribute it as you wish.

Usage example

The main class, RobotsFileParser, has various constructors for robots.txt files that are on the local machine, on the Web, or already cached in memory, but this is the simplest usage:

    using org.CL4.Robots;

There follows a description of the public members of the RobotsFileParser class. For more complete documentation, and to see which methods throw which exceptions, you will have to read the (liberally commented) source code.

Public properties

AllRestricted gets a value indicating whether access to the current site is completely forbidden because access to the robot control file was forbidden.
Options gets or sets a value indicating how robot control files should be parsed. This is a bitwise combination of the flags in the ParseOptions enumeration, described below.
SiteBase gets the URI of the Web site whose robot control file has been parsed. The URI includes the scheme and authority but no path segments. This would usually correspond to the Web site's home page.

Public methods

Clear resets the parser to its original empty state.
FindMatchingUserAgent finds the user agent token from the robot control file that applies to a named robot, or the value of TOKEN_ALL_USER_AGENTS if there is no token match, or null if there is no token match and TOKEN_ALL_USER_AGENTS was not present in the robot control file.
GetRobotsFileUri returns the URI at which the robot control file for a Web site is expected to reside. You do not need to call this method.
IsAllowed determines whether a named user agent is permitted to access the resource at a specified URI. You must have parsed the robot control file before this method will succeed.
IsRfc1808Path determines whether a string is a valid "path", as defined by RFC 1808. You do not need to call this method.
IsRfc1945Token determines whether a string is a token, as defined by RFC 1945. You do not need to call this method.
Parse reads and interprets the contents of a robot control file. There are several overloaded versions: you can pass the Uri of any page on the Web site, the FileInfo of a local file, or a string[] array of lines in memory. You can also call the constructor with any of these items and have the file parsed automatically.

Public constants

ROBOTS_FILE_NAME is the filename used for robot control files. Its value is "robots.txt".
TOKEN_ALL_USER_AGENTS is the user agent token that represents all user agents. Its value is "*".

ParseOptions enumeration

The ParseOptions enumeration is used to set the Options property of RobotsFileParser, and consists of a bitwise combination of flags. There are three special values: None (no options set), All (all options set), and Defaults (the same as All).

IgnoreFieldNameCase means that field names in robot control files (such as "Allow" and "User-agent") are accepted in any combination of upper and lower case, not just the standard casing.
AcceptBlankDisallow means that blank Disallow lines are understood to mean that the user agent is allowed to access any URI on the site, as defined in A Standard for Robot Exclusion (1994) but not the 1996 RFC Draft Memo.
AsteriskWildcard means that the asterisk character (*) in robot control files should be interpreted as a wildcard representing any character sequence. Wildcard matching is not part of the robot control standard, under which * is a valid path character, but it is supported by Googlebot and possibly others.

Exceptions

All library-specific exceptions inherit from the abstract RobotException.

ContentTypeException is thrown when a robot control file on the Web has an inappropriate content type (i.e. not text/plain). The ContentType property holds the content type that was found.
DownloadFailedException is thrown when a robot control file cannot be downloaded due to a transfer problem (not just an HTTP status code indicating failure). The Address property holds the URI of the file that could not be downloaded.
InvalidUserAgentException is thrown when an invalid user agent token is specified. (If the token occurred in a robots.txt file, however, it is simply truncated.) User agent tokens should comply with RFC 1945. The UserAgent property holds the offending token.
SiteMismatchException is thrown when the specified URI does not match the site to which the robot control file referred, for example when you parse the robot control file for http://CL4.org/ and then ask whether http://www.CL4.org/secret/ is accessible. (Adding the www makes it potentially a different site.)
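The site comparison behind SiteMismatchException can be approximated with the standard System.Uri class. The sketch below is an assumption about the idea, not the library's code: SiteMatchSketch and SameSite are hypothetical names, and the helper compares only scheme, host and port, which is roughly the information the SiteBase property is described as holding.

```csharp
using System;

// Hypothetical illustration; not part of org.CL4.Robots.
public static class SiteMatchSketch
{
    // Compares scheme, host and port only, ignoring path, query and fragment.
    public static bool SameSite(Uri a, Uri b)
    {
        return Uri.Compare(a, b, UriComponents.SchemeAndServer,
                           UriFormat.Unescaped, StringComparison.OrdinalIgnoreCase) == 0;
    }

    public static void Main()
    {
        Uri parsedSite = new Uri("http://CL4.org/");
        Uri samePage   = new Uri("http://CL4.org/docs/index.html");
        Uri otherSite  = new Uri("http://www.CL4.org/secret/");

        Console.WriteLine(SameSite(parsedSite, samePage));   // True
        Console.WriteLine(SameSite(parsedSite, otherSite));  // False: the www host differs

        // Scheme and authority with no path segments, in the spirit of SiteBase.
        Console.WriteLine(parsedSite.GetLeftPart(UriPartial.Authority));
    }
}
```

A robot that wants to avoid the exception could run a check of this kind before asking about a URI, and parse the other site's robot control file when the check fails.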