This guide will show you how to quickly configure and run your own crawler.
Before you begin using SmartCrawler, you need to define the SMARTCRAWLER_HOME environment variable, which must point to the directory where you unpacked the install archive. You also need to add $SMARTCRAWLER_HOME/bin to your path so that you can run SmartCrawler from any directory on your filesystem.
To do this under Windows 2000 and Windows XP, open the Control Panel and then the System panel. Under the Advanced tab, click the Environment Variables button. Create a new user variable named SMARTCRAWLER_HOME and set its value to the SmartCrawler installation path. Then create the PATH variable (or edit it if it already exists) and append %SMARTCRAWLER_HOME%\bin to it (e.g. PATH=%PATH%;%SMARTCRAWLER_HOME%\bin).
A crawler needs to be configured. A good starting point is the default configuration file smartcrawler-config.xml (you can find it in $SMARTCRAWLER_HOME/bin/conf):
<?xml version="1.0" encoding="UTF-8"?>
<smartcrawler>
    <engine>
        <threadsNumber>5</threadsNumber>
    </engine>
    <loggers>
        <logger type="TRACER" active="no"/>
        <logger type="ACCESS" active="no"/>
        <logger type="LINK" active="no"/>
        <logger type="PERMISSIONS" active="no"/>
        <logger type="EXTRACTOR" active="no"/>
        <logger type="CONSOLE" active="yes"/>
        <logger type="PERSISTER" active="no"/>
        <logger type="PROVIDER" active="no"/>
    </loggers>
    <retriever>
        <class>org.smartcrawler.retriever.MultiThreadHttpCallRetriever</class>
        <filters>
            <filter>
                <name>DefaultLinkFilter</name>
                <class>org.smartcrawler.filter.DefaultLinkFilter</class>
                <priority>1</priority>
            </filter>
        </filters>
    </retriever>
    <persister>
        <class>org.smartcrawler.persistence.FileSystemPersister</class>
        <persister-params>
            <persister-param>
                <param-name>preservePath</param-name>
                <param-value>true</param-value>
            </persister-param>
            <persister-param>
                <param-name>rootDir</param-name>
                <param-value>.</param-value>
            </persister-param>
        </persister-params>
    </persister>
</smartcrawler>
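In the default file only the CONSOLE logger is active. As a small sketch, and assuming that active="yes" switches a logger on in the same way it does for CONSOLE, the loggers section could be edited like this to also enable the LINK and PERSISTER loggers. What each logger actually records is not described here, so treat the choice of loggers as illustrative only.

```xml
<!-- Sketch of an edited loggers section; only the "active" attributes
     of LINK and PERSISTER differ from the default configuration. -->
<loggers>
    <logger type="TRACER" active="no"/>
    <logger type="ACCESS" active="no"/>
    <!-- Assumption: "yes" enables the logger, as it does for CONSOLE. -->
    <logger type="LINK" active="yes"/>
    <logger type="PERMISSIONS" active="no"/>
    <logger type="EXTRACTOR" active="no"/>
    <logger type="CONSOLE" active="yes"/>
    <logger type="PERSISTER" active="yes"/>
    <logger type="PROVIDER" active="no"/>
</loggers>
```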
If you want to perform a simple, standard crawl (using the default configuration), you can launch SmartCrawler by typing:
smartcrawler.bat startingUrl
Where startingUrl is a valid HTTP URL from which the crawler will begin fetching content.
Otherwise, if you want to perform a custom crawl (using your own configuration), launch SmartCrawler by typing:
smartcrawler.bat startingUrl myConfigFile
Where startingUrl is a valid HTTP URL from which the crawler will begin fetching content, and myConfigFile is your custom XML configuration file.
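As an illustration of what a custom configuration file might look like, the sketch below copies the default smartcrawler-config.xml and changes two values: the number of threads and the persister's rootDir. The file name my-config.xml, the value 10, and the directory name crawl-output are hypothetical. It also assumes that threadsNumber controls the number of crawling threads and that rootDir is the directory where fetched content is stored, as the names suggest.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- my-config.xml (hypothetical name): a copy of the default
     configuration with two values changed. -->
<smartcrawler>
    <engine>
        <!-- Assumption: raises the thread count from the default of 5. -->
        <threadsNumber>10</threadsNumber>
    </engine>
    <loggers>
        <logger type="TRACER" active="no"/>
        <logger type="ACCESS" active="no"/>
        <logger type="LINK" active="no"/>
        <logger type="PERMISSIONS" active="no"/>
        <logger type="EXTRACTOR" active="no"/>
        <logger type="CONSOLE" active="yes"/>
        <logger type="PERSISTER" active="no"/>
        <logger type="PROVIDER" active="no"/>
    </loggers>
    <retriever>
        <class>org.smartcrawler.retriever.MultiThreadHttpCallRetriever</class>
        <filters>
            <filter>
                <name>DefaultLinkFilter</name>
                <class>org.smartcrawler.filter.DefaultLinkFilter</class>
                <priority>1</priority>
            </filter>
        </filters>
    </retriever>
    <persister>
        <class>org.smartcrawler.persistence.FileSystemPersister</class>
        <persister-params>
            <persister-param>
                <param-name>preservePath</param-name>
                <param-value>true</param-value>
            </persister-param>
            <persister-param>
                <param-name>rootDir</param-name>
                <!-- Hypothetical output directory; the default is "." -->
                <param-value>crawl-output</param-value>
            </persister-param>
        </persister-params>
    </persister>
</smartcrawler>
```

With such a file saved in the current directory, the crawl could then be started with smartcrawler.bat http://www.example.com my-config.xml, where the URL is only a placeholder.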