Starting up

This guide will show you how to quickly configure and run your own crawler.

Step 1. Environment Configuration

Before you begin using SmartCrawler, define the SMARTCRAWLER_HOME environment variable and set it to the directory where you unpacked the install archive.

You will also need to add $SMARTCRAWLER_HOME/bin to your PATH so that you can run SmartCrawler from any directory on your filesystem.

To do this under Windows 2000 and Windows XP, open the Control Panel and then the System panel. On the Advanced tab, click the Environment Variables button. Create a new user variable named SMARTCRAWLER_HOME whose value is the SmartCrawler directory, then create the PATH variable (or edit it if it already exists) so that it includes %SMARTCRAWLER_HOME%\bin (e.g. PATH=%PATH%;%SMARTCRAWLER_HOME%\bin).
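As a quick sanity check of the setup described above, the following Python sketch reports whether SMARTCRAWLER_HOME is defined and whether its bin directory appears on PATH. Python is not part of SmartCrawler, and the function name check_environment is made up for this illustration.

```python
import os


def check_environment(env=None):
    """Return a list of problems found in the SmartCrawler environment setup.

    `env` is a mapping of environment variables; it defaults to os.environ.
    An empty list means both checks passed.
    """
    env = os.environ if env is None else env
    problems = []

    home = env.get("SMARTCRAWLER_HOME")
    if not home:
        problems.append("SMARTCRAWLER_HOME is not set")
        return problems

    # Normalise paths so trailing separators and case (on Windows) do not matter.
    def norm(path):
        return os.path.normcase(os.path.normpath(path))

    expected = norm(os.path.join(home, "bin"))
    entries = [norm(p) for p in env.get("PATH", "").split(os.pathsep) if p]
    if expected not in entries:
        problems.append("%s is not on PATH" % os.path.join(home, "bin"))

    return problems


if __name__ == "__main__":
    for problem in check_environment() or ["environment looks fine"]:
        print(problem)
```

Run the script from a newly opened command prompt, since variables changed in the System panel are only picked up by programs started afterwards.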

Step 2. Crawling Method Configuration

A crawler needs to be configured. A good starting point is the default configuration file, smartcrawler-config.xml, which you can find in $SMARTCRAWLER_HOME/bin/conf:

<?xml version="1.0" encoding="UTF-8"?>
<smartcrawler>

<engine>
    <threadsNumber>5</threadsNumber>
</engine>

<loggers>
    <logger type="TRACER" active="no"/>
    <logger type="ACCESS" active="no"/>
    <logger type="LINK" active="no"/>
    <logger type="PERMISSIONS" active="no"/>
    <logger type="EXTRACTOR" active="no"/>
    <logger type="CONSOLE" active="yes"/>
    <logger type="PERSISTER" active="no"/>
    <logger type="PROVIDER" active="no"/>
</loggers>

<retriever>
    <class>org.smartcrawler.retriever.MultiThreadHttpCallRetriever</class>
    <filters>
        <filter>
            <name>DefaultLinkFilter</name>
            <class>org.smartcrawler.filter.DefaultLinkFilter</class>
            <priority>1</priority>
        </filter>
    </filters>
</retriever>
    
<persister>
    <class>org.smartcrawler.persistence.FileSystemPersister</class>
    <persister-params>
        <persister-param>
            <param-name>preservePath</param-name>
            <param-value>true</param-value>
        </persister-param>
        <persister-param>
            <param-name>rootDir</param-name>
            <param-value>.</param-value>
        </persister-param>
    </persister-params>
</persister>
    

</smartcrawler>
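A custom configuration can start as a copy of the default file with a few values changed. The sketch below does this with Python's standard xml.etree.ElementTree module, rewriting the threadsNumber element and the rootDir persister parameter. The function name, the output file name, and the example values are hypothetical. What each setting does is inferred from its name rather than from SmartCrawler documentation.

```python
import xml.etree.ElementTree as ET


def customize_config(src_path, dst_path, threads, root_dir):
    """Copy a SmartCrawler config, changing the thread count and the rootDir param."""
    tree = ET.parse(src_path)
    root = tree.getroot()

    # <engine><threadsNumber> holds the thread count in the default file.
    root.find("engine/threadsNumber").text = str(threads)

    # Persister parameters are name/value pairs; update the one named rootDir.
    for param in root.iter("persister-param"):
        if param.findtext("param-name") == "rootDir":
            param.find("param-value").text = root_dir

    # Write the modified tree to a new file, leaving the default untouched.
    tree.write(dst_path, encoding="UTF-8", xml_declaration=True)
```

For example, customize_config("smartcrawler-config.xml", "my-config.xml", 10, "crawl-output") would produce a my-config.xml that differs from the default only in those two values.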

Step 3. Running the Crawler

To perform a simple, standard crawl using the default configuration, launch SmartCrawler by typing:

smartcrawler.bat startingUrl

Where startingUrl is the valid HTTP URL from which the crawler will begin fetching content.

To perform a custom crawl using your own configuration, launch SmartCrawler by typing:

smartcrawler.bat startingUrl myConfigFile

Where startingUrl is the valid HTTP URL from which the crawler will begin fetching content, and myConfigFile is your custom XML configuration file.
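If you would rather start crawls from a script, the two invocation forms above can be wrapped as in the following Python sketch. It only assembles the same command line and hands it to the launcher. The function names are hypothetical, accepting https URLs is an assumption (this guide only mentions http), and nothing is assumed about what the launcher's exit code means.

```python
import shutil
import subprocess
from urllib.parse import urlparse


def build_command(starting_url, config_file=None, launcher="smartcrawler.bat"):
    """Build the argument list for the two invocation forms described above."""
    parsed = urlparse(starting_url)
    # The guide asks for a valid http URL; accepting https too is an assumption.
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError("not an http URL: %r" % starting_url)

    command = [launcher, starting_url]
    if config_file is not None:
        command.append(config_file)  # custom XML configuration file
    return command


def run_crawler(starting_url, config_file=None):
    """Run SmartCrawler and return the launcher's exit code."""
    command = build_command(starting_url, config_file)
    # Resolve the launcher through PATH, which Step 1 extended with the bin directory.
    launcher = shutil.which(command[0])
    if launcher is None:
        raise FileNotFoundError("smartcrawler.bat was not found on PATH")
    return subprocess.run([launcher] + command[1:]).returncode
```

Calling run_crawler("http://example.com/") corresponds to the default-configuration form, and run_crawler("http://example.com/", "my-config.xml") to the custom one; both the URL and the file name are placeholders.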