java.lang.Object
  org.apache.any23.plugin.crawler.SiteCrawler
public class SiteCrawler
A basic site crawler to extract semantic content of small/medium size sites.
| Field Summary | | |
|---|---|---|
| static int | DEFAULT_NUM_OF_CRAWLERS | Default number of crawler instances. |
| static String | DEFAULT_PAGE_FILTER_RE | |
| static Class<? extends edu.uci.ics.crawler4j.crawler.WebCrawler> | DEFAULT_WEB_CRAWLER | Default crawler implementation. |
| Pattern | defaultFilters | Default filter applied to skip contents. |
| Constructor Summary | |
|---|---|
| SiteCrawler(File storageFolder) | Constructor. |
| Method Summary | | |
|---|---|---|
| void | addListener(CrawlerListener listener) | Registers a CrawlerListener to this crawler. |
| int | getMaxDepth() | |
| int | getMaxPages() | |
| int | getNumOfCrawlers() | |
| int | getPolitenessDelay() | |
| Class<? extends edu.uci.ics.crawler4j.crawler.WebCrawler> | getWebCrawler() | |
| void | removeListener(CrawlerListener listener) | Deregisters a CrawlerListener from this crawler. |
| void | setMaxDepth(int maxDepth) | Sets the maximum depth. |
| void | setMaxPages(int maxPages) | Sets the maximum collected pages. |
| void | setNumOfCrawlers(int n) | Sets the number of crawler instances. |
| void | setPolitenessDelay(int millis) | Sets the politeness delay. |
| void | setWebCrawler(Class<? extends edu.uci.ics.crawler4j.crawler.WebCrawler> c) | Sets the actual crawler class. |
| void | start(URL seed, boolean wait) | Starts the crawler process with the defaultFilters. |
| void | start(URL seed, Pattern filters, boolean wait) | Starts the crawling process. |
| void | stop() | Interrupts the crawler process if started with wait flag == false. |
| Methods inherited from class java.lang.Object |
|---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
| Field Detail |
|---|
public static final String DEFAULT_PAGE_FILTER_RE
public static final int DEFAULT_NUM_OF_CRAWLERS
public static final Class<? extends edu.uci.ics.crawler4j.crawler.WebCrawler> DEFAULT_WEB_CRAWLER
public final Pattern defaultFilters
| Constructor Detail |
|---|
public SiteCrawler(File storageFolder)
Parameters:
storageFolder - location used to store the temporary data structures used by the crawler.
| Method Detail |
|---|
public int getNumOfCrawlers()

public void setNumOfCrawlers(int n)
Sets the number of crawler instances.
Parameters:
n - an integer >= 0.

public Class<? extends edu.uci.ics.crawler4j.crawler.WebCrawler> getWebCrawler()

public void setWebCrawler(Class<? extends edu.uci.ics.crawler4j.crawler.WebCrawler> c)
Sets the actual crawler class.
Parameters:
c - a non-null class.

public int getMaxDepth()
-1 means no limit.

public void setMaxDepth(int maxDepth)
Sets the maximum depth.
Parameters:
maxDepth - maximum allowed depth. -1 means no limit.

public int getMaxPages()

public void setMaxPages(int maxPages)
Sets the maximum collected pages.
Parameters:
maxPages - maximum allowed pages. -1 means no limit.

public int getPolitenessDelay()
public void setPolitenessDelay(int millis)
Sets the politeness delay. -1 means no politeness.
Parameters:
millis - delay in milliseconds.

public void addListener(CrawlerListener listener)
Registers a CrawlerListener to this crawler.
Parameters:
listener -

public void removeListener(CrawlerListener listener)
Deregisters a CrawlerListener from this crawler.
Parameters:
listener -
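
The setters above use -1 as a "no limit" value for the maximum depth and the maximum number of pages. The following is a minimal sketch of how such a sentinel can be checked. `CrawlLimits`, `NO_LIMIT` and `withinLimit` are hypothetical names introduced for illustration only; they are not part of Any23 or crawler4j, and treating a value equal to the limit as still allowed is an assumption.

```java
// Hypothetical helper, not part of Any23 or crawler4j. It only illustrates
// the "-1 means no limit" convention documented for maxDepth and maxPages.
final class CrawlLimits {

    /** Sentinel meaning "no limit", as stated in the method details. */
    static final int NO_LIMIT = -1;

    private CrawlLimits() {
    }

    /**
     * Returns true when value does not exceed limit.
     * A limit of -1 disables the check entirely.
     * Assumption: a value equal to the limit is still accepted.
     */
    static boolean withinLimit(int value, int limit) {
        return limit == NO_LIMIT || value <= limit;
    }

    public static void main(String[] args) {
        System.out.println(withinLimit(3, 5));                // depth 3, limit 5
        System.out.println(withinLimit(6, 5));                // depth 6, limit 5
        System.out.println(withinLimit(1_000_000, NO_LIMIT)); // no limit set
    }
}
```

With this reading, a crawler configured with setMaxDepth(-1) never rejects a page because of its depth.
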
public void start(URL seed, Pattern filters, boolean wait) throws Exception
Starts the crawling process.
Parameters:
seed - the starting URL for the crawler process.
filters - filters to be applied to the crawler process. Can be null.
wait - if true the process will wait for the crawler termination.
Throws:
Exception
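
The filters argument is a java.util.regex.Pattern that may be null. The sketch below shows one way such a filter could be evaluated against a URL. It rests on several assumptions: that a matching URL is skipped (the field summary describes defaultFilters as a filter "applied to skip contents"), that the whole URL must match, and that null disables filtering. `PageFilterSketch`, `SKIP_STATIC_RESOURCES` and `shouldVisit` are hypothetical names, and the regular expression is an example only; the actual value of DEFAULT_PAGE_FILTER_RE is not shown on this page.

```java
import java.util.regex.Pattern;

// Hypothetical sketch, not the SiteCrawler implementation.
final class PageFilterSketch {

    // Example expression only; the real DEFAULT_PAGE_FILTER_RE is not given here.
    static final Pattern SKIP_STATIC_RESOURCES = Pattern.compile(
            ".*\\.(css|js|gif|jpe?g|png|pdf|zip)$", Pattern.CASE_INSENSITIVE);

    private PageFilterSketch() {
    }

    /**
     * Returns true when the URL should be crawled.
     * Assumptions: a null filter accepts everything, and a URL that
     * fully matches the filter is skipped.
     */
    static boolean shouldVisit(String url, Pattern filters) {
        return filters == null || !filters.matcher(url).matches();
    }

    public static void main(String[] args) {
        System.out.println(shouldVisit("http://example.org/index.html", SKIP_STATIC_RESOURCES));
        System.out.println(shouldVisit("http://example.org/logo.PNG", SKIP_STATIC_RESOURCES));
        System.out.println(shouldVisit("http://example.org/logo.PNG", null));
    }
}
```

A Pattern built this way could be passed as the filters argument of start(URL, Pattern, boolean); whether SiteCrawler matches the full URL or only a part of it should be checked against the Any23 sources.
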
public void start(URL seed, boolean wait) throws Exception
Starts the crawler process with the defaultFilters.
Parameters:
seed - the starting URL for the crawler process.
wait - if true the process will wait for the crawler termination.
Throws:
Exception

public void stop()
Interrupts the crawler process if started with wait flag == false.
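
The wait flag distinguishes a blocking start from a background start that can later be interrupted with stop(). The sketch below illustrates that general pattern with a plain thread. It is not the SiteCrawler implementation: `BackgroundRunSketch` and its methods are hypothetical, and the use of Thread.interrupt() is an assumption about how an interruption could be delivered.

```java
// Hypothetical illustration of blocking vs. non-blocking start.
// This is not how SiteCrawler or crawler4j is implemented.
final class BackgroundRunSketch {

    private volatile Thread worker;

    /** Starts the task; when wait is true, returns only after it has finished. */
    void start(Runnable task, boolean wait) {
        Thread t = new Thread(task, "crawl-sketch");
        worker = t;
        t.start();
        if (wait) {
            join(t); // blocking mode: the caller waits for termination
        }
    }

    /** Interrupts a task started with wait == false and waits for it to exit. */
    void stop() {
        Thread t = worker;
        if (t != null) {
            t.interrupt();
            join(t);
        }
    }

    /** Returns true while the most recently started task is still running. */
    boolean isRunning() {
        Thread t = worker;
        return t != null && t.isAlive();
    }

    private static void join(Thread t) {
        try {
            t.join();
        } catch (InterruptedException e) {
            // Preserve the caller's interrupt status instead of swallowing it.
            Thread.currentThread().interrupt();
        }
    }
}
```

In blocking mode the caller has nothing left to stop once start returns, which is consistent with stop() being documented only for the wait == false case.
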