crawl(url, max_depth)

The crawl() function is a SQL extension that performs web scraping directly within a database query. Used in the FROM clause, it starts at a given URL, follows links up to max_depth, and returns one row per crawled page with its url and depth.

Function Signature

crawl(url, max_depth, [additional_parameters])

Parameters

| Parameter | Type | Optional | Description | Possible Values | Sample Value |
|---|---|---|---|---|---|
| url | String | No | The starting URL for the web scraping operation | Any valid URL | 'https://example.com' |
| max_depth | Int | No | Maximum depth of links to follow from the starting URL | Any non-negative integer | 2 |
| disallow_domains | Array(String) | Yes | Set of domains to exclude from scraping | Array of domain names | ['ads.example.com'] |
| max_concurrent_requests | Int | Yes | Maximum number of concurrent HTTP requests | Any positive integer | 10 |
| skip_non_successful_responses | Bool | Yes | Whether to skip pages that don't return a successful HTTP status | true, false | true |
| allowed_domains | Array(String) | Yes | Set of domains to restrict scraping to | Array of domain names | ['example.com','blog.example.com'] |
| request_delay_ms | Int | Yes | Delay between requests in milliseconds | Any non-negative integer | 1000 |
| robots | Bool | Yes | Whether to respect robots.txt rules | true, false | false |
| javascript | Bool | Yes | Whether to execute JavaScript on the page | true, false | true |
| selector | String | Yes | CSS selector to extract specific elements from the page | Any valid CSS selector | '#menu-item' |
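
The optional parameters can be combined in a single call. The following is a minimal sketch, assuming they are passed with the same `name => value` syntax that the usage examples on this page use for max_depth, robots, and allowed_domains. The values are the sample values from the table, chosen for illustration only, and the query has not been run against a live site.

```sql
-- Sketch: a rate-limited crawl restricted to two domains, with an ad subdomain excluded.
-- Parameter names come from the parameter table; the values are illustrative.
SELECT *
FROM crawl(
    'https://example.com',
    max_depth => 2,
    allowed_domains => ['example.com', 'blog.example.com'],
    disallow_domains => ['ads.example.com'],
    max_concurrent_requests => 10,
    request_delay_ms => 1000,               -- wait 1000 ms between requests
    skip_non_successful_responses => true,  -- skip pages without a successful HTTP status
    robots => true                          -- respect robots.txt rules
);
```

Setting request_delay_ms and robots this way favors politeness toward the target site over crawl speed; how the delay interacts with max_concurrent_requests is not specified here.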

Usage Examples

  • Basic usage with default parameters:
SELECT * FROM crawl('https://langdb.ai/docs/langdb/', max_depth => 2);
+---------------------------------------------------------------------------------------------------+-------+
| url | depth |
+===========================================================================================================+
| https://favcsepnch.execute-api.ap-southeast-1.amazonaws.com/js?url=https://langdb.ai/docs/langdb/ | 1 |
|---------------------------------------------------------------------------------------------------+-------|
| https://langdb.ai/docs/langdb/#key-benefits | 2 |
|---------------------------------------------------------------------------------------------------+-------|
| https://langdb.ai/docs/langdb/#design-goals | 2 |
|---------------------------------------------------------------------------------------------------+-------|
| https://langdb.ai/docs/langdb/#why-sql | 2 |
|---------------------------------------------------------------------------------------------------+-------|
| https://langdbcommunity.slack.com/join/shared_invite/zt-2haf5kj6a-d7NX6TFJUPX45w~Ag4dzlg | 2 |
|---------------------------------------------------------------------------------------------------+-------|
| https://langdb.ai/docs/langdb/#combining-vector-and-relational-queries-with-ai-functions | 2 |
|---------------------------------------------------------------------------------------------------+-------|
| https://langdb.ai/docs/langdb/#deployment | 2 |
|---------------------------------------------------------------------------------------------------+-------|
| https://langdb.ai/docs/langdb/#security-is-key-for-rag | 2 |
|---------------------------------------------------------------------------------------------------+-------|
| https://langdb.ai/docs/langdb/#data-flow-and-ai-integration | 2 |
|---------------------------------------------------------------------------------------------------+-------|
| https://langdb.ai/docs/langdb/#use-cases | 2 |
|---------------------------------------------------------------------------------------------------+-------|
| https://github.com/langdb/langdb-docs/tree/main/docs/langdb.mdx | 2 |
+---------------------------------------------------------------------------------------------------+-------+
  • Advanced usage with custom parameters:

    SELECT * FROM crawl(
    'https://www.sportas.lt/rubrika/naujienos/ivartis/lietuva',
    max_depth => 2,
    robots => false,
    allowed_domains => ['www.sportas.lt', 'basketnews.lt'],
    selector => '#menu-item-id-969 .sub-menu-item > a',
    javascript => false
    );
    +----------------------------------------------------------------------+-------+
    | url | depth |
    +==============================================================================+
    | https://www.sportas.lt/rubrika/naujienos/ivartis/lietuva | 1 |
    |----------------------------------------------------------------------+-------|
    | https://www.sportas.lt/rubrika/naujienos/tenisas/grand-slam | 2 |
    |----------------------------------------------------------------------+-------|
    | https://www.sportas.lt/rubrika/naujienos/tenisas/atp-world-tour | 2 |
    |----------------------------------------------------------------------+-------|
    | https://www.sportas.lt/rubrika/naujienos/tenisas/davis-cup | 2 |
    |----------------------------------------------------------------------+-------|
    | https://www.sportas.lt/rubrika/naujienos/tenisas/atp-challenger-tour | 2 |
    |----------------------------------------------------------------------+-------|
    | https://www.sportas.lt/rubrika/naujienos/tenisas/kita | 2 |
    +----------------------------------------------------------------------+-------+
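
Both examples return a url column and a depth column, so the result can presumably be filtered like any other table. The following is a minimal sketch, assuming crawl() output accepts ordinary WHERE and ORDER BY clauses; the examples above do not confirm this.

```sql
-- Sketch: keep only the pages discovered one link away from the starting page.
-- In the sample output above the starting URL is reported at depth 1,
-- so pages linked from it appear at depth 2.
SELECT url
FROM crawl('https://langdb.ai/docs/langdb/', max_depth => 2)
WHERE depth = 2
ORDER BY url;
```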