crawl(file_url, max_depth)
The crawl() function is a SQL extension that lets you run web scraping operations directly within your database queries. It crawls web pages starting from a given URL, follows links up to max_depth, and returns the crawled pages as rows.
Function Signature
crawl(url, [additional_parameters])
Parameters
Parameter | Type | Optional | Description | Possible Values | Sample Value |
---|---|---|---|---|---|
url | String | No | The starting URL for the web scraping operation | Any valid URL | 'https://example.com' |
max_depth | Int | No | Maximum depth of links to follow from the starting URL | Any non-negative integer | 2 |
disallow_domains | Array(String) | Yes | Set of domains to exclude from scraping | Array of domain names | ['ads.example.com'] |
max_concurrent_requests | Int | Yes | Maximum number of concurrent HTTP requests | Any positive integer | 10 |
skip_non_successful_responses | Bool | Yes | Whether to skip pages that don't return a successful HTTP status | true, false | true |
allowed_domains | Array(String) | Yes | Set of domains to restrict scraping to | Array of domain names | ['example.com','blog.example.com'] |
request_delay_ms | Int | Yes | Delay between requests in milliseconds | Any non-negative integer | 1000 |
robots | Bool | Yes | Whether to respect robots.txt rules | true, false | false |
javascript | Bool | Yes | Whether to execute JavaScript on the page | true, false | true |
selector | String | Yes | CSS selector to extract specific elements from the page | Any valid CSS selector | '#menu-item' |
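The optional parameters in the table can be combined in a single call. The following is a minimal sketch, assuming the `name => value` named-argument syntax used in the usage examples; the URL and values are simply the sample values from the table, not recommendations.

```sql
-- Sketch only: combines several optional parameters from the table above.
-- All values are the table's sample values, used here as placeholders.
SELECT *
FROM crawl(
  'https://example.com',
  max_depth => 2,
  disallow_domains => ['ads.example.com'],   -- never follow links into this domain
  max_concurrent_requests => 10,             -- cap on parallel HTTP requests
  request_delay_ms => 1000,                  -- wait 1 second between requests
  skip_non_successful_responses => true      -- drop pages without a successful status
);
```

Setting a request delay and a concurrency cap together keeps the crawl from overloading the target site.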
Usage Examples
- Basic usage with default parameters:
SELECT * FROM crawl('https://langdb.ai/docs/langdb/', max_depth=>2);
+---------------------------------------------------------------------------------------------------+-------+
| url | depth |
+===========================================================================================================+
| https://favcsepnch.execute-api.ap-southeast-1.amazonaws.com/js?url=https://langdb.ai/docs/langdb/ | 1 |
|---------------------------------------------------------------------------------------------------+-------|
| https://langdb.ai/docs/langdb/#key-benefits | 2 |
|---------------------------------------------------------------------------------------------------+-------|
| https://langdb.ai/docs/langdb/#design-goals | 2 |
|---------------------------------------------------------------------------------------------------+-------|
| https://langdb.ai/docs/langdb/#why-sql | 2 |
|---------------------------------------------------------------------------------------------------+-------|
| https://langdbcommunity.slack.com/join/shared_invite/zt-2haf5kj6a-d7NX6TFJUPX45w~Ag4dzlg | 2 |
|---------------------------------------------------------------------------------------------------+-------|
| https://langdb.ai/docs/langdb/#combining-vector-and-relational-queries-with-ai-functions | 2 |
|---------------------------------------------------------------------------------------------------+-------|
| https://langdb.ai/docs/langdb/#deployment | 2 |
|---------------------------------------------------------------------------------------------------+-------|
| https://langdb.ai/docs/langdb/#security-is-key-for-rag | 2 |
|---------------------------------------------------------------------------------------------------+-------|
| https://langdb.ai/docs/langdb/#data-flow-and-ai-integration | 2 |
|---------------------------------------------------------------------------------------------------+-------|
| https://langdb.ai/docs/langdb/#use-cases | 2 |
|---------------------------------------------------------------------------------------------------+-------|
| https://github.com/langdb/langdb-docs/tree/main/docs/langdb.mdx | 2 |
+---------------------------------------------------------------------------------------------------+-------+
- Advanced usage with custom parameters:
SELECT * FROM crawl(
  'https://www.sportas.lt/rubrika/naujienos/ivartis/lietuva',
  max_depth => 2,
  robots => false,
  allowed_domains => ['www.sportas.lt', 'basketnews.lt'],
  selector => '#menu-item-id-969 .sub-menu-item > a',
  javascript => false
);
+----------------------------------------------------------------------+-------+
| url | depth |
+==============================================================================+
| https://www.sportas.lt/rubrika/naujienos/ivartis/lietuva | 1 |
|----------------------------------------------------------------------+-------|
| https://www.sportas.lt/rubrika/naujienos/tenisas/grand-slam | 2 |
|----------------------------------------------------------------------+-------|
| https://www.sportas.lt/rubrika/naujienos/tenisas/atp-world-tour | 2 |
|----------------------------------------------------------------------+-------|
| https://www.sportas.lt/rubrika/naujienos/tenisas/davis-cup | 2 |
|----------------------------------------------------------------------+-------|
| https://www.sportas.lt/rubrika/naujienos/tenisas/atp-challenger-tour | 2 |
|----------------------------------------------------------------------+-------|
| https://www.sportas.lt/rubrika/naujienos/tenisas/kita | 2 |
+----------------------------------------------------------------------+-------+
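Because the results above come back as rows with `url` and `depth` columns, it should be possible to narrow them with ordinary SQL clauses. The following is a hedged sketch, assuming the output of crawl() can be filtered and sorted like any other table; that behavior is not confirmed by the examples above.

```sql
-- Assumption: crawl() output accepts WHERE and ORDER BY like a regular table.
-- Column names url and depth are taken from the result tables shown above.
SELECT url
FROM crawl('https://langdb.ai/docs/langdb/', max_depth => 2)
WHERE depth = 2
  AND url NOT LIKE '%#%'   -- drop in-page anchor links such as .../#use-cases
ORDER BY url;
```

Filtering on `depth` and on the `#` fragment keeps only distinct pages discovered from the starting URL, which is often what you want before passing the URLs on to further processing.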