extract_text(path)

The extract_text() function extracts text from files of several types (PDF, Markdown, plain text, and HTML). Two extra options, page_range and per_page, apply only to PDF files.

Syntax

SELECT * FROM extract_text(path => 'file_url', type => 'file_type', page_range => [start, end], per_page => true)

Parameters

Parameter	Type	Optional	Description	Possible Values	Sample Value
`path`	String	No	The file path to extract text from	Any valid URL	`'https://example.com'`
`type`	String	Yes	Type of file	PDF, Markdown, Text, HTML	`'pdf'`
`page_range`	Array(Int)	Yes	Extra parameter for the PDF file type: the range of page numbers to extract	Array of start and end page numbers	`[1, 10]`
`per_page`	Bool	Yes	Extra parameter for the PDF file type: whether to split the output into one chunk per page	true, false	`true`

Usage

Basic usage (Raw text extraction):

SELECT * FROM extract_text(path => 'https://langdb.ai/docs/langdb/#use-cases');

+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+---------+
| content                                                                                                                                                           | metadata | page_no |
+========================================================================================================================================================================================+
| <!doctype html>                                                                                                                                                   | {}       | 1       |
| <html lang="en" dir="ltr" class="docs-wrapper plugin-docs plugin-id-default docs-version-current docs-doc-page docs-doc-id-langdb" data-has-hydrated="false">     |          |         |
| <head>                                                                                                                                                            |          |         |
| <meta charset="UTF-8">                                                                                                                                            |          |         |
| <meta name="generator" content="Docusaurus v3.3.2">                                                                                                               |          |         |
| <title data-rh="true">Introduction | LangDB Documentation</title><meta data-rh="true" name="viewport" content="width=device-width,initial-scale=1"><meta          |          |         |
| data-rh="true" name="twitter:card" content="summary_large_image"><meta data-rh="true" property="og:url" content="https://langdb.ai/docs/langdb/"><meta            |          |         |
| data-rh="true" property="og:locale" content="en"><meta data-rh="true" name="docusaurus_locale" content="en"><meta data-rh="true" name="docsearch:language"        |          |         |
| content="en"><meta data-rh="true" name="docusaurus_version" content="current"><meta data-rh="true" name="docusaurus_tag" content="docs-default-current"><meta     |          |         |
| data-rh="true" name="docsearch:version" content="current"><meta data-rh="true" name="docsearch:docusaurus_tag" content="docs-default-current"><meta               |          |         |
| data-rh="true" property="og:title" content="Introduction | LangDB Documentation"><meta data-rh="true" name="description" content="Bridge the gap between data and |          |         |
| AI. Merge the power of SQL with the flexibility of handling both structured and unstructured data. With LangDB, developers can seamlessly create, iterate,        |          |         |
| deploy, and monitor advanced RAG applications. Experience a new way of development using SQL harnessing the full potential of your data."><meta data-rh="true"    |          |         |
| property="og:description" content="Bridge the gap between data and AI. Merge the power of SQL with the flexibility of handling both structured and unstructured   |          |         |
| data. With LangDB, developers can seamlessly create, iterate, deploy, and monitor advanced RAG applications. Experience a new way of development using SQL        |          |         |
| harnessing the full potential of your data."><link data-rh="true" rel="icon" href="/docs/img/favicon.ico"><link data-rh="true" rel="canonical"                    |          |         |
| href="https://langdb.ai/docs/langdb/"><link data-rh="true" rel="alternate" href="https://langdb.ai/docs/langdb/" hreflang="en"><link data-rh="true"               |          |         |
| rel="alternate" href="https://langdb.ai/docs/langdb/" hreflang="x-default"><!-- Google Tag Manager -->                                                            |          |         |
| <script>!function(e,t,a,n,g){e[n]=e[n]||[],e[n].push({"gtm.start":(new Date).getTime(),event:"gtm.js"});var m=t.getElementsByTagName(a)[0],r=t.createElement(a);r |          |         |
| .async=!0,r.src="https://www.googletagmanager.com/gtm.js?id=GTM-PDMCRG9K",m.parentNode.insertBefore(r,m)}(window,document,"script","dataLayer")</script>          |          |         |
|             <!-- End Google Tag Manager -->                                                                                                                       |          |         |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+---------+

PDF text extraction with default options:

SELECT * FROM extract_text(path => 'https://jghsenglish.edublogs.org/files/2015/02/Fahrenheit-451.pdf', type => 'PDF');

PDF text extraction with extra options:

SELECT * FROM extract_text(
  path => 'https://jghsenglish.edublogs.org/files/2015/02/Fahrenheit-451.pdf',
  type => 'PDF',
  page_range=> [15,20],
  per_page => true
);
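
The per-page query above can be narrowed to the columns of interest. The following is a minimal sketch, assuming that PDF extraction returns the same content, metadata, and page_no columns shown in the HTML output earlier; that column layout for PDFs is an assumption, not something this page states.

```sql
-- Sketch: list the extracted pages in order, one row per page.
-- Assumes PDF results expose the same columns as the HTML example
-- (content, metadata, page_no).
SELECT page_no, content
FROM extract_text(
  path => 'https://jghsenglish.edublogs.org/files/2015/02/Fahrenheit-451.pdf',
  type => 'PDF',
  page_range => [15, 20],
  per_page => true
)
ORDER BY page_no;
```

Selecting page_no alongside content keeps each chunk traceable to its source page, which is useful when the chunks are later indexed or quoted.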

extract_text can also be combined with load to extract text from an array of bytes instead of a URL:

SELECT * FROM extract_text(
  (SELECT * FROM load('s3://sample-onlineboutique-codefiles/onlineboutique-codefiles/just-deserts-spring-obooko-small.pdf')),
  type => 'pdf',
  per_page => false
);
