extract_text(path)

The extract_text() function extracts text from various file types, with specific options available for PDF files.

Syntax

SELECT * FROM extract_text(path => 'file_url', type => 'file_type', page_range => [start, end], per_page => true)

Parameters

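| Parameter  | Type       | Optional | Description                                        | Possible Values                     | Sample Value          |
|------------|------------|----------|----------------------------------------------------|-------------------------------------|-----------------------|
| path       | String     | No       | The file path to extract text from                 | Any valid URL                       | 'https://example.com' |
| type       | String     | Yes      | Type of file                                       | PDF, Markdown, Text, HTML           | 'pdf'                 |
| page_range | Array(Int) | Yes      | PDF only: the range of page numbers to extract     | Array of start and end page numbers | [1, 10]               |
| per_page   | Bool       | Yes      | PDF only: whether to chunk the output per page     | true, false                         | true                  |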

Usage

  • Basic usage (Raw text extraction):
SELECT * FROM extract_text(path => 'https://langdb.ai/docs/langdb/#use-cases');
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+---------+
| content | metadata | page_no |
+========================================================================================================================================================================================+
| <!doctype html> | {} | 1 |
| <html lang="en" dir="ltr" class="docs-wrapper plugin-docs plugin-id-default docs-version-current docs-doc-page docs-doc-id-langdb" data-has-hydrated="false"> | | |
| <head> | | |
| <meta charset="UTF-8"> | | |
| <meta name="generator" content="Docusaurus v3.3.2"> | | |
| <title data-rh="true">Introduction | LangDB Documentation</title><meta data-rh="true" name="viewport" content="width=device-width,initial-scale=1"><meta | | |
| data-rh="true" name="twitter:card" content="summary_large_image"><meta data-rh="true" property="og:url" content="https://langdb.ai/docs/langdb/"><meta | | |
| data-rh="true" property="og:locale" content="en"><meta data-rh="true" name="docusaurus_locale" content="en"><meta data-rh="true" name="docsearch:language" | | |
| content="en"><meta data-rh="true" name="docusaurus_version" content="current"><meta data-rh="true" name="docusaurus_tag" content="docs-default-current"><meta | | |
| data-rh="true" name="docsearch:version" content="current"><meta data-rh="true" name="docsearch:docusaurus_tag" content="docs-default-current"><meta | | |
| data-rh="true" property="og:title" content="Introduction | LangDB Documentation"><meta data-rh="true" name="description" content="Bridge the gap between data and | | |
| AI. Merge the power of SQL with the flexibility of handling both structured and unstructured data. With LangDB, developers can seamlessly create, iterate, | | |
| deploy, and monitor advanced RAG applications. Experience a new way of development using SQL harnessing the full potential of your data."><meta data-rh="true" | | |
| property="og:description" content="Bridge the gap between data and AI. Merge the power of SQL with the flexibility of handling both structured and unstructured | | |
| data. With LangDB, developers can seamlessly create, iterate, deploy, and monitor advanced RAG applications. Experience a new way of development using SQL | | |
| harnessing the full potential of your data."><link data-rh="true" rel="icon" href="/docs/img/favicon.ico"><link data-rh="true" rel="canonical" | | |
| href="https://langdb.ai/docs/langdb/"><link data-rh="true" rel="alternate" href="https://langdb.ai/docs/langdb/" hreflang="en"><link data-rh="true" | | |
| rel="alternate" href="https://langdb.ai/docs/langdb/" hreflang="x-default"><!-- Google Tag Manager --> | | |
| <script>!function(e,t,a,n,g){e[n]=e[n]||[],e[n].push({"gtm.start":(new Date).getTime(),event:"gtm.js"});var m=t.getElementsByTagName(a)[0],r=t.createElement(a);r | | |
| .async=!0,r.src="https://www.googletagmanager.com/gtm.js?id=GTM-PDMCRG9K",m.parentNode.insertBefore(r,m)}(window,document,"script","dataLayer")</script> | | |
| <!-- End Google Tag Manager --> | | |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+---------+
  • PDF text extraction with default options:
SELECT * FROM extract_text(path => 'https://jghsenglish.edublogs.org/files/2015/02/Fahrenheit-451.pdf', type => 'PDF');
  • PDF text extraction with extra options:
SELECT * FROM extract_text(
path => 'https://jghsenglish.edublogs.org/files/2015/02/Fahrenheit-451.pdf',
type => 'PDF',
page_range=> [15,20],
per_page => true
);
  • Text extraction from an array of bytes, by passing the result of load() to extract_text:
SELECT * FROM extract_text(
(SELECT * FROM load('s3://sample-onlineboutique-codefiles/onlineboutique-codefiles/just-deserts-spring-obooko-small.pdf')),
type => 'pdf',
per_page => false
);
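
The sample output in the basic usage example shows that extract_text returns content, metadata, and page_no columns. The query below is a minimal sketch of selecting only some of those columns for a per-page PDF extraction. It assumes that the result of extract_text can be filtered and ordered like any other table in a SELECT, and that page_no holds the PDF page number when per_page is true; neither assumption is confirmed on this page.

```sql
-- Sketch only: assumes WHERE and ORDER BY can be applied to extract_text results,
-- and that page_no matches the PDF page number when per_page => true.
SELECT page_no, content
FROM extract_text(
path => 'https://jghsenglish.edublogs.org/files/2015/02/Fahrenheit-451.pdf',
type => 'PDF',
per_page => true
)
WHERE page_no BETWEEN 15 AND 20
ORDER BY page_no;
```

If only a contiguous range of pages is needed, the documented page_range parameter is the more direct option; filtering on page_no may be useful when the selection depends on other conditions.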