# Deep Research MCP Server

A Model Context Protocol server that performs comprehensive web research by combining the Tavily Search and Crawl APIs to gather extensive information and produce structured JSON output tailored for LLMs to create detailed markdown documents.
The Deep Research MCP Server is a Model Context Protocol (MCP) compliant server designed to perform comprehensive web research. It leverages Tavily's powerful Search and new Crawl APIs to gather extensive, up-to-date information on a given topic. The server then aggregates this data along with documentation generation instructions into a structured JSON output, perfectly tailored for Large Language Models (LLMs) to create detailed and high-quality markdown documents.
The documentation prompt can be customized either by setting the `DOCUMENTATION_PROMPT` environment variable or by passing the `documentation_prompt` argument directly to the tool.

## Installation

You can run the server directly using `npx` without a global installation:

```bash
npx @pinkpixel/deep-research-mcp
```

Alternatively, install it globally:

```bash
npm install -g @pinkpixel/deep-research-mcp
```

Then you can run it using:

```bash
deep-research-mcp
```
Or clone the repository and install dependencies:

```bash
git clone https://github.com/your-username/deep-research-mcp.git
cd deep-research-mcp
npm install
```
## Configuration

The server requires a Tavily API key and can optionally accept a custom documentation prompt.

To use the server with an MCP client, add an entry like the following to your client's MCP configuration:

```json
{
  "mcpServers": {
    "deep-research": {
      "command": "npx",
      "args": ["-y", "@pinkpixel/deep-research-mcp"],
      "env": {
        "TAVILY_API_KEY": "tvly-YOUR_ACTUAL_API_KEY_HERE",
        "DOCUMENTATION_PROMPT": "Your custom, detailed instructions for the LLM on how to generate markdown documents from the research data...",
        "RESEARCH_OUTPUT_PATH": "/path/to/your/research/output/folder"
      }
    }
  }
}
```

`TAVILY_API_KEY` is required. `DOCUMENTATION_PROMPT` and `RESEARCH_OUTPUT_PATH` are optional; if not provided, the default prompt and default output path are used.
### Tavily API Key (Required)

Set the `TAVILY_API_KEY` environment variable to your Tavily API key.

Methods:

1. `.env` file: Create a `.env` file in the project root (if running locally for development):

   ```
   TAVILY_API_KEY="tvly-YOUR_ACTUAL_API_KEY"
   ```

2. Directly in the command line:

   ```bash
   TAVILY_API_KEY="tvly-YOUR_ACTUAL_API_KEY" npx @pinkpixel/deep-research-mcp
   ```
### Custom Documentation Prompt (Optional)

You can override the default comprehensive documentation prompt by setting the `DOCUMENTATION_PROMPT` environment variable.

Methods (in order of precedence):

1. The `documentation_prompt` parameter passed when calling the `deep-research-tool` (takes highest precedence)
2. The `DOCUMENTATION_PROMPT` environment variable

Setting via `.env` file:

```
DOCUMENTATION_PROMPT="Your custom, detailed instructions for the LLM on how to generate markdown..."
```

Or directly in the command line:

```bash
DOCUMENTATION_PROMPT="Your custom prompt..." TAVILY_API_KEY="tvly-YOUR_KEY" npx @pinkpixel/deep-research-mcp
```
### Research Output Path (Optional)

You can specify where research documents and images should be saved. If not configured, a default timestamped path in the user's Documents folder will be used.

Methods (in order of precedence):

1. The `output_path` parameter passed when calling the `deep-research-tool` (takes highest precedence)
2. The `RESEARCH_OUTPUT_PATH` environment variable
3. Default: `~/Documents/research/YYYY-MM-DDTHH-MM-SS/`

Setting via `.env` file:

```
RESEARCH_OUTPUT_PATH="/path/to/your/research/folder"
```

Or directly in the command line:

```bash
RESEARCH_OUTPUT_PATH="/path/to/your/research/folder" TAVILY_API_KEY="tvly-YOUR_KEY" npx @pinkpixel/deep-research-mcp
```
## Running the Server

Development (with auto-reload): If you've cloned the repository and are in the project directory:

```bash
npm run dev
```

This uses `nodemon` and `ts-node` to watch for changes and restart the server.
Production/Standalone: First, build the TypeScript code:

```bash
npm run build
```

Then, start the server:

```bash
npm start
```
With NPX or a global install (ensure environment variables are set as described in Configuration):

```bash
npx @pinkpixel/deep-research-mcp
```

or, if globally installed:

```bash
deep-research-mcp
```
The server will listen for MCP requests on stdio.
## How It Works

1. An MCP client (such as an LLM agent) sends a `CallToolRequest` to this MCP server, specifying the `deep-research-tool` and providing a query and other optional parameters.
2. The `deep-research-tool` first performs a Tavily Search to find relevant web sources, then crawls those sources with the Tavily Crawl API to extract detailed content.
3. The aggregated findings are returned as structured JSON, together with the `documentation_instructions`, for the calling LLM to generate a comprehensive markdown document.
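For illustration, here is a minimal client-side sketch in TypeScript. It assumes the standard `@modelcontextprotocol/sdk` client API; names and exact signatures should be verified against the SDK version you use:

```typescript
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

// Spawn the server over stdio, as an MCP client would.
const transport = new StdioClientTransport({
  command: "npx",
  args: ["-y", "@pinkpixel/deep-research-mcp"],
  env: { TAVILY_API_KEY: "tvly-YOUR_ACTUAL_API_KEY" },
});

const client = new Client({ name: "example-client", version: "1.0.0" });
await client.connect(transport);

// Invoke the deep-research-tool with a query and a couple of optional parameters.
const result = await client.callTool({
  name: "deep-research-tool",
  arguments: {
    query: "What are the latest advancements in quantum-resistant cryptography?",
    max_search_results: 5,
    crawl_max_depth: 1,
  },
});

// The tool's JSON output is returned as text content
// (see the Output Structure section below).
console.log(result.content);
```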
## The `deep-research-tool`

This is the primary tool exposed by the server.

### Output Structure

The tool returns a JSON string with the following structure:
{ "documentation_instructions": "string", // The detailed prompt for the LLM to generate the markdown. "original_query": "string", // The initial query provided to the tool. "search_summary": "string | null", // An LLM-generated answer/summary from Tavily's search phase (if include_answer was true). "research_data": [ // Array of findings, one element per source. { "search_rank": "number", "original_url": "string", // URL of the source found by search. "title": "string", // Title of the web page. "initial_content_snippet": "string",// Content snippet from the initial search result. "search_score": "number | undefined",// Relevance score from Tavily search. "published_date": "string | undefined",// Publication date (if 'news' topic and available). "crawled_data": [ // Array of pages crawled starting from original_url. { "url": "string", // URL of the specific page crawled. "raw_content": "string | null", // Rich, extracted content from this page. "images": ["string", "..."] // Array of image URLs found on this page. } ], "crawl_errors": ["string", "..."] // Array of error messages if crawling this source failed or had issues. } // ... more sources ], "output_path": "string" // Path where research documents and images should be saved. }
### Tool Parameters

The `deep-research-tool` accepts the following parameters in its `arguments` object:

- `query` (string, required): The main research topic or question.
- `documentation_prompt` (string, optional): Custom prompt for LLM documentation generation. Overrides the `DOCUMENTATION_PROMPT` environment variable and the server's built-in default prompt. If not provided here, the server checks the environment variable, then falls back to the default.
- `output_path` (string, optional): Path where generated research documents and images should be saved. Overrides the `RESEARCH_OUTPUT_PATH` environment variable. If neither is set, a timestamped folder in the user's Documents directory will be used.
- `search_depth` (string, optional, default: `"advanced"`): Depth of the initial Tavily search. Options: `"basic"`, `"advanced"`. Advanced search is tailored for more relevant sources.
- `topic` (string, optional, default: `"general"`): Category for the Tavily search. Options: `"general"`, `"news"`.
- `days` (number, optional): For `topic: "news"`, the number of days back from the current date to include search results.
- `time_range` (string, optional): Time range for search results (e.g., `"d"` for day, `"w"` for week, `"m"` for month, `"y"` for year).
- `max_search_results` (number, optional, default: `7`): Maximum number of search results to retrieve and consider for crawling (1-20).
- `chunks_per_source` (number, optional, default: `3`): For `search_depth: "advanced"`, the number of content chunks to retrieve from each source (1-3).
- `include_search_images` (boolean, optional, default: `false`): Include a list of query-related image URLs from the initial search.
- `include_search_image_descriptions` (boolean, optional, default: `false`): Include image descriptions along with URLs from the initial search.
- `include_answer` (boolean or string, optional, default: `false`): Include an LLM-generated answer from Tavily based on search results. Options: `true` (implies `"basic"`), `false`, `"basic"`, `"advanced"`.
- `include_raw_content_search` (boolean, optional, default: `false`): Include the cleaned and parsed HTML content of each initial search result.
- `include_domains_search` (array of strings, optional, default: `[]`): A list of domains to specifically include in the search results.
- `exclude_domains_search` (array of strings, optional, default: `[]`): A list of domains to specifically exclude from the search results.
- `search_timeout` (number, optional, default: `60`): Timeout in seconds for Tavily search requests.
- `crawl_max_depth` (number, optional, default: `1`): Max depth of the crawl from the base URL. `0` means only the base URL; `1` means the base URL and links found on it, and so on.
- `crawl_max_breadth` (number, optional, default: `5`): Max number of links to follow per level of the crawl tree (i.e., per page).
- `crawl_limit` (number, optional, default: `10`): Total number of links the crawler will process starting from a single root URL before stopping.
- `crawl_instructions` (string, optional): Natural-language instructions for the crawler on how to approach crawling the site.
- `crawl_select_paths` (array of strings, optional, default: `[]`): Regex patterns to select only URLs with specific path patterns for crawling (e.g., `"/docs/.*"`).
- `crawl_select_domains` (array of strings, optional, default: `[]`): Regex patterns to restrict crawling to specific domains or subdomains (e.g., `"^docs\\.example\\.com$"`). If `crawl_allow_external` is `false` (the default) and this is empty, crawling is focused on the domain of the URL being crawled; this parameter overrides that focus.
- `crawl_exclude_paths` (array of strings, optional, default: `[]`): Regex patterns to exclude URLs with specific path patterns from crawling.
- `crawl_exclude_domains` (array of strings, optional, default: `[]`): Regex patterns to exclude specific domains or subdomains from crawling.
- `crawl_allow_external` (boolean, optional, default: `false`): Whether to allow the crawler to follow links to external domains.
- `crawl_include_images` (boolean, optional, default: `true`): Whether to extract image URLs from the crawled pages.
- `crawl_categories` (array of strings, optional, default: `[]`): Filter URLs for crawling using predefined categories (e.g., `"Blog"`, `"Documentation"`, `"Careers"`). Refer to the Tavily Crawl API for all options.
- `crawl_extract_depth` (string, optional, default: `"advanced"`): Depth of content extraction during the crawl. Options: `"basic"`, `"advanced"`. Advanced retrieves more data (tables, embedded content) but may have higher latency.
- `crawl_timeout` (number, optional, default: `180`): Timeout in seconds for each Tavily Crawl request.
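As an illustration of how these parameters combine, here is a hypothetical `arguments` object for a recent-news query. The query string and every value below are examples only, not recommended settings:

```json
{
  "name": "deep-research-tool",
  "arguments": {
    "query": "Recent breakthroughs in solid-state battery technology",
    "topic": "news",
    "days": 30,
    "max_search_results": 5,
    "include_answer": true,
    "crawl_max_depth": 0,
    "crawl_extract_depth": "basic"
  }
}
```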
The `documentation_prompt` is an essential part of this tool, as it guides the LLM in how to format and structure the research findings. The system uses the following precedence to determine which prompt to use:
1. If the LLM/agent provides a `documentation_prompt` parameter in the tool call, that prompt is used; it takes highest precedence.
2. If no parameter is provided in the tool call but the `DOCUMENTATION_PROMPT` environment variable is set, the environment variable's value is used.
3. If neither of the above is set, the server falls back to its built-in default prompt.

This flexibility allows end users to configure a preferred prompt once via the environment while still letting an agent override it on a per-call basis.
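The resolution logic amounts to a simple fallback chain. A hypothetical sketch — the function and parameter names are illustrative, not the server's actual internals:

```typescript
// Illustrative only: resolve the documentation prompt using the
// precedence described above (tool parameter > env var > default).
function resolveDocumentationPrompt(
  toolCallParam: string | undefined,
  defaultPrompt: string,
): string {
  return toolCallParam ?? process.env.DOCUMENTATION_PROMPT ?? defaultPrompt;
}
```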
The `output_path` parameter determines where research documents and images will be saved. This is especially important whenever the LLM needs to write generated documents or downloaded images to disk.

The system follows this precedence to determine the output path:

1. If the LLM/agent provides an `output_path` parameter in the tool call, that path is used; it takes highest precedence.
2. If no parameter is provided but the `RESEARCH_OUTPUT_PATH` environment variable is set, that path is used.
3. If neither of the above is set, a timestamped default is used: `~/Documents/research/YYYY-MM-DDTHH-MM-SS/`
The LLM receives the final resolved output path in the tool's response JSON as the `output_path` field, so it always knows where to save generated content.

Note for LLMs: When processing the tool results, check the `output_path` field to determine where to save any files you generate. This path is guaranteed to be present in the response.
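Since `output_path` is guaranteed to be present, an agent-side script can create the folder and write its generated markdown there. A minimal sketch — the file name and the `toolResultText`/`markdown` variables are hypothetical:

```typescript
import { mkdir, writeFile } from "node:fs/promises";
import { join } from "node:path";

// `toolResultText` holds the tool's JSON output; `markdown` holds the
// document the LLM generated. Both are assumed to exist in scope.
declare const toolResultText: string;
declare const markdown: string;

const { output_path } = JSON.parse(toolResultText);

// Create the resolved output folder (a timestamped default if unconfigured)
// and save the generated document inside it.
await mkdir(output_path, { recursive: true });
await writeFile(join(output_path, "research-report.md"), markdown, "utf8");
```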
## Guidance for LLMs

As an LLM using the output of the `deep-research-tool`, your primary goal is to generate a comprehensive, accurate, and well-structured markdown document that addresses the `original_query`.
Key steps:

1. Parse the JSON output and extract the `documentation_instructions`, `original_query`, `search_summary`, and `research_data` fields.
2. Follow the `documentation_instructions`: This field contains the primary set of guidelines for creating the markdown document. It will either be the server's extensive default prompt (focused on high-quality technical documentation) or a custom prompt provided by the user. Follow these instructions meticulously regarding content quality, style, structure, markdown formatting, and handling of technical details.
3. Use `research_data` for content:
   - The `research_data` array is your main source of information. Each object in this array represents a distinct web source.
   - Use `title`, `original_url`, and `initial_content_snippet` for context.
   - Dive into the `crawled_data` array within each source. Specifically, the `raw_content` field of each `crawled_data` object contains the rich text extracted from that page.
   - Synthesize information across all sources in `research_data` to provide a comprehensive view. Do not just list content from one source after another.
   - If `crawled_data[].images` are present, you can mention them or list their URLs if appropriate and aligned with the `documentation_instructions`.
   - If `crawl_errors` are present for a source, that particular source might be incomplete. You can choose to note this subtly if it impacts coverage.
4. Address the `original_query`: The final document must comprehensively answer or address the `original_query`.
5. Use the `search_summary`: If the `search_summary` field is present (from Tavily's `include_answer` feature), it can serve as a helpful starting point, an executive summary, or a way to frame the introduction. However, the main body of your document should be built from the more detailed `research_data`.
6. Synthesize rather than copy: Do not reproduce `raw_content` verbatim. You must process, understand, synthesize, rephrase, and organize the information from various sources into a coherent, well-written document that flows logically, as per the `documentation_instructions`.
7. Apply the markdown formatting required by the `documentation_instructions` (headings, lists, code blocks, emphasis, links, etc.).
8. Handle large inputs: `research_data` can be extensive. If you have limitations on processing large inputs, the system calling you might need to provide you with chunks of the `research_data` or make multiple requests to you to build the document section by section. The `deep-research-tool` itself will always attempt to return all collected data in one JSON output.
9. Maintain the technical depth required by the `documentation_instructions`. Do not oversimplify.
10. If the `documentation_instructions` include guidelines for visual appeal (like colored text or emojis using HTML), apply them judiciously.

### Example LLM Invocation Thought Process
Agent to LLM:
"Okay, I've called the deep-research-tool
with the query '\What are the latest advancements in quantum-resistant cryptography?' and requested 5 sources with advanced crawling. Here's the JSON output:
{ ... (JSON output from the tool) ... }
Now, using the documentation_instructions
provided within this JSON, and the research_data
, please generate a comprehensive markdown document on 'The Latest Advancements in Quantum-Resistant Cryptography'. Ensure you follow all formatting and content guidelines from the instructions."
### Example `CallToolRequest` (Conceptual Arguments)

An agent might make a call to the MCP server with arguments like this:

```json
{
  "name": "deep-research-tool",
  "arguments": {
    "query": "Explain the architecture of modern data lakes and data lakehouses.",
    "max_search_results": 5,
    "search_depth": "advanced",
    "topic": "general",
    "crawl_max_depth": 1,
    "crawl_extract_depth": "advanced",
    "include_answer": true,
    "documentation_prompt": "Generate a highly technical whitepaper. Start with an abstract, then introduction, detailed sections for data lakes, data lakehouses, comparison, use cases, and a future outlook. Use academic tone. Include all diagrams mentioned by URL if possible as [Diagram: URL].",
    "output_path": "C:/Users/username/Documents/research/datalakes-whitepaper"
  }
}
```
## Troubleshooting

- Ensure `TAVILY_API_KEY` is correctly set and valid.
- Ensure `@modelcontextprotocol/sdk` and `@tavily/core` are installed and up-to-date.

## Contributing

Contributions are welcome! Please feel free to submit issues, fork the repository, and create pull requests.
## License

This project is licensed under the MIT License - see the LICENSE file for details.