This project provides a toolset to crawl websites, generate Markdown documentation, and make that documentation searchable via a Model Context Protocol (MCP) server, designed for integration with tools like Cursor.
- Web crawler (`crawler_cli`):
  - Crawls websites using `crawl4ai` and converts the pages to Markdown.
  - Saves the generated Markdown to `./storage/` by default.
- MCP server (`mcp_server`):
  - Loads the Markdown files from the `./storage/` directory.
  - Generates embeddings with `sentence-transformers` (`multi-qa-mpnet-base-dot-v1`).
  - Caching: uses a pickle file (`storage/document_chunks_cache.pkl`) to store processed chunks and embeddings. If the `.md` files in `./storage/` haven't changed, the server loads directly from the cache, resulting in much faster startup times. The cache is rebuilt whenever a `.md` file in `./storage/` is modified, added, or removed since the cache was last created (see the sketch after this list).
  - Exposes tools via `fastmcp` for clients like Cursor:
    - `list_documents`: Lists available crawled documents.
    - `get_document_headings`: Retrieves the heading structure for a document.
    - `search_documentation`: Performs semantic search over document chunks using vector similarity.
  - Uses the `stdio` transport for use within Cursor.
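The cache check described above works roughly like the following sketch. This is a simplified illustration, not the project's actual code: the real logic lives in `mcp_server/data_loader.py`, and the hash-based fingerprint and helper names here (`storage_fingerprint`, `build_chunks_and_embeddings`) are assumptions made for illustration.

```python
import hashlib
import pickle
from pathlib import Path

STORAGE_DIR = Path("./storage")
CACHE_FILE = STORAGE_DIR / "document_chunks_cache.pkl"


def storage_fingerprint() -> str:
    """Fingerprint the .md files so the cache can be invalidated when any of them change."""
    digest = hashlib.sha256()
    for md_file in sorted(STORAGE_DIR.glob("*.md")):
        stat = md_file.stat()
        digest.update(f"{md_file.name}:{stat.st_size}:{stat.st_mtime_ns}".encode())
    return digest.hexdigest()


def build_chunks_and_embeddings():
    """Placeholder for the real chunking + embedding step (handled by mcp_server/data_loader.py)."""
    return [], []


def load_or_rebuild():
    """Load chunks/embeddings from the pickle cache if the .md files are unchanged, else rebuild."""
    fingerprint = storage_fingerprint()
    if CACHE_FILE.exists():
        with CACHE_FILE.open("rb") as handle:
            cached = pickle.load(handle)  # locally produced file; see the security note below
        if cached.get("fingerprint") == fingerprint:
            return cached["chunks"], cached["embeddings"]
    chunks, embeddings = build_chunks_and_embeddings()
    with CACHE_FILE.open("wb") as handle:
        pickle.dump({"fingerprint": fingerprint, "chunks": chunks, "embeddings": embeddings}, handle)
    return chunks, embeddings
```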
The basic workflow is:

1. Use the `crawler_cli` tool to crawl a website and generate a `.md` file in `./storage/`.
2. Start the `mcp_server` (typically managed by an MCP client like Cursor).
3. The server loads, chunks, and embeds the `.md` files in `./storage/`.
4. Use the MCP tools (`list_documents`, `search_documentation`, etc.) to query the crawled content.

This project uses `uv` for dependency management and execution.
1. Install `uv`: follow the instructions on the uv website.
2. Clone the repository:

   ```bash
   git clone https://github.com/alizdavoodi/MCPDocSearch.git
   cd MCPDocSearch
   ```

3. Install dependencies:

   ```bash
   uv sync
   ```

   This command creates a virtual environment (usually `.venv`) and installs all dependencies listed in `pyproject.toml`.
Run the crawler using the `crawl.py` script or directly via `uv run`.

Basic example:

```bash
uv run python crawl.py https://docs.example.com
```

This will crawl `https://docs.example.com` with default settings and save the output to `./storage/docs.example.com.md`.

Example with options:

```bash
uv run python crawl.py https://docs.another.site --output ./storage/custom_name.md --max-depth 2 --keyword "API" --keyword "Reference" --exclude-pattern "*blog*"
```

View all options:

```bash
uv run python crawl.py --help
```

Key options include:
- `--output` / `-o`: Specify the output file path.
- `--max-depth` / `-d`: Set the crawl depth (must be between 1 and 5).
- `--include-pattern` / `--exclude-pattern`: Filter which URLs are crawled.
- `--keyword` / `-k`: Keywords for relevance scoring during the crawl.
- `--remove-links` / `--keep-links`: Control HTML cleaning.
- `--cache-mode`: Control `crawl4ai` caching (`DEFAULT`, `BYPASS`, `FORCE_REFRESH`).
- `--wait-for`: Wait for a specific time (seconds) or CSS selector before capturing content (e.g., `5` or `'css:.content'`). Useful for pages with delayed loading.
- `--js-code`: Execute custom JavaScript on the page before capturing content.
- `--page-load-timeout`: Set the maximum time (seconds) to wait for a page to load.
- `--wait-for-js-render` / `--no-wait-for-js-render`: Enable a specific script to better handle JavaScript-heavy single-page applications (SPAs) by scrolling and clicking potential "load more" buttons. Automatically sets a default wait time if `--wait-for` is not specified.

Sometimes you might want to crawl only a specific subsection of a documentation site. This often requires some trial and error with `--include-pattern` and `--max-depth`.
- `--include-pattern`: Restricts the crawler to only follow links whose URLs match the given pattern(s). Use wildcards (`*`) for flexibility.
- `--max-depth`: Controls how many "clicks" away from the starting URL the crawler will go. A depth of 1 means it only crawls pages directly linked from the start URL. A depth of 2 means it crawls those pages and pages linked from them (if they also match the include patterns), and so on.

Example: crawling only the Pulsar Admin API section. Suppose you want only the content under `https://pulsar.apache.org/docs/4.0.x/admin-api-*`:

1. Start from a page inside that section, e.g. `https://pulsar.apache.org/docs/4.0.x/admin-api-overview/`.
2. Restrict the crawl to URLs containing `admin-api`: `--include-pattern "*admin-api*"`.
3. Start with a max depth of `2` and increase if needed.
4. Add `-v` to see which URLs are being visited or skipped, which helps debug the patterns and depth.

```bash
uv run python crawl.py https://pulsar.apache.org/docs/4.0.x/admin-api-overview/ -v --include-pattern "*admin-api*" --max-depth 2
```

Check the output file (`./storage/pulsar.apache.org.md` by default in this case). If pages are missing, try increasing `--max-depth` to `3`. If too many unrelated pages are included, make the `--include-pattern` more specific or add `--exclude-pattern` rules.
The MCP server is designed to be run by an MCP client like Cursor via the `stdio` transport. The command to run the server is:

```bash
python -m mcp_server.main
```

However, it needs to be run from the project's root directory (`MCPDocSearch`) so that Python can find the `mcp_server` module.
The MCP server generates embeddings locally the first time it runs or whenever the source Markdown files in `./storage/` change. This process involves loading a machine learning model and processing all the text chunks.
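For a sense of what that step involves, generating embeddings with `sentence-transformers` looks roughly like the sketch below. The chunk texts are placeholders; only the model name comes from this project.

```python
from sentence_transformers import SentenceTransformer

# Model named in this README; the weights are downloaded on first use.
model = SentenceTransformer("multi-qa-mpnet-base-dot-v1")

# Placeholder chunks; the real server builds these from the Markdown files in ./storage/.
chunks = [
    "## Installation\nRun `uv sync` to install dependencies.",
    "## Crawling\nUse crawl.py to generate Markdown documentation.",
]

# One vector per chunk (768 dimensions for this model).
embeddings = model.encode(chunks, convert_to_tensor=True, show_progress_bar=True)
print(embeddings.shape)  # torch.Size([2, 768])
```

The one-time model download and the encoding pass are why the first startup is noticeably slower than subsequent, cache-backed startups.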
To use this server with Cursor, create a `.cursor/mcp.json` file in the root of this project (`MCPDocSearch/.cursor/mcp.json`) with the following content:

```json
{
  "mcpServers": {
    "doc-query-server": {
      "command": "uv",
      "args": [
        "--directory",
        // IMPORTANT: Replace with the ABSOLUTE path to this project directory on your machine
        "/path/to/your/MCPDocSearch",
        "run",
        "python",
        "-m",
        "mcp_server.main"
      ],
      "env": {}
    }
  }
}
```
Explanation:

- `"doc-query-server"`: A name for the server within Cursor.
- `"command": "uv"`: Specifies `uv` as the command runner.
- `"args"`:
  - `"--directory", "/path/to/your/MCPDocSearch"`: Crucially, tells `uv` to change its working directory to your project root before running the command. Replace `/path/to/your/MCPDocSearch` with the actual absolute path on your system.
  - `"run", "python", "-m", "mcp_server.main"`: The command `uv` will execute within the correct directory and virtual environment.

After saving this file and restarting Cursor, the "doc-query-server" should become available in Cursor's MCP settings and usable by the Agent (e.g., `@doc-query-server search documentation for "how to install"`).
For Claude for Desktop, you can follow the official documentation to set up the MCP server.
Key libraries used:
- `crawl4ai`: Core web crawling functionality.
- `fastmcp`: MCP server implementation.
- `sentence-transformers`: Generating text embeddings.
- `torch`: Required by `sentence-transformers`.
- `typer`: Building the crawler CLI.
- `uv`: Project and environment management.
- `beautifulsoup4` (via `crawl4ai`): HTML parsing.
- `rich`: Enhanced terminal output.

The project follows this basic flow:
1. `crawler_cli`: You run this tool, providing a starting URL and options.
2. The tool uses `crawl4ai` to fetch web pages, following links based on the configured rules (depth, patterns).
3. Optionally, the HTML content is cleaned (removing navigation and links) using BeautifulSoup (see `crawler_cli/markdown.py`).
4. The cleaned HTML is converted to Markdown by `crawl4ai`.
5. The generated Markdown content is saved to a file in the `./storage/` directory.
6. When the MCP server starts (usually via Cursor's config), it runs `mcp_server/data_loader.py`.
7. The loader checks for a valid cache file (`.pkl`). If valid, it loads chunks and embeddings from the cache; otherwise, it reads the `.md` files from `./storage/`.
8. Chunks are embedded with `sentence-transformers` and stored in memory (and saved to the cache).
9. The server exposes tools (`list_documents`, `search_documentation`, etc.) via `fastmcp` (see `mcp_server/mcp_tools.py`). `search_documentation` uses the pre-computed embeddings to find relevant chunks based on semantic similarity to the query (sketched below).
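The real tool definitions live in `mcp_server/mcp_tools.py`. As a rough, hypothetical sketch (not the project's actual code) of how a semantic-search tool can be registered with `fastmcp` and scored against pre-computed embeddings:

```python
from fastmcp import FastMCP
from sentence_transformers import SentenceTransformer, util

mcp = FastMCP("doc-query-server")
model = SentenceTransformer("multi-qa-mpnet-base-dot-v1")

# Illustrative in-memory corpus; the real server loads chunks and embeddings
# from ./storage/ (or the pickle cache) at startup.
chunks = ["Install dependencies with `uv sync`.", "Run crawl.py to generate Markdown docs."]
chunk_embeddings = model.encode(chunks, convert_to_tensor=True)


@mcp.tool()
def search_documentation(query: str, top_k: int = 3) -> list[dict]:
    """Return the document chunks most similar to the query."""
    query_embedding = model.encode(query, convert_to_tensor=True)
    scores = util.dot_score(query_embedding, chunk_embeddings)[0]
    best = scores.topk(k=min(top_k, len(chunks)))
    return [
        {"score": float(score), "text": chunks[int(idx)]}
        for score, idx in zip(best.values, best.indices)
    ]


if __name__ == "__main__":
    # stdio transport, as used by Cursor.
    mcp.run()
```

`util.dot_score` is used here because `multi-qa-mpnet-base-dot-v1` is tuned for dot-product similarity; the project's actual scoring and chunk handling may differ.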
This project is licensed under the MIT License - see the LICENSE file for details.
Contributions are welcome! Please feel free to open an issue or submit a pull request.
Security note: this project uses Python's `pickle` module to cache processed data (`storage/document_chunks_cache.pkl`). Unpickling data from untrusted sources can be insecure. Ensure that the `./storage/` directory is only writable by trusted users/processes.