Enables intelligent web scraping through a browser automation tool that can search Google, navigate to webpages, and extract content from various websites including GitHub, Stack Overflow, and documentation sites.
A browser automation tool built with MCP (Model Context Protocol) that combines web scraping with LLM-powered intelligence. The agent can search Google, navigate to webpages, and scrape content from websites including GitHub, Stack Overflow, and documentation sites.
This project uses a client-server architecture powered by MCP: main.py implements the MCP server, and client.py implements the Mistral AI client.
git clone https://github.com/yourusername/browser-automation-agent.git
cd browser-automation-agent
pip install -r requirements.txt
playwright install
Create a .env file in the project root and add your Mistral AI API key:
MISTRAL_API_KEY=your_api_key_here
Start the MCP server:
python main.py
Then start the Mistral AI client:
python client.py
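The .env step above implies that the client needs MISTRAL_API_KEY to be available at startup. The following is a minimal sketch of a startup check, assuming the variable has already been loaded into the process environment (for example by a loader such as python-dotenv, which this README does not confirm). The helper name get_mistral_api_key is hypothetical and is not part of the project.

```python
import os


def get_mistral_api_key(env=None):
    """Return the Mistral AI API key or raise a descriptive error.

    Hypothetical helper: the project's real client.py may read the key differently.
    """
    # Allow a plain dict to be passed in so the check is easy to exercise.
    env = os.environ if env is None else env
    key = env.get("MISTRAL_API_KEY", "").strip()
    # The README's placeholder value means the key was never filled in.
    if not key or key == "your_api_key_here":
        raise RuntimeError(
            "MISTRAL_API_KEY is not set; add it to the .env file in the project root"
        )
    return key
```

Failing early with a clear message tends to be easier to debug than an authentication error raised later by the API call.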
Once both the server and client are running:
get_top_google_url: Searches Google and returns the top result URL for a given query.
browse_and_scrape: Navigates to a URL and scrapes content based on the website type.
scrape_github: Specializes in extracting README content and code blocks from GitHub repositories.
scrape_stackoverflow: Extracts questions, answers, comments, and code blocks from Stack Overflow pages.
scrape_documentation: Optimized for extracting documentation content and code examples.
scrape_generic: Extracts paragraph text and code blocks from generic websites.
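Since browse_and_scrape chooses how to scrape based on the website type, the selection step might look roughly like the sketch below. This is an illustration under assumptions, not the project's actual logic: the function name pick_scraper and the documentation-site heuristics (a docs. subdomain, a readthedocs host, or a /docs path) are hypothetical. Only the returned tool names come from the list above.

```python
from urllib.parse import urlparse


def pick_scraper(url):
    """Map a URL to the name of the scraping tool that fits its site type."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    path = parsed.path.lower()

    if host == "github.com" or host.endswith(".github.com"):
        return "scrape_github"
    if host == "stackoverflow.com" or host.endswith(".stackoverflow.com"):
        return "scrape_stackoverflow"
    # Assumed heuristic: documentation sites often use a docs subdomain,
    # a readthedocs host, or a /docs path prefix.
    if host.startswith("docs.") or "readthedocs" in host or path.startswith("/docs"):
        return "scrape_documentation"
    # Anything unrecognized falls back to the generic scraper.
    return "scrape_generic"
```

For example, pick_scraper("https://github.com/user/repo") selects scrape_github, while an unrecognized site falls through to scrape_generic.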
browser-automation-agent/
├── main.py # MCP server implementation
├── client.py # Mistral AI client implementation
├── requirements.txt # Project dependencies
├── .env # Environment variables (API keys)
└── README.md # Project documentation
The agent generates two types of output files with timestamps:
final_page_YYYYMMDD_HHMMSS.png: Screenshot of the final page state
scraped_content_YYYYMMDD_HHMMSS.txt: Extracted text content from the page
You can modify the following parameters in the code:
width and height in browse_and_scrape
headless=True for invisible browser operation
num_results in get_top_google_url
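To make the timestamped output names and the tunable parameters concrete, here is a small sketch. The class ScrapeSettings, the function output_filenames, and every default value are hypothetical and chosen only for illustration. The README confirms only the parameter names and the YYYYMMDD_HHMMSS filename pattern.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class ScrapeSettings:
    # All defaults below are illustrative assumptions, not the project's values.
    width: int = 1280       # browser width used by browse_and_scrape
    height: int = 800       # browser height used by browse_and_scrape
    headless: bool = True   # True runs the browser without a visible window
    num_results: int = 5    # results considered by get_top_google_url


def output_filenames(now=None):
    """Return (screenshot_name, text_name) stamped as YYYYMMDD_HHMMSS."""
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    return f"final_page_{stamp}.png", f"scraped_content_{stamp}.txt"
```

Passing a fixed datetime to output_filenames makes the names reproducible, which can help when comparing runs.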
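Make sure the Playwright browsers are installed with playwright install.
Make sure the Mistral AI API key is set in the .env file.
Update the path to main.py in client.py if needed.
Contributions are welcome! Please feel free to submit a Pull Request.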
Built with MCP, Playwright, and Mistral AI