Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.

nodejs javascript npm crawler scraper automation typescript web-crawler headless scraping crawling web-scraping web-crawling headless-chrome apify puppeteer playwright

Updated Jul 27, 2024
TypeScript

sjdirect / abot

Cross Platform C# web crawler framework built for speed and flexibility. Please star this project! +1.

Updated Jul 17, 2024
C#

xianhu / PSpider

A simple, easy-to-use Python crawler framework. QQ discussion group: 597510560

python crawler multi-threading spider multiprocessing web-crawler proxies python-spider web-spider

Updated Jun 10, 2022
Python

adithya-s-k / omniparse

Ingest, parse, and optimize any data format ➡️ from documents to multimedia ➡️ for enhanced compatibility with GenAI frameworks

ocr parser-library web-crawler parse-server whisper-api ingestion-api vision-transformer omniparser

Updated Jul 22, 2024
Python

apache / incubator-stormcrawler

A scalable, mature and versatile web crawler based on Apache Storm

java crawler web-crawler distributed apache-storm stormcrawler

Updated Jul 25, 2024
HTML

apify / crawlee-python

Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation.

python crawler scraper automation web-crawler headless scraping crawling pip web-scraping beautifulsoup web-crawling headless-chrome apify playwright

Updated Jul 26, 2024
Python

hyunwoongko / kochat

Open-source Korean chatbot framework

deep-learning web-crawler chatbot korean deeplearning sentence-classification korean-chatbot sequance-tagging

Updated May 22, 2023
Python

microfisher / Strong-Web-Crawler

An advanced web crawler built on C#/.NET, PhantomJS, and Selenium. It can execute JavaScript code, trigger various events, and manipulate the page's DOM structure.

crawler phantomjs web-crawler sellenium

Updated Oct 25, 2019
C#

USCDataScience / sparkler

Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.

search search-engine distributed-systems information-retrieval big-data spark solr web-crawler nutch tika

Updated Mar 30, 2023
Java

VIDA-NYU / ache

ACHE is a web crawler for domain-specific search.

web-crawler web-scraping hacktoberfest web-spider focused-crawler domain-specific-search web-search

Updated Aug 24, 2023
Java

Algebra-FUN / WeReadScan

A crawler that scans purchased books on WeRead (微信读书) and downloads them as local PDFs.

web-crawler selenium weread book-downloader

Updated Sep 19, 2023
Python

lucasxlu / LagouJob

Data Analysis & Mining for lagou.com

nlp machine-learning data-mining web-crawler python3 data-analysis lagou

Updated Apr 19, 2019
Python

platonai / PulsarRPA

Automate webpages at scale, scrape web data completely and accurately with high performance, distributed RPA.

crawler data-science data-mining scraper web-crawler scraping web-scraping web-mining web-automation rpa web-sql

Updated Jul 25, 2024
Kotlin

postmodern / spidr

A versatile Ruby web spidering library that can spider a site, multiple domains, certain links or infinitely. Spidr is designed to be fast and easy to use.

ruby crawler scraper web spider web-crawler web-scraper web-scraping web-spider spider-links

Updated Jan 25, 2024
Ruby

elliotxx / zhihu-crawler-people

A simple distributed crawler for Zhihu, with data analysis

python crawler spider web-crawler python-crawler web-spider

Updated Dec 7, 2022
Python

web-crawler

Here are 893 public repositories matching this topic...
