This repository contains a Python script that runs daily to crawl a list of Flow-related sites, GitHub repositories, and GitHub discussions and convert them into Markdown files. The resulting `.md` files are intended for AI ingestion, Retrieval-Augmented Generation (RAG) pipelines, chatbots, or any other knowledge-base platform that benefits from structured text.
We want a single repository that periodically crawls all relevant Flow ecosystem content—documentation, code examples, and community discussions—and stores it in a consolidated Markdown format. You can then feed these files into:
- ChatGPT plugins (for enhanced Q&A)
- Retrieval-Augmented Generation (indexing and searching them in a vector database)
- Discord/Telegram bots that cite official doc sections
- Any other knowledge base for advanced Q&A or search.
The Python script performs a domain-limited BFS (Breadth-First Search) crawl and applies specialized scraping logic depending on each URL type:
- Non-GitHub URLs are treated as “normal” websites.
  - The script fetches each page and removes `<script>`, `<style>`, and `<noscript>` tags.
  - Then it uses `markdownify` to convert the remaining HTML into Markdown.
  - It recurses only within the same domain to avoid crawling unrelated pages (a sketch of this logic follows below).
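As a rough illustration of that flow, here is a minimal, hedged sketch of a same-domain BFS crawler. The function names (`page_to_markdown`, `crawl_site`) are illustrative, not the actual names in `scraper.py`, and the real script presumably does more filtering (content types, rate limiting, etc.):

```python
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup
from markdownify import markdownify


def page_to_markdown(html):
    """Strip <script>, <style>, and <noscript> tags, then convert the rest to Markdown."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()
    return markdownify(str(soup))


def crawl_site(start_url):
    """Domain-limited BFS: only follow links whose host matches start_url."""
    domain = urlparse(start_url).netloc
    queue, seen, pages = [start_url], {start_url}, {}
    while queue:
        url = queue.pop(0)
        try:
            resp = requests.get(url, timeout=30)
        except requests.RequestException:
            continue
        if resp.status_code != 200:
            continue
        pages[url] = page_to_markdown(resp.text)
        soup = BeautifulSoup(resp.text, "html.parser")
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"]).split("#")[0]
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```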
- For GitHub repo links like `https://github.com/onflow/flow-ft/`, the script visits:
  - The repo root
  - `tree/(main|master)/...` subdirectories
  - `blob/(main|master)/...` file pages
- Files with certain extensions (like `.cdc`, `.md`, `.json`, etc.) or any `README` are downloaded in their raw form from `raw.githubusercontent.com` (see the sketch after this list).
- The file contents are saved in a `.md` file, wrapped in triple backticks for easy code parsing.
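The raw download boils down to a URL rewrite. A hedged sketch, assuming illustrative helper names (`blob_to_raw`, `save_file_as_markdown`) rather than the actual ones in `scraper.py`:

```python
import requests


def blob_to_raw(blob_url):
    """Rewrite a GitHub blob URL to its raw-content equivalent, e.g.
    https://github.com/onflow/flow-ft/blob/main/contracts/ExampleToken.cdc
    -> https://raw.githubusercontent.com/onflow/flow-ft/main/contracts/ExampleToken.cdc
    """
    return (blob_url
            .replace("https://github.com/", "https://raw.githubusercontent.com/", 1)
            .replace("/blob/", "/", 1))


def save_file_as_markdown(blob_url):
    """Download the raw file and wrap its contents in a fenced code block."""
    raw = requests.get(blob_to_raw(blob_url), timeout=30).text
    fence = "`" * 3
    return f"# {blob_url}\n\n{fence}\n{raw}\n{fence}\n"
```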
- For `https://github.com/orgs/onflow/discussions`, the script:
  - Crawls the listing pages and discovers discussion links like `/orgs/onflow/discussions/1330`
  - For each thread, extracts only the text from user posts (skipping the GitHub UI) and converts it to Markdown (a sketch follows below).
- This yields `.md` files containing the original question and its comments/replies.
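The post extraction plausibly looks something like the following. Note that the `comment-body` CSS class is an assumption about GitHub's current markup, not something confirmed by this repo, and may need adjusting if GitHub changes its HTML:

```python
import requests
from bs4 import BeautifulSoup
from markdownify import markdownify


def discussion_to_markdown(thread_url):
    """Keep only user-post bodies from a discussion thread, dropping GitHub UI chrome."""
    html = requests.get(thread_url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    # "comment-body" is an assumed selector for rendered post bodies on GitHub.
    posts = soup.find_all(class_="comment-body")
    return "\n\n---\n\n".join(markdownify(str(p)) for p in posts)
```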
## Requirements

- Python 3.7+
- `requests`, `beautifulsoup4`, `markdownify`
Install all dependencies:
```bash
pip install requests beautifulsoup4 markdownify
```
Clone or download this repo locally.
In the repo directory, run:
```bash
python scraper.py
```
The script will crawl each site listed in `SITES` (inside `scraper.py`) and output the results under `scraped_docs/`.
## Modifying the List of Sites

Inside `scraper.py`, near the top, you’ll see:
```python
SITES = [
    "https://developers.flow.com/",
    "https://academy.ecdao.org/en/cadence-by-example",
    ...
    "https://github.com/onflow/flow-ft/",
    ...
    "https://github.com/orgs/onflow/discussions",
]
```
- Add a docs site by appending its URL if it’s not on GitHub.
- Add a GitHub repo by appending the base URL (e.g. `"https://github.com/onflow/another-repo"`).
- Add another GitHub Discussions page if needed.
- Remove any site by deleting or commenting out its line.

For private sites or repos, you may need authentication tokens/cookies to see content that’s not public.
## Output Structure

After a successful run, you’ll see:
```
scraped_docs/
├─ developers_flow_com/
│  ├─ index.md
│  ├─ docs_tutorial_somepage.md
│  └─ ...
├─ github_com_onflow_flow_ft/
│  ├─ blob_main_contracts_exampletoken_cdc.md
│  ├─ ...
├─ github_com_orgs_onflow_discussions/
│  ├─ discussion_1330.md
│  ├─ discussion_1514.md
│  └─ ...
└─ ...
```
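The directory and file names are derived from the source URLs. A hedged sketch of the kind of slugging involved (the exact rules live in `scraper.py`; `site_dir` and `page_file` are illustrative names, inferred from the tree above):

```python
import re
from urllib.parse import urlparse


def _slug(text):
    """Lowercase and replace every non-alphanumeric run with an underscore."""
    return re.sub(r"[^A-Za-z0-9]+", "_", text).strip("_").lower()


def site_dir(url):
    """e.g. https://developers.flow.com/ -> developers_flow_com"""
    return _slug(urlparse(url).netloc)


def page_file(url):
    """e.g. https://developers.flow.com/docs/tutorial/somepage -> docs_tutorial_somepage.md"""
    path = urlparse(url).path.strip("/")
    return (_slug(path) or "index") + ".md"
```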
- Docs directories for each site
- Repos with code files in `.md` (wrapped code blocks)
- Discussions as `discussion_*.md`, each containing Q&A text