# Blog Extractor

A Rust tool that extracts blog posts from HTML files and exports them in multiple formats (JSON and plain text). I just needed a simple script that would extract data from blogger.com.

## Features

- **HTML Parsing**: Extracts blog post data (title, date, content) from HTML files using CSS selectors
- **JSON Export**: Saves each blog post as an individual JSON file for easy programmatic access
- **Combined Text Export**: Generates a single text file containing all posts in a human-readable format
- **Batch Processing**: Processes all HTML files in a directory in one run
- **Error Handling**: Gracefully handles processing errors without stopping the entire operation

## How It Works

The tool scans a specified directory for HTML files and extracts:

- **Title**: From the `h3.post-title` element
- **Date**: From the `time.published` element
- **Content**: From the `.post-body.entry-content` element
- **Source File**: The original filename for reference

Each post is saved in two formats:

1. Individual JSON files in `output/json/` with the same filename as the source HTML
2. A combined text file at `output/all_posts_combined.txt` containing all posts

## Installation

Make sure you have Rust installed. If not, visit [rustup.rs](https://rustup.rs/).

Clone the repository:

```bash
git clone https://git.gabrielkaszewski.dev/GKaszewski/blog-extractor.git
cd blog-extractor
```

## Usage

Run the tool with a directory path containing HTML files:

```bash
cargo run -- <directory_path>
```

Example:

```bash
cargo run -- ./blog_html_files
```

### Output

After running, you'll find:

- `output/json/` - Individual JSON files for each blog post
- `output/all_posts_combined.txt` - All posts combined in text format

## Dependencies

- **scraper** - HTML parsing and CSS selector support
- **serde** - Serialization framework
- **serde_json** - JSON serialization

## Requirements

The HTML files should contain:

- A title in an `<h3 class="post-title">` element
- A publish date in a `