
# Blog Extractor

A Rust tool that extracts blog posts from HTML files and exports them in two formats: JSON and plain text.

I just needed a simple script to extract data from blogger.com.

## Features

- **HTML Parsing**: Extracts blog post data (title, date, content) from HTML files using CSS selectors
- **JSON Export**: Saves each blog post as an individual JSON file for easy programmatic access
- **Combined Text Export**: Generates a single text file containing all posts in a human-readable format
- **Batch Processing**: Processes all HTML files in a directory in one run
- **Error Handling**: Gracefully handles processing errors without stopping the entire operation
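
As a rough illustration of the batch processing and error handling listed above, here is a minimal sketch using only the standard library. It is not the code in `src/main.rs`: `BatchReport`, `process_batch`, and the `process_one` callback are hypothetical names, and the callback stands in for whatever per-file extraction the tool performs.

```rust
use std::path::{Path, PathBuf};

/// Outcome of a batch run: files that succeeded and files that failed (with the reason).
#[derive(Debug, Default)]
struct BatchReport {
    processed: Vec<PathBuf>,
    failed: Vec<(PathBuf, String)>,
}

/// Runs `process_one` on every path and keeps going when one of them fails.
/// `process_one` is a hypothetical stand-in for the real per-file extraction.
fn process_batch<F>(paths: &[PathBuf], mut process_one: F) -> BatchReport
where
    F: FnMut(&Path) -> Result<(), String>,
{
    let mut report = BatchReport::default();
    for path in paths {
        match process_one(path) {
            Ok(()) => report.processed.push(path.clone()),
            Err(reason) => {
                // Report the failure and continue with the next file.
                eprintln!("Skipping {}: {}", path.display(), reason);
                report.failed.push((path.clone(), reason));
            }
        }
    }
    report
}
```

Collecting failures instead of returning on the first error means one malformed file does not stop the rest of the directory from being processed, and the caller can still print a summary at the end.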

## How It Works

The tool scans a specified directory for HTML files and extracts:

- **Title**: From the `h3.post-title` element
- **Date**: From the `time.published` element
- **Content**: From the `.post-body.entry-content` element
- **Source File**: The original filename for reference
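
The directory scan that precedes extraction could look roughly like the sketch below. It assumes a non-recursive scan that matches files by their `.html` extension (case-insensitive); the function name `html_files_in` is hypothetical, and the actual tool may select files differently.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Returns the `.html` files directly inside `dir`, sorted by path.
/// Assumption: subdirectories are not searched and only the `.html`
/// extension counts as an HTML file.
fn html_files_in(dir: &Path) -> io::Result<Vec<PathBuf>> {
    let mut files = Vec::new();
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        let is_html = path
            .extension()
            .and_then(|ext| ext.to_str())
            .map_or(false, |ext| ext.eq_ignore_ascii_case("html"));
        if path.is_file() && is_html {
            files.push(path);
        }
    }
    // Sort so the processing order does not depend on the file system.
    files.sort();
    Ok(files)
}
```

Sorting the result is a design choice for this sketch: `fs::read_dir` makes no ordering guarantee, so sorting keeps the order of posts in any combined output stable between runs.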

Each post is saved in two formats:

1. Individual JSON files in `output/json/` with the same filename as the source HTML
2. A combined text file at `output/all_posts_combined.txt` containing all posts
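
The two output formats above can be sketched as follows. This is an illustration under stated assumptions, not the tool's actual code: the `Post` field names, the `.json` naming (source name with its extension swapped), and the text layout with a dashed separator are all guesses made for the example.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical post record; the field names are assumptions for this sketch.
struct Post {
    title: String,
    date: String,
    content: String,
    source_file: String,
}

/// Assumed separator between posts in the combined text file.
const SEPARATOR: &str = "\n----------------------------------------\n";

/// Maps a source HTML file to its JSON output path, e.g.
/// `posts/hello.html` -> `output/json/hello.json` (assumed naming).
fn json_output_path(source: &Path) -> Option<PathBuf> {
    // Swap only the final extension, so `my.post.html` becomes `my.post.json`.
    let json_name = source.with_extension("json");
    let file_name = json_name.file_name()?;
    Some(Path::new("output").join("json").join(file_name))
}

/// Renders one post as human-readable text (assumed layout).
fn format_post(post: &Post) -> String {
    format!(
        "Title: {}\nDate: {}\nSource: {}\n\n{}\n",
        post.title, post.date, post.source_file, post.content
    )
}

/// Joins all posts into the body of `output/all_posts_combined.txt`.
fn combine_posts(posts: &[Post]) -> String {
    posts
        .iter()
        .map(format_post)
        .collect::<Vec<_>>()
        .join(SEPARATOR)
}
```

Keeping the path mapping and the text rendering as pure functions makes both easy to check without touching the file system; writing the results to disk would then be a thin layer on top.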

## Installation

Make sure you have Rust installed. If not, visit [rustup.rs](https://rustup.rs/).

Clone the repository:

```bash
git clone https://git.gabrielkaszewski.dev/GKaszewski/blog-extractor.git
cd blog-extractor
```

## Usage

Run the tool with a directory path containing HTML files:

```bash
cargo run -- <directory_path>
```

Example:

```bash
cargo run -- ./blog_html_files
```

### Output

After running, you'll find:

- `output/json/` - Individual JSON files for each blog post
- `output/all_posts_combined.txt` - All posts combined in text format

## Dependencies

- **scraper** - HTML parsing and CSS selector support
- **serde** - Serialization framework
- **serde_json** - JSON serialization

## Requirements

The HTML files should contain:

- A title in an `<h3 class="post-title">` element
- A publish date in a `<time class="published">` element
- Post content in a `<div class="post-body entry-content">` element

## Project Structure

```
blog-extractor/
├── src/
│   └── main.rs              # Main application logic
├── output/
│   ├── json/                # Individual JSON files
│   └── all_posts_combined.txt # Combined text file
├── Cargo.toml               # Project configuration
└── README.md                # This file
```

## License

This project is open source and available under the MIT License.