Gabriel Kaszewski 56306a4852 Add initial project structure with README, Cargo configuration, and main logic
- Create .gitignore to exclude target and output directories
- Initialize Cargo.toml with project metadata and dependencies
- Add README.md with project description, features, installation, and usage instructions
- Implement main.rs for extracting blog posts from HTML files and exporting to JSON and text formats
2025-12-11 21:22:31 +01:00


# Blog Extractor
A Rust tool that extracts blog posts from HTML files and exports them in multiple formats (JSON and plain text).
I wrote it because I needed a simple script to extract data from blogger.com.
## Features
- **HTML Parsing**: Extracts blog post data (title, date, content) from HTML files using CSS selectors
- **JSON Export**: Saves each blog post as an individual JSON file for easy programmatic access
- **Combined Text Export**: Generates a single text file containing all posts in a human-readable format
- **Batch Processing**: Processes all HTML files in a directory in one run
- **Error Handling**: Gracefully handles processing errors without stopping the entire operation
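
To illustrate the batch-processing and error-handling features, here is a minimal sketch of a loop that keeps going when a single file fails. It is not the code from `main.rs`: `BatchReport` and `process_all` are hypothetical names, and the `process` closure stands in for whatever per-file extraction the tool actually performs.

```rust
use std::path::{Path, PathBuf};

/// Outcome of a batch run: files that succeeded, and files that failed
/// together with their error message. (Hypothetical type for illustration.)
#[derive(Debug, Default)]
pub struct BatchReport {
    pub succeeded: Vec<PathBuf>,
    pub failed: Vec<(PathBuf, String)>,
}

/// Runs `process` on every path and continues when one of them fails.
/// `process` is a placeholder for the per-file extraction step.
pub fn process_all<F>(paths: &[PathBuf], mut process: F) -> BatchReport
where
    F: FnMut(&Path) -> Result<(), String>,
{
    let mut report = BatchReport::default();
    for path in paths {
        match process(path) {
            Ok(()) => report.succeeded.push(path.clone()),
            Err(err) => {
                // Record the failure and move on to the next file.
                eprintln!("Skipping {}: {}", path.display(), err);
                report.failed.push((path.clone(), err));
            }
        }
    }
    report
}
```

Collecting failures in a report instead of returning early means one malformed HTML file does not prevent the remaining posts from being exported.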
## How It Works
The tool scans a specified directory for HTML files and extracts:
- **Title**: From the `h3.post-title` element
- **Date**: From the `time.published` element
- **Content**: From the `.post-body.entry-content` element
- **Source File**: The original filename for reference
Each post is written to two outputs:
1. An individual JSON file in `output/json/`, named after the source HTML file
2. A combined text file at `output/all_posts_combined.txt` that contains all posts
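
The two outputs described above could be produced along these lines. This is a sketch under assumptions, not the actual implementation: the `BlogPost` struct mirrors the four fields listed in this section, while the `.json` extension, the text layout, and the separator line are guesses, and the helper names are hypothetical.

```rust
use std::path::{Path, PathBuf};

/// The four fields this README lists for each extracted post.
#[derive(Debug, Clone)]
pub struct BlogPost {
    pub title: String,
    pub date: String,
    pub content: String,
    pub source_file: String,
}

/// Maps a source file such as `posts/hello.html` to `output/json/hello.json`.
/// The `.json` extension is an assumption. Returns `None` if the path has
/// no file name.
pub fn json_output_path(source: &Path) -> Option<PathBuf> {
    let mut name = source.file_stem()?.to_os_string();
    name.push(".json");
    Some(Path::new("output").join("json").join(name))
}

/// Renders one post as human-readable text (layout is an assumption).
pub fn format_post(post: &BlogPost) -> String {
    format!(
        "Title: {}\nDate: {}\nSource: {}\n\n{}\n",
        post.title, post.date, post.source_file, post.content
    )
}

/// Joins all posts into the body of the combined text file,
/// separated by a line of `=` characters (separator is an assumption).
pub fn combine_posts(posts: &[BlogPost]) -> String {
    let separator = format!("{}\n", "=".repeat(40));
    posts
        .iter()
        .map(format_post)
        .collect::<Vec<_>>()
        .join(&separator)
}
```

Building the file name from the stem plus `.json`, instead of swapping the extension on the joined path, keeps names such as `my.post.html` intact (`my.post.json`).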
## Installation
Make sure you have Rust installed. If not, visit [rustup.rs](https://rustup.rs/).
Clone the repository:
```bash
git clone https://git.gabrielkaszewski.dev/GKaszewski/blog-extractor.git
cd blog-extractor
```
## Usage
Run the tool with the path to a directory that contains HTML files:
```bash
cargo run -- <directory_path>
```
Example:
```bash
cargo run -- ./blog_html_files
```
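
For context, a command-line tool like this typically reads the directory from its first argument and then lists the HTML files inside it. The sketch below shows one way to do that with the standard library only. The function names are hypothetical, and matching on a case-insensitive `.html` extension without recursing into subdirectories is an assumption about the tool's behaviour.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Reads the directory path from the first CLI argument.
/// `args` is expected to start with the program name, like `std::env::args()`.
pub fn input_dir_from_args<I>(mut args: I) -> Result<PathBuf, String>
where
    I: Iterator<Item = String>,
{
    args.nth(1)
        .map(PathBuf::from)
        .ok_or_else(|| "usage: blog-extractor <directory_path>".to_string())
}

/// Lists the files directly inside `dir` that have an `.html` extension,
/// sorted by path. Subdirectories are not searched.
pub fn html_files_in(dir: &Path) -> io::Result<Vec<PathBuf>> {
    let mut files = Vec::new();
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        let is_html = path
            .extension()
            .and_then(|ext| ext.to_str())
            .map_or(false, |ext| ext.eq_ignore_ascii_case("html"));
        if path.is_file() && is_html {
            files.push(path);
        }
    }
    files.sort();
    Ok(files)
}
```

In a `main` function, `input_dir_from_args(std::env::args())` would supply the directory passed after `cargo run --`.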
### Output
After running, you'll find:
- `output/json/` - Individual JSON files for each blog post
- `output/all_posts_combined.txt` - All posts combined in text format
## Dependencies
- **scraper** - HTML parsing and CSS selector support
- **serde** - Serialization framework
- **serde_json** - JSON serialization
## Requirements
Each HTML file should contain:
- A title in an `<h3 class="post-title">` element
- A publish date in a `<time class="published">` element
- Post content in a `<div class="post-body entry-content">` element
## Project Structure
```
blog-extractor/
├── src/
│   └── main.rs                  # Main application logic
├── output/
│   ├── json/                    # Individual JSON files
│   └── all_posts_combined.txt   # Combined text file
├── Cargo.toml                   # Project configuration
└── README.md                    # This file
```
## License
This project is open source and available under the MIT License.