Gabriel Kaszewski 56306a4852 Add initial project structure with README, Cargo configuration, and main logic
- Create .gitignore to exclude target and output directories
- Initialize Cargo.toml with project metadata and dependencies
- Add README.md with project description, features, installation, and usage instructions
- Implement main.rs for extracting blog posts from HTML files and exporting to JSON and text formats
2025-12-11 21:22:31 +01:00

Blog Extractor

A Rust tool that extracts blog posts from HTML files and exports them in multiple formats (JSON and plain text).

I just needed a simple script to extract data from blogger.com.

Features

  • HTML Parsing: Extracts blog post data (title, date, content) from HTML files using CSS selectors
  • JSON Export: Saves each blog post as an individual JSON file for easy programmatic access
  • Combined Text Export: Generates a single text file containing all posts in a human-readable format
  • Batch Processing: Processes all HTML files in a directory in one run
  • Error Handling: A file that fails to process does not stop the run; the remaining files are still processed
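
The batch processing and error handling features can be sketched as follows. This is a minimal illustration using only the standard library, not the code from main.rs: the helper names is_html_file and process_all are hypothetical, and selecting files by an .html extension is an assumption.

```rust
use std::path::{Path, PathBuf};

/// Returns true when the path ends in `.html` (case-insensitive).
/// Assumption: the real tool may select input files differently.
fn is_html_file(path: &Path) -> bool {
    path.extension()
        .and_then(|ext| ext.to_str())
        .map(|ext| ext.eq_ignore_ascii_case("html"))
        .unwrap_or(false)
}

/// Runs `process` on every HTML path and keeps going when one fails.
/// Returns the number of successes plus the collected failures.
fn process_all<F>(paths: &[PathBuf], mut process: F) -> (usize, Vec<(PathBuf, String)>)
where
    F: FnMut(&Path) -> Result<(), String>,
{
    let mut succeeded = 0;
    let mut failures = Vec::new();
    for path in paths.iter().filter(|p| is_html_file(p.as_path())) {
        match process(path.as_path()) {
            Ok(()) => succeeded += 1,
            // Record the failure and continue with the next file.
            Err(reason) => failures.push((path.clone(), reason)),
        }
    }
    (succeeded, failures)
}
```

Collecting failures instead of returning on the first error lets a caller print a summary once every file has been attempted.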

How It Works

The tool scans a specified directory for HTML files and extracts:

  • Title: From the h3.post-title element
  • Date: From the time.published element
  • Content: From the .post-body.entry-content element
  • Source File: The original filename for reference

Each post is saved in two formats:

  1. Individual JSON files in output/json/ with the same filename as the source HTML
  2. A combined text file at output/all_posts_combined.txt containing all posts
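
The two output formats could be produced along these lines. This is a sketch under stated assumptions, not the tool's actual code: json_output_path and combined_entry are hypothetical helpers, mapping post.html to post.json is one reading of "the same filename", and the layout of each entry in the combined text file is invented for illustration.

```rust
use std::path::{Path, PathBuf};

/// Maps a source HTML path to its JSON output path,
/// e.g. `posts/hello.html` -> `output/json/hello.json` (assumed naming).
fn json_output_path(source: &Path) -> PathBuf {
    // Fall back to a placeholder stem if the path has no usable file name.
    let stem = source
        .file_stem()
        .and_then(|s| s.to_str())
        .unwrap_or("post");
    Path::new("output").join("json").join(format!("{stem}.json"))
}

/// Formats one post for the combined text file.
/// The field labels and separator line are assumptions, not the tool's format.
fn combined_entry(title: &str, date: &str, content: &str, source_file: &str) -> String {
    let separator = "-".repeat(40);
    format!("Title: {title}\nDate: {date}\nSource: {source_file}\n\n{content}\n\n{separator}\n")
}
```

Appending the result of combined_entry for each post to output/all_posts_combined.txt would yield a single human-readable file.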

Installation

Make sure you have Rust installed. If not, visit rustup.rs.

Clone the repository:

git clone https://git.gabrielkaszewski.dev/GKaszewski/blog-extractor.git
cd blog-extractor

Usage

Run the tool with a directory path containing HTML files:

cargo run -- <directory_path>

Example:

cargo run -- ./blog_html_files
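
For context, a program invoked this way might read its single directory argument roughly as shown here. This is a hedged sketch: parse_dir_arg is a hypothetical helper, and both the usage message and the fallback program name are assumptions rather than what main.rs prints.

```rust
use std::path::PathBuf;

/// Extracts the directory path from the command-line arguments.
/// Expects exactly one argument after the program name.
fn parse_dir_arg<I>(mut args: I) -> Result<PathBuf, String>
where
    I: Iterator<Item = String>,
{
    // The first item is conventionally the program name.
    let program = args.next().unwrap_or_else(|| "blog-extractor".to_string());
    match (args.next(), args.next()) {
        (Some(dir), None) => Ok(PathBuf::from(dir)),
        _ => Err(format!("Usage: {program} <directory_path>")),
    }
}
```

In a real binary this would be called as parse_dir_arg(std::env::args()), with the error printed to stderr before exiting.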

Output

After running, you'll find:

  • output/json/ - Individual JSON files for each blog post
  • output/all_posts_combined.txt - All posts combined in text format

Dependencies

  • scraper - HTML parsing and CSS selector support
  • serde - Serialization framework
  • serde_json - JSON serialization

Requirements

The HTML files should contain:

  • A title in an <h3 class="post-title"> element
  • A publish date in a <time class="published"> element
  • Post content in a <div class="post-body entry-content"> element

Project Structure

blog-extractor/
├── src/
│   └── main.rs              # Main application logic
├── output/
│   ├── json/                # Individual JSON files
│   └── all_posts_combined.txt # Combined text file
├── Cargo.toml               # Project configuration
└── README.md                # This file

License

This project is open source and available under the MIT License.
