Add initial project structure with README, Cargo configuration, and main logic
- Create .gitignore to exclude target and output directories
- Initialize Cargo.toml with project metadata and dependencies
- Add README.md with project description, features, installation, and usage instructions
- Implement main.rs for extracting blog posts from HTML files and exporting to JSON and text formats
90
README.md
Normal file
@@ -0,0 +1,90 @@
# Blog Extractor

A Rust tool that extracts blog posts from HTML files and exports them in two formats (JSON and plain text).

I just needed a simple script that would extract data from blogger.com.

## Features

- **HTML Parsing**: Extracts blog post data (title, date, content) from HTML files using CSS selectors
- **JSON Export**: Saves each blog post as an individual JSON file for easy programmatic access
- **Combined Text Export**: Generates a single text file containing all posts in a human-readable format
- **Batch Processing**: Processes all HTML files in a directory in one run
- **Error Handling**: Gracefully handles processing errors without stopping the entire operation

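The batch-processing and error-handling behavior listed above can be sketched with the standard library alone. This is a minimal illustration, not the code in `main.rs`: `BatchReport` and `process_all` are hypothetical names, and the per-file work is passed in as a closure so the sketch stays independent of the real extraction logic.

```rust
use std::path::{Path, PathBuf};

/// Outcome of a batch run: extracted values plus the files that failed.
/// (Hypothetical type, introduced only for this sketch.)
pub struct BatchReport<T> {
    pub succeeded: Vec<T>,
    pub failed: Vec<(PathBuf, String)>,
}

/// Apply `process` to every path. A failure is recorded and logged,
/// and the loop moves on to the next file instead of aborting.
pub fn process_all<T, F>(paths: &[PathBuf], mut process: F) -> BatchReport<T>
where
    F: FnMut(&Path) -> Result<T, String>,
{
    let mut report = BatchReport {
        succeeded: Vec::new(),
        failed: Vec::new(),
    };
    for path in paths {
        match process(path) {
            Ok(value) => report.succeeded.push(value),
            Err(reason) => {
                eprintln!("Skipping {}: {}", path.display(), reason);
                report.failed.push((path.clone(), reason));
            }
        }
    }
    report
}
```

Returning the failures alongside the successes lets the caller print a summary at the end of the run, which is one plausible reading of "gracefully handles processing errors".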
## How It Works

The tool scans a specified directory for HTML files and extracts:

- **Title**: From the `h3.post-title` element
- **Date**: From the `time.published` element
- **Content**: From the `.post-body.entry-content` element
- **Source File**: The original filename for reference

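The directory scan that feeds the extraction step might look roughly like the sketch below. It uses only the standard library; `find_html_files` is a hypothetical helper, and it assumes that files are matched by their `.html` extension (case-insensitively) and that subdirectories are not searched. The README does not say how `main.rs` handles either case.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Return the `.html` files directly inside `dir`, sorted by path.
/// Assumption: matching is by extension only and is not recursive.
pub fn find_html_files(dir: &Path) -> io::Result<Vec<PathBuf>> {
    let mut files = Vec::new();
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        let is_html = path
            .extension()
            .and_then(|ext| ext.to_str())
            .map_or(false, |ext| ext.eq_ignore_ascii_case("html"));
        if path.is_file() && is_html {
            files.push(path);
        }
    }
    // read_dir makes no ordering guarantee, so sort for stable output.
    files.sort();
    Ok(files)
}
```

Sorting the result keeps the order of posts in the combined output stable between runs.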
Each post is saved in two formats:

1. Individual JSON files in `output/json/` with the same filename as the source HTML
2. A combined text file at `output/all_posts_combined.txt` containing all posts

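One way to derive the per-post JSON path from a source file is sketched below. `json_output_path` and the two constants are hypothetical names. The sketch assumes that "the same filename" means the same base name with the `.html` extension swapped for `.json`; the actual tool may name the files differently.

```rust
use std::path::{Path, PathBuf};

/// Output locations named in this README.
pub const JSON_DIR: &str = "output/json";
pub const COMBINED_FILE: &str = "output/all_posts_combined.txt";

/// Map a source HTML path to its JSON output path.
/// Assumption: only the final extension is replaced, so
/// `my.post.html` becomes `output/json/my.post.json`.
/// Returns `None` when the path has no file name component.
pub fn json_output_path(source: &Path) -> Option<PathBuf> {
    let file_name = source.file_name()?;
    Some(Path::new(JSON_DIR).join(file_name).with_extension("json"))
}
```

For example, `json_output_path(Path::new("blog_html_files/first-post.html"))` yields `output/json/first-post.json`.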
## Installation

Make sure you have Rust installed. If not, visit [rustup.rs](https://rustup.rs/).

Clone the repository:

```bash
git clone https://git.gabrielkaszewski.dev/GKaszewski/blog-extractor.git
cd blog-extractor
```

## Usage

Run the tool with a directory path containing HTML files:

```bash
cargo run -- <directory_path>
```

Example:

```bash
cargo run -- ./blog_html_files
```

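Reading the single `<directory_path>` argument could be done along these lines. This is a sketch, not the argument handling in `main.rs`: `parse_directory_arg` is a hypothetical helper, and the usage message and the rejection of extra arguments are assumptions.

```rust
use std::path::PathBuf;

/// Parse the command line: program name followed by exactly one directory.
/// Assumption: any other argument count is reported as a usage error.
pub fn parse_directory_arg<I>(mut args: I) -> Result<PathBuf, String>
where
    I: Iterator<Item = String>,
{
    // The first item is the program name, as with `std::env::args()`.
    let program = args.next().unwrap_or_else(|| "blog-extractor".to_string());
    match (args.next(), args.next()) {
        (Some(dir), None) => Ok(PathBuf::from(dir)),
        _ => Err(format!("Usage: {} <directory_path>", program)),
    }
}
```

In a real `main`, the call would be `parse_directory_arg(std::env::args())`, with the `Err` message printed to stderr before exiting.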
### Output

After running, you'll find:

- `output/json/` - Individual JSON files for each blog post
- `output/all_posts_combined.txt` - All posts combined in text format

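To make the combined text output concrete, here is a sketch of how posts might be rendered into a single string. The `Post` struct mirrors the four fields this README lists, but its name, the `Title:`/`Date:`/`Source:` labels, and the dashed separator are all assumptions; the layout that `main.rs` actually writes is not documented here.

```rust
/// One extracted blog post (hypothetical struct mirroring the README's fields).
#[derive(Debug, Clone, PartialEq)]
pub struct Post {
    pub title: String,
    pub date: String,
    pub content: String,
    pub source_file: String,
}

/// Render all posts into one human-readable string.
/// Assumption: a labeled header, the trimmed body, then a dashed separator.
pub fn render_combined(posts: &[Post]) -> String {
    let separator = "-".repeat(40);
    let mut out = String::new();
    for post in posts {
        out.push_str(&format!(
            "Title: {}\nDate: {}\nSource: {}\n\n{}\n{}\n\n",
            post.title.trim(),
            post.date.trim(),
            post.source_file,
            post.content.trim(),
            separator
        ));
    }
    out
}
```

The returned string could then be written to `output/all_posts_combined.txt` with `std::fs::write`.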
## Dependencies

- **scraper** - HTML parsing and CSS selector support
- **serde** - Serialization framework
- **serde_json** - JSON serialization

## Requirements

The HTML files should contain:

- A title in an `<h3 class="post-title">` element
- A publish date in a `<time class="published">` element
- Post content in a `<div class="post-body entry-content">` element

## Project Structure

```
blog-extractor/
├── src/
│   └── main.rs                  # Main application logic
├── output/
│   ├── json/                    # Individual JSON files
│   └── all_posts_combined.txt   # Combined text file
├── Cargo.toml                   # Project configuration
└── README.md                    # This file
```

## License

This project is open source and available under the MIT License.