You said:
Any github repository that extracts, modifies and save text based on an HTML file?
Yes — there are several open‑source GitHub repositories and libraries that can extract text from HTML, modify it, and save or output it. Here are some good options depending on language and use case:
HTML to Text Converter (html2text on GitHub) – A popular Python script that converts HTML into clean text (often Markdown). You can programmatically load HTML, extract text, modify it, and then save the result.
TeamHG-Memex/html-text – A Python library to extract text from HTML content; easy to integrate.
Beautiful Soup (not a single repo, but widely used) – Not a specific repository itself, but a GitHub search turns up many example scripts that use Beautiful Soup to parse HTML, extract text, and overwrite the files after modification.
jhy/jsoup – A very well‑known Java HTML parser. You can parse a file, traverse/modify DOM elements, extract text, and write back the modified HTML.
patrickbucher/htmlsqueeze – Extracts text from HTML DOM structures in Go.
k3a/html2text – A lightweight Go module to convert HTML → plain text programmatically.
kyr0/html-text-extractor – A minimal Node library that finds text sections from HTML files in folders (can be extended for modification/processing).
While not full "extract and modify" applications, some other useful libraries you can build on:
HTMLAdapt (Python) – More advanced extraction + merge workflow (extracts text, merges back edited text while preserving structure).
Popular HTML text-extraction libraries such as html2text and Inscriptis (Python) make it easy to get the text out; you can combine them with your own scripts to modify and save the HTML.
Here’s how most of these typically work:
1. Load/Parse HTML – use a DOM parser (e.g., BeautifulSoup, jsoup).
2. Extract Text – traverse nodes and extract text strings or innerText.
3. Modify Content – replace text nodes, append/change tags, etc.
4. Write Back – output the modified HTML to a file.
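The extraction step can be sketched with nothing but the standard library. This minimal example (the `TextCollector` class name is ours, not from any of the repositories above) uses Python's built-in `html.parser` to collect text nodes:

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Collect non-empty text nodes while walking the HTML token stream."""
    def __init__(self):
        super().__init__()
        self.texts = []

    def handle_data(self, data):
        # handle_data fires once per run of text between tags
        if data.strip():
            self.texts.append(data.strip())

parser = TextCollector()
parser.feed("<p>Hello <b>world</b></p>")
print(parser.texts)  # -> ['Hello', 'world']
```

For the modify-and-write-back steps, a tree-based parser such as Beautiful Soup or jsoup is far more convenient than this streaming approach, since it lets you edit nodes in place and serialize the whole document back out.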
If you want, I can show a ready‑to‑use code example in Python/Java/JavaScript that does all that (extract → modify → save). Just tell me your preferred language!
Perfect! In Python, the most common approach is to use Beautiful Soup to parse HTML, modify text, and save it back. Here’s a full example workflow:
```python
from bs4 import BeautifulSoup

# 1️⃣ Load HTML from a file
with open("example.html", "r", encoding="utf-8") as file:
    html_content = file.read()

# 2️⃣ Parse HTML
soup = BeautifulSoup(html_content, "html.parser")

# 3️⃣ Extract all text (optional)
all_text = soup.get_text(separator="\n")
print("Original text:\n", all_text)

# 4️⃣ Modify specific elements
# Example: Replace all <p> text
for p_tag in soup.find_all("p"):
    p_tag.string = p_tag.get_text().replace("old word", "new word")

# Example: Add a new paragraph
new_tag = soup.new_tag("p")
new_tag.string = "This is a newly added paragraph!"
soup.body.append(new_tag)

# 5️⃣ Save modified HTML back to a file
with open("modified_example.html", "w", encoding="utf-8") as file:
    file.write(str(soup))

print("✅ HTML updated and saved as 'modified_example.html'")
```
BeautifulSoup parses your HTML so you can access elements like <p>, <div>, etc.
get_text() extracts all text, optionally separated by newlines.
find_all() lets you loop through tags to modify text.
new_tag() + append() lets you insert new content into the HTML tree.
Finally, you write the modified HTML back to a file.
If you want, I can also show a version that automatically modifies all text content in the HTML (not just <p> tags) in one sweep and saves it—perfect for bulk editing large HTML files.
Do you want me to do that?
Got it! You want to modify only the text content while keeping all HTML tags and structure intact. That’s exactly what Beautiful Soup’s NavigableString approach is for. Here’s a Python example that carefully updates text without touching other tags or attributes:
```python
from bs4 import BeautifulSoup, NavigableString

# 1️⃣ Load HTML
with open("example.html", "r", encoding="utf-8") as file:
    soup = BeautifulSoup(file, "html.parser")

# 2️⃣ Function to recursively update text nodes
def modify_text(node, replace_func):
    """
    Recursively modify text inside a node.
    node: BeautifulSoup element
    replace_func: function to modify a string, e.g., lambda s: s.replace("old", "new")
    """
    # Snapshot the children first: replacing nodes while iterating the
    # live generator can skip siblings.
    for child in list(node.children):
        if isinstance(child, NavigableString):
            new_text = replace_func(str(child))
            child.replace_with(new_text)
        elif child.name is not None:
            modify_text(child, replace_func)

# 3️⃣ Define your text modification function
def replace_words(text):
    # Example: replace "old word" with "new word"
    return text.replace("old word", "new word")

# 4️⃣ Apply modifications
modify_text(soup, replace_words)

# 5️⃣ Save modified HTML
with open("modified_example.html", "w", encoding="utf-8") as file:
    file.write(str(soup))

print("✅ Text modified while preserving HTML structure.")
```
Only text nodes (NavigableString) are modified.
All HTML tags, attributes, and hierarchy remain untouched.
Works recursively, so <div>, <span>, <p>, <a> etc. are preserved exactly where they are.
Flexible: replace_words can implement any text transformation (regex, translation, capitalization, etc.).
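As a concrete example of such a transformation, here is a regex-based function (the function name and patterns are illustrative, not from the script above) that could be passed as `replace_func`:

```python
import re

def regex_replace(text):
    """Example replace_func: normalize whitespace, then swap a term."""
    # Collapse runs of whitespace (including newlines) into single spaces
    text = re.sub(r"\s+", " ", text)
    # Replace a hypothetical product name, matching whole words only
    return re.sub(r"\bAcmeCorp\b", "ExampleCo", text)

print(regex_replace("Welcome   to\nAcmeCorp!"))  # -> "Welcome to ExampleCo!"
```

Any callable that takes a string and returns a string works, so the same hook can drive regexes, casing rules, or calls out to an external service.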
If you want, I can write an even smarter version that handles text inside attributes like alt or title while still keeping HTML fully intact. This is useful for translating or modifying all visible text in a webpage.
Do you want me to do that next?
Ah! Now it’s crystal clear — you want to translate text inside certain tags only (like <p> or <h1>), while keeping all HTML tags and structure intact, without touching other content. We can do this by targeting specific tags with Beautiful Soup and only modifying their text content.
Here’s a robust Python example:
```python
from bs4 import BeautifulSoup, NavigableString

# 1️⃣ Load HTML
with open("example.html", "r", encoding="utf-8") as file:
    soup = BeautifulSoup(file, "html.parser")

# 2️⃣ Define which tags you want to translate
tags_to_translate = ["p", "h1", "h2", "span"]  # example: translate <p>, <h1>, <h2>, <span>

# 3️⃣ Dummy translation function (replace with actual translation logic)
def translate_text(text):
    # Example: just uppercase text to simulate translation
    return text.upper()

# 4️⃣ Function to translate text inside target tags
def translate_tags(soup, target_tags, translate_func):
    for tag_name in target_tags:
        for tag in soup.find_all(tag_name):
            # Only modify direct text nodes inside the tag
            # (snapshot first, since we replace while iterating)
            for child in list(tag.children):
                if isinstance(child, NavigableString):
                    translated = translate_func(str(child))
                    child.replace_with(translated)

# 5️⃣ Apply translation
translate_tags(soup, tags_to_translate, translate_text)

# 6️⃣ Save modified HTML
with open("translated_example.html", "w", encoding="utf-8") as file:
    file.write(str(soup))

print("✅ Selected tags translated while preserving HTML structure.")
```
Selective Tag Translation – Only the tags in tags_to_translate are modified. All other tags remain untouched.
Text-Only Modification – It modifies only the text inside the tag, not attributes or inner HTML structure.
HTML Structure Preserved – Nested tags, attributes, and the overall HTML hierarchy are untouched.
Flexible Translation Function – translate_text can be replaced with:
Google Translate API calls
DeepL API
Any custom NLP/text processing function
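As a stand-in for any of those services, `translate_text` can be as simple as a glossary lookup. The glossary entries below are made up for illustration:

```python
# Hypothetical English→German glossary; replace with a real API call
GLOSSARY = {"Hello": "Hallo", "world": "Welt"}

def translate_text(text):
    # Translate word by word, leaving unknown words unchanged
    return " ".join(GLOSSARY.get(word, word) for word in text.split())

print(translate_text("Hello world"))  # -> "Hallo Welt"
```

Because the rest of the script only calls `translate_text(text)`, you can swap this stub for a real API client later without touching the tree-walking code.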
If you want, I can write an enhanced version that also:
Handles nested tags inside the target tag (like <p>Hello <b>world</b></p>)
Keeps inner tags intact while only translating visible text
This is the approach professional translators often use when localizing HTML pages.
Do you want me to create that advanced version?
Ah! Now I fully understand your workflow:
You want to extract all text from specific HTML tags into a file, so you can do translation externally and in context, then later re-insert the translated text into the original HTML, keeping all tags and structure intact.
Here’s a proper two-step workflow for that.
```python
from bs4 import BeautifulSoup

# Load HTML
with open("example.html", "r", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")

# Tags you want to translate
tags_to_translate = ["p", "h1", "h2", "span"]

# Collect text from target tags
extracted_texts = []
for tag_name in tags_to_translate:
    for tag in soup.find_all(tag_name):
        text = tag.get_text(strip=True)
        if text:  # ignore empty strings
            extracted_texts.append(text)

# Save all text lines to a file for external translation
with open("text_to_translate.txt", "w", encoding="utf-8") as f:
    for line in extracted_texts:
        f.write(line + "\n")

print(f"✅ Extracted {len(extracted_texts)} lines to 'text_to_translate.txt'.")
```
Each line corresponds to one piece of text from your target tags.
You can translate this file externally (manual translation, DeepL, Google Sheets, etc.) without touching HTML.
Assume you saved your translated text line by line in translated_text.txt in the same order as the original extraction:
```python
from bs4 import BeautifulSoup

# Load original HTML
with open("example.html", "r", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")

# Load translated lines
with open("translated_text.txt", "r", encoding="utf-8") as f:
    translated_lines = [line.strip() for line in f.readlines()]

# Tags to translate
tags_to_translate = ["p", "h1", "h2", "span"]

# Insert translated text back
idx = 0
for tag_name in tags_to_translate:
    for tag in soup.find_all(tag_name):
        text = tag.get_text(strip=True)
        if text and idx < len(translated_lines):
            # Replace only the text, keep nested tags intact
            for child in tag.children:
                if child.string and child.string.strip():
                    child.replace_with(translated_lines[idx])
                    idx += 1
                    break  # only replace first direct string in tag

# Save modified HTML
with open("translated_example.html", "w", encoding="utf-8") as f:
    f.write(str(soup))

print("✅ Translated text re-inserted into HTML.")
```
Original HTML structure is fully preserved (all tags, attributes, styles, and nested tags remain intact).
Text extraction is context-aware: you translate all content externally while seeing all sentences in order.
Flexible: you can adjust which tags you want to extract.
Line-by-line mapping ensures translations go back exactly where they belong.
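One practical safeguard for that line-by-line mapping: verify the two files still align before re-inserting, since an added or deleted line in the translated file would shift every subsequent translation into the wrong tag. A small helper (our own addition, not part of the scripts above) might look like:

```python
def check_alignment(extracted_path, translated_path):
    """Raise if the translated file no longer lines up 1:1 with the extraction."""
    with open(extracted_path, encoding="utf-8") as f:
        src = [line.strip() for line in f if line.strip()]
    with open(translated_path, encoding="utf-8") as f:
        dst = [line.strip() for line in f if line.strip()]
    if len(src) != len(dst):
        raise ValueError(
            f"Line count mismatch: {len(src)} extracted vs {len(dst)} translated"
        )
    return len(src)
```

Call it with the extraction file and the translated file right before the re-insertion loop; a clean return means the index-based mapping is safe to apply.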
If you want, I can provide a more robust version that also:
Handles nested tags properly (e.g., <p>Hello <b>world</b></p> → extracts "Hello " and "world" separately but keeps <b> intact when reinserting)
Ensures one-to-one mapping between extracted text and translated text without breaking nested tags
This is usually what production HTML localization pipelines use.
Do you want me to make that robust nested version?
Absolutely! You can combine extraction and re-insertion into a single workflow that:
Scans the HTML for the specific tags you care about.
Saves all text content into a file for external translation if it doesn’t already exist.
Reads the translated file and injects translations back into the same HTML, preserving all tags and structure.
This allows you to run the script multiple times: first to extract text, later to re-insert translations. Here’s a clean, safe implementation:
```python
from bs4 import BeautifulSoup
from pathlib import Path

# Configuration
html_file = "example.html"
translated_file = "translated_text.txt"
output_file = "translated_example.html"
tags_to_translate = ["p", "h1", "h2", "span"]

# Load original HTML
with open(html_file, "r", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")

# 1️⃣ Extract text if translated file doesn't exist
translated_path = Path(translated_file)
if not translated_path.exists():
    extracted_texts = []
    for tag_name in tags_to_translate:
        for tag in soup.find_all(tag_name):
            text = tag.get_text(strip=True)
            if text:
                extracted_texts.append(text)

    # Save for translation
    with open(translated_file, "w", encoding="utf-8") as f:
        for line in extracted_texts:
            f.write(line + "\n")

    print(f"✅ Extracted {len(extracted_texts)} lines to '{translated_file}'.")
    print("Translate this file externally and re-run the script to insert translations.")
else:
    # 2️⃣ Re-insert translated text
    with open(translated_file, "r", encoding="utf-8") as f:
        translated_lines = [line.strip() for line in f.readlines()]

    idx = 0
    for tag_name in tags_to_translate:
        for tag in soup.find_all(tag_name):
            text = tag.get_text(strip=True)
            if text and idx < len(translated_lines):
                # Replace only the first direct text node
                for child in tag.children:
                    if child.string and child.string.strip():
                        child.replace_with(translated_lines[idx])
                        idx += 1
                        break

    # Save updated HTML
    with open(output_file, "w", encoding="utf-8") as f:
        f.write(str(soup))

    print(f"✅ Translations inserted. Output saved to '{output_file}'.")
```
First Run:
- Checks if translated_text.txt exists.
- If not, extracts text from target tags and writes it for translation.
Second Run (or later):
- Reads the translated file line by line.
- Replaces text inside the same tags, leaving all HTML structure intact.
Safety & Flexibility:
- You can adjust tags_to_translate to any set of tags.
- Nested tags remain untouched as long as translation replaces only text nodes.
This setup is exactly what professional localization pipelines do: export for translation → translate externally → import translations, all without breaking the HTML.
I can also make an enhanced version that handles nested tags gracefully, so translations for inner <b> or <i> text also map correctly while keeping tags intact.
Do you want me to make that enhanced nested version?
Perfect! Here’s an enhanced version that handles nested tags properly.
Key improvements:
Every piece of text in the target tags, even inside nested tags, is extracted separately.
Translation mapping is one-to-one, so each extracted text has a corresponding translated line.
Original HTML tags, attributes, and nested structure are fully preserved.
Works in a single combined script: first run extracts texts, subsequent runs inject translations.
```python
from bs4 import BeautifulSoup, NavigableString
from pathlib import Path

# ----------------- Configuration -----------------
html_file = "example.html"
translated_file = "translated_text.txt"
output_file = "translated_example.html"
tags_to_translate = ["p", "h1", "h2", "span"]  # tags whose text you want to translate

# ----------------- Load HTML -----------------
with open(html_file, "r", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")

# ----------------- Helper: recursively extract text nodes -----------------
def extract_text_nodes(tag, texts_list):
    """Recursively extract all NavigableStrings under a tag."""
    for child in tag.descendants:
        if isinstance(child, NavigableString) and child.strip():
            texts_list.append(child.strip())

# ----------------- Helper: recursively insert translated text -----------------
def insert_translations(tag, translations_list):
    """Replace NavigableStrings under a tag with translated lines in order."""
    # Snapshot the text nodes first: replacing them while iterating the
    # live .descendants generator can skip or revisit nodes.
    text_nodes = [child for child in tag.descendants
                  if isinstance(child, NavigableString) and child.strip()]
    for child in text_nodes:
        if insert_translations.idx < len(translations_list):
            child.replace_with(translations_list[insert_translations.idx])
            insert_translations.idx += 1

# ----------------- Main Workflow -----------------
translated_path = Path(translated_file)
if not translated_path.exists():
    # Step 1: Extract all text nodes from target tags
    extracted_texts = []
    for tag_name in tags_to_translate:
        for tag in soup.find_all(tag_name):
            extract_text_nodes(tag, extracted_texts)

    # Save for translation
    with open(translated_file, "w", encoding="utf-8") as f:
        for line in extracted_texts:
            f.write(line + "\n")

    print(f"✅ Extracted {len(extracted_texts)} text pieces to '{translated_file}'.")
    print("Translate this file externally and re-run the script to insert translations.")
else:
    # Step 2: Insert translated text
    with open(translated_file, "r", encoding="utf-8") as f:
        translated_lines = [line.strip() for line in f.readlines()]

    # Insert translations
    insert_translations.idx = 0  # function attribute used as a running index
    for tag_name in tags_to_translate:
        for tag in soup.find_all(tag_name):
            insert_translations(tag, translated_lines)

    # Save modified HTML
    with open(output_file, "w", encoding="utf-8") as f:
        f.write(str(soup))

    print(f"✅ Translations inserted. Output saved to '{output_file}'.")
```
Nested tags handled – e.g., <p>Hello <b>world</b>!</p> → extracts Hello and world separately.
Exact mapping – each extracted string corresponds to a line in your translation file.
All HTML preserved – <b>, <i>, <a> tags remain intact.
Single workflow – extraction and reinsertion in one script.
Flexible – simply update tags_to_translate for the tags you care about.
This is the standard way professional HTML localization is done.
If you want, I can also show a version that outputs context info like tag path or line number in the translation file, which makes it easier to translate long, complex HTML pages with repeated text.
Do you want me to add that context-enhanced version?