I want to extract all text from specific HTML tags into a file, so I can do translation externally and in context, then later re-insert the translated text into the original HTML, keeping all tags and structure intact. The intermediate file must be a plain text file in which extracted texts are separated by line feeds. In the final HTML, append the translated text after the original text for comparison. Can you show me how to do it in one standalone Python script?
Here is a complete, standalone Python script that handles both the extraction and the re-insertion (merge) processes.
It uses the BeautifulSoup library for robust HTML parsing. While this requires one external dependency, it is a widely used library for modifying HTML while keeping the structure intact.
Prerequisites
You will need to install beautifulsoup4 if you haven't already:

```bash
pip install beautifulsoup4
```
The Script (html_translation_helper.py)
```python
#!/usr/bin/env python3
"""
HTML Translation Helper Script

This script extracts text from specific HTML tags for translation and
merges the translated text back into the HTML structure.

Workflow:
1. EXTRACT:   python html_translation_helper.py extract input.html texts_to_translate.txt
2. TRANSLATE: Edit 'texts_to_translate.txt' externally (keep line order intact!).
3. MERGE:     python html_translation_helper.py merge input.html texts_to_translate.txt output.html

Dependencies:
    pip install beautifulsoup4
"""

import sys
import argparse
import os

try:
    from bs4 import BeautifulSoup
except ImportError:
    print("Error: The 'beautifulsoup4' library is required.")
    print("Please install it using: pip install beautifulsoup4")
    sys.exit(1)

# Default tags to extract from.
# Note: Avoid nested tags (e.g., extracting from both 'div' and 'p' inside it)
# to prevent duplicate text extraction.
DEFAULT_TAGS = "p,h1,h2,h3,h4,h5,h6,li,td,th,label,figcaption,dt,dd"


def parse_tags(tag_selector):
    """Split a comma-separated tag list, tolerating spaces such as "p, li"."""
    return [t.strip() for t in tag_selector.split(',') if t.strip()]


def get_soup(file_path):
    """Load HTML file into a BeautifulSoup object."""
    with open(file_path, 'r', encoding='utf-8') as f:
        content = f.read()
    # Using 'html.parser' is built-in. 'lxml' is faster but requires installation.
    return BeautifulSoup(content, 'html.parser')


def extract_text(html_file, output_txt, tag_selector):
    """
    Extracts text from specific tags and saves to a plain text file.
    Each extracted segment is a new line.
    """
    soup = get_soup(html_file)
    tags = parse_tags(tag_selector)

    # Find all matching tags in DOM order
    elements = soup.find_all(tags)

    extracted_lines = []
    for el in elements:
        # Collapse every run of whitespace (including line breaks inside the
        # element) into a single space, so each segment occupies exactly one
        # line and the line count stays in sync with the element count.
        text = " ".join(el.get_text().split())

        # We record the line even if it is empty, to maintain line count
        # synchronization with the merge step.
        extracted_lines.append(text)

    # Write to file
    with open(output_txt, 'w', encoding='utf-8') as f:
        for line in extracted_lines:
            f.write(line + '\n')

    print(f"[EXTRACT] Found {len(extracted_lines)} text segments.")
    print(f"[EXTRACT] Saved to: {output_txt}")


def merge_text(html_file, txt_file, output_html, tag_selector):
    """
    Reads translated lines and appends them to the original text
    within the HTML tags, preserving structure.
    """
    soup = get_soup(html_file)
    tags = parse_tags(tag_selector)

    # Read translated lines
    if not os.path.exists(txt_file):
        print(f"[ERROR] Translation file not found: {txt_file}")
        sys.exit(1)

    with open(txt_file, 'r', encoding='utf-8') as f:
        translated_lines = [line.rstrip('\n') for line in f]

    # Find all matching tags in DOM order (must match extraction order)
    elements = soup.find_all(tags)

    if len(elements) != len(translated_lines):
        print(f"[WARNING] Mismatch count! HTML tags: {len(elements)}, "
              f"Text lines: {len(translated_lines)}")
        print("[WARNING] Ensure the HTML structure hasn't changed since extraction.")
        # We continue anyway, but stop at the shorter of the two lists to avoid a crash

    count = min(len(elements), len(translated_lines))

    for i in range(count):
        el = elements[i]
        translation = translated_lines[i]

        # Skip if translation is empty (optional, but keeps HTML cleaner)
        if not translation.strip():
            continue

        # Per requirement, we simply append. Running merge twice on its own
        # output would append twice; we trust the user workflow here.

        # Build the span as a real tag and set the translation as its text,
        # so characters like '<' and '&' in the translation are escaped
        # instead of being parsed as markup.
        span = soup.new_tag('span')
        span['class'] = 'translated-text'
        span['title'] = 'Translation'
        span.string = translation

        # Append a separating space and the span after the tag's existing children
        el.append(' ')
        el.append(span)

    # Write output
    with open(output_html, 'w', encoding='utf-8') as f:
        # str(soup) keeps the original formatting as much as possible;
        # prettify() would re-indent the whole document.
        f.write(str(soup))

    print(f"[MERGE] Processed {count} segments.")
    print(f"[MERGE] Saved to: {output_html}")


def main():
    parser = argparse.ArgumentParser(
        description="Extract text from HTML for translation and merge back.",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  # Step 1: Extract text
  python html_translation_helper.py extract source.html source_texts.txt

  # Step 2: Edit source_texts.txt externally

  # Step 3: Merge back
  python html_translation_helper.py merge source.html source_texts.txt final.html
"""
    )

    subparsers = parser.add_subparsers(dest='command', help='Mode of operation')

    # Extract Command
    parser_extract = subparsers.add_parser('extract', help='Extract text from HTML to TXT')
    parser_extract.add_argument('html_input', help='Input HTML file')
    parser_extract.add_argument('text_output', help='Output TXT file for translation')
    parser_extract.add_argument('--tags', default=DEFAULT_TAGS,
                                help=f'Comma-separated HTML tags to extract (Default: {DEFAULT_TAGS})')

    # Merge Command
    parser_merge = subparsers.add_parser('merge', help='Merge translated TXT back into HTML')
    parser_merge.add_argument('html_input', help='Original Input HTML file')
    parser_merge.add_argument('text_input', help='Translated TXT file')
    parser_merge.add_argument('html_output', help='Output HTML file with translations')
    parser_merge.add_argument('--tags', default=DEFAULT_TAGS,
                              help=f'Comma-separated HTML tags to match (Default: {DEFAULT_TAGS})')

    args = parser.parse_args()

    if not args.command:
        parser.print_help()
        sys.exit(1)

    if args.command == 'extract':
        extract_text(args.html_input, args.text_output, args.tags)
    elif args.command == 'merge':
        merge_text(args.html_input, args.text_input, args.html_output, args.tags)


if __name__ == '__main__':
    main()
```
How to Use It
1. Extract Phase
Run the script in extract mode. This scans your HTML, finds the specified tags, and writes their text to a file.

```bash
python html_translation_helper.py extract website.html to_translate.txt
```

- Input: website.html
- Output: to_translate.txt (plain text, one line per HTML element found).
- Tags: By default, it looks for paragraphs, headings, list items, table cells, and a few other text-bearing tags. You can customize this with --tags "p,span,div". Avoid listing a tag together with tags nested inside it (for example div and the p elements it contains), or the same text is extracted twice.
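The nested-tag caveat from the script's comments is easy to reproduce. The following is a minimal sketch, assuming a small made-up HTML snippet and a hypothetical helper named extract_lines that normalizes whitespace with split and join (a choice made for this sketch). It shows that selecting both li and p yields the same text twice when a p sits inside an li.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, invented for this illustration.
SAMPLE_HTML = """
<ul>
  <li><p>First item</p></li>
  <li>Second
      item</li>
</ul>
"""


def extract_lines(html, tags):
    """Return one whitespace-normalized line per matching element, in document order."""
    soup = BeautifulSoup(html, "html.parser")
    return [" ".join(el.get_text().split()) for el in soup.find_all(tags)]


# Selecting only <li> gives one line per list item.
flat = extract_lines(SAMPLE_HTML, ["li"])
# Selecting <li> and <p> also matches the <p> nested in the first <li>.
nested = extract_lines(SAMPLE_HTML, ["li", "p"])

print(flat)    # ['First item', 'Second item']
print(nested)  # ['First item', 'First item', 'Second item']
```

If the duplicated line appears in your text file, narrowing the tag list to the innermost or outermost tag of each nested pair is usually the simplest fix.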
2. Translation Phase
Open to_translate.txt in your text editor or send it to a translator.
- Important: Do not change the order of the lines.
- Important: Do not add or remove lines. If a line was empty, keep it empty. To skip a line, delete its text but keep the line break; the merge step leaves that element unchanged.
- Save the file as translated.txt.
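Because the merge step matches lines to elements purely by position, it can help to compare line counts before merging. The following is a minimal sketch using only the standard library; the helper names count_lines and check_alignment are hypothetical, and the demo writes throwaway files in a temporary directory instead of touching your real ones.

```python
import os
import tempfile


def count_lines(path):
    """Count the lines in a UTF-8 text file."""
    with open(path, "r", encoding="utf-8") as f:
        return sum(1 for _ in f)


def check_alignment(source_path, translated_path):
    """Return (ok, source_count, translated_count) for the two text files."""
    source_count = count_lines(source_path)
    translated_count = count_lines(translated_path)
    return source_count == translated_count, source_count, translated_count


# Demo with throwaway files; the middle line is intentionally empty in both.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "to_translate.txt")
    dst = os.path.join(tmp, "translated.txt")
    with open(src, "w", encoding="utf-8") as f:
        f.write("Hello World\n\nGoodbye\n")
    with open(dst, "w", encoding="utf-8") as f:
        f.write("Hola Mundo\n\nAdiós\n")
    ok, n_src, n_dst = check_alignment(src, dst)
    print(f"aligned={ok} source={n_src} translated={n_dst}")
```

A matching count does not prove that no lines were swapped, so this check only catches added or removed lines.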
3. Merge Phase
Run the script in merge mode. This reads the original HTML and the translated text file, then injects the translation.

```bash
python html_translation_helper.py merge website.html translated.txt website_translated.html
```

- Logic: It finds the same tags in the same order as the extraction phase. If you passed --tags during extraction, pass the same value here.
- Result: It appends the translated text inside the tag, wrapped in a <span> for styling.
  - Before: <p>Hello World</p>
  - After: <p>Hello World <span class="translated-text" title="Translation">Hola Mundo</span></p>
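The append step can be tried in isolation. This is a minimal sketch, assuming the span is built with BeautifulSoup's new_tag and filled through its string attribute; the helper name append_translation and the sample paragraphs are made up for illustration. It shows that an existing child tag survives and that special characters in the translation are escaped on output.

```python
from bs4 import BeautifulSoup


def append_translation(soup, element, translation):
    """Append the translation as a <span> after the element's existing children."""
    span = soup.new_tag("span")
    span["class"] = "translated-text"
    span["title"] = "Translation"
    # Assigning to .string stores plain text, so "<" and "&" are escaped on output.
    span.string = translation
    element.append(" ")
    element.append(span)


# A paragraph with a nested <b> tag: the child tag must survive the merge.
soup = BeautifulSoup("<p>Hello <b>World</b></p>", "html.parser")
append_translation(soup, soup.p, "Hola Mundo")
merged = str(soup)
print(merged)

# A translation that contains characters with special meaning in HTML.
soup2 = BeautifulSoup("<p>Less than</p>", "html.parser")
append_translation(soup2, soup2.p, "a < b & c")
escaped = str(soup2)
print(escaped)
```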
Key Features & Safeguards
- DOM Order: It relies on document order. As long as you don't change the HTML structure between extraction and merging, each line stays aligned with its element.
- Structure Integrity: It uses tag.append(), which adds content to the end of the tag without destroying existing child tags (like <b>, <a>, or <img> inside a paragraph).
- UTF-8 Encoding: All file operations use utf-8 to support all languages.
- CSS Hook: The translated text is wrapped in <span class="translated-text">. You can add CSS to your site later to hide these (e.g., .translated-text { display: none; }) once you are confident in the translation.
- Mismatch Warning: If the number of lines in the text file doesn't match the number of tags in the HTML (e.g., you edited the HTML accidentally), the script prints a warning and then merges only as many segments as both files have in common.
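The script's comments note that running merge twice on its own output would append the translation twice, and that a marker check could prevent it. The following is one possible sketch of such a check, not part of the script above. The names MARKER_CLASS, already_translated, and append_once are hypothetical, and the check only looks at direct children of the element.

```python
from bs4 import BeautifulSoup

MARKER_CLASS = "translated-text"  # same class the merged spans carry


def already_translated(element):
    """True if the element has a direct child <span> with the marker class."""
    return element.find("span", class_=MARKER_CLASS, recursive=False) is not None


def append_once(soup, element, translation):
    """Append the translation span unless one is already present. Returns True if appended."""
    if already_translated(element):
        return False
    span = soup.new_tag("span")
    span["class"] = [MARKER_CLASS]  # list form, as BeautifulSoup stores parsed classes
    span.string = translation
    element.append(" ")
    element.append(span)
    return True


soup = BeautifulSoup("<p>Hello World</p>", "html.parser")
first = append_once(soup, soup.p, "Hola Mundo")
second = append_once(soup, soup.p, "Hola Mundo")  # simulated accidental second run
print(first, second, len(soup.find_all("span")))
```

A guard like this trades strict "always append" behavior for safety against repeated runs, so whether it is appropriate depends on your workflow.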