Qwen Chat

I want to extract all text from specific HTML tags into a file, so I can do translation externally and in context, then later re-insert the translated text into the original HTML, keeping all tags and structure intact. The intermediate file must be a plain text file in which extracted texts are separated by line feeds. In the final HTML, append the translated text after the original text for comparison. Can you show me how to do it in one standalone Python script?

Here is a complete, standalone Python script that handles both the extraction and the re-insertion (merge) processes.

It uses the BeautifulSoup library for HTML parsing. This adds one external dependency, but BeautifulSoup is a widely used parser that keeps the tag structure intact when the document is modified.

Prerequisites

You will need to install beautifulsoup4 if you haven't already:

bash
pip install beautifulsoup4

The Script (`html_translation_helper.py`)

python
#!/usr/bin/env python3
"""
HTML Translation Helper Script

This script extracts text from specific HTML tags for translation and
merges the translated text back into the HTML structure.

Workflow:
1. EXTRACT: python html_translation_helper.py extract input.html texts_to_translate.txt
2. TRANSLATE: Edit 'texts_to_translate.txt' externally (keep line order intact!).
3. MERGE:    python html_translation_helper.py merge input.html texts_to_translate.txt output.html

Dependencies:
    pip install beautifulsoup4
"""

import sys
import argparse
import os
from pathlib import Path

try:
    from bs4 import BeautifulSoup
except ImportError:
    print("Error: The 'beautifulsoup4' library is required.")
    print("Please install it using: pip install beautifulsoup4")
    sys.exit(1)

# Default tags to extract from. 
# Note: Avoid nested tags (e.g., extracting from both 'div' and 'p' inside it) 
# to prevent duplicate text extraction.
DEFAULT_TAGS = "p,h1,h2,h3,h4,h5,h6,li,td,th,label,figcaption,dt,dd"


def get_soup(file_path):
    """Load HTML file into a BeautifulSoup object."""
    with open(file_path, 'r', encoding='utf-8') as f:
        content = f.read()
    # Using 'html.parser' is built-in. 'lxml' is faster but requires installation.
    return BeautifulSoup(content, 'html.parser')


def extract_text(html_file, output_txt, tag_selector):
    """
    Extracts text from specific tags and saves to a plain text file.
    Each extracted segment is a new line.
    """
    soup = get_soup(html_file)
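    # Strip whitespace so that "--tags 'p, li'" works the same as "p,li"
    tags = [t.strip() for t in tag_selector.split(',') if t.strip()]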
    
    # Find all matching tags in DOM order
    elements = soup.find_all(tags)
    
    extracted_lines = []
    
    for el in elements:
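        # Collapse all runs of whitespace (including newlines and indentation)
        # into single spaces so every element occupies exactly one line.
        # get_text(strip=True) would glue "Hello <b>big</b> world" into
        # "Hellobigworld" and would keep newlines that sit inside a text node.
        text = " ".join(el.get_text().split())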
        
        # We extract even if empty to maintain line count synchronization, 
        # but typically translators skip empty lines. 
        # To be safe for line-matching, we record the line regardless.
        extracted_lines.append(text)
    
    # Write to file
    with open(output_txt, 'w', encoding='utf-8') as f:
        for line in extracted_lines:
            f.write(line + '\n')
            
    print(f"[EXTRACT] Found {len(extracted_lines)} text segments.")
    print(f"[EXTRACT] Saved to: {output_txt}")


def merge_text(html_file, txt_file, output_html, tag_selector):
    """
    Reads translated lines and appends them to the original text 
    within the HTML tags, preserving structure.
    """
    soup = get_soup(html_file)
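    # Strip whitespace so that "--tags 'p, li'" works the same as "p,li"
    tags = [t.strip() for t in tag_selector.split(',') if t.strip()]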
    
    # Read translated lines
    if not os.path.exists(txt_file):
        print(f"[ERROR] Translation file not found: {txt_file}")
        sys.exit(1)
        
    with open(txt_file, 'r', encoding='utf-8') as f:
        translated_lines = [line.rstrip('\n') for line in f.readlines()]
    
    # Find all matching tags in DOM order (Must match extraction order)
    elements = soup.find_all(tags)
    
    if len(elements) != len(translated_lines):
        print(f"[WARNING] Mismatch count! HTML tags: {len(elements)}, Text lines: {len(translated_lines)}")
        print("[WARNING] Ensure the HTML structure hasn't changed since extraction.")
        # We continue anyway, but will stop at the shorter of the two lists to avoid crash
        
    count = min(len(elements), len(translated_lines))
    
    for i in range(count):
        el = elements[i]
        translation = translated_lines[i]
        
        # Skip if translation is empty (optional, but keeps HTML cleaner)
        if not translation.strip():
            continue
            
        # Get original text to ensure we aren't duplicating if script runs twice
        # However, per requirement, we simply append. 
        # To prevent double-appending if script is run accidentally twice, 
        # we could check for a marker, but we will trust the user workflow.
        
        # Append the translation after the original content, wrapped in a span
        # for potential CSS styling later.
        # The text is set via .string so that characters such as < or & in the
        # translation are escaped instead of being parsed as markup.
        span = soup.new_tag('span')
        span['class'] = 'translated-text'
        span['title'] = 'Translation'
        span.string = translation
        el.append(' ')
        el.append(span)
        
    # Write output
    with open(output_html, 'w', encoding='utf-8') as f:
        # str(soup) keeps the whitespace of the parsed document.
        # prettify() would re-indent every tag and can change how inline text renders.
        f.write(str(soup))
        
    print(f"[MERGE] Processed {count} segments.")
    print(f"[MERGE] Saved to: {output_html}")


def main():
    parser = argparse.ArgumentParser(
        description="Extract text from HTML for translation and merge back.",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  # Step 1: Extract text
  python html_translation_helper.py extract source.html source_texts.txt
  # Step 2: Edit source_texts.txt externally
  # Step 3: Merge back
  python html_translation_helper.py merge source.html source_texts.txt final.html
        """
    )
    
    subparsers = parser.add_subparsers(dest='command', help='Mode of operation')
    
    # Extract Command
    parser_extract = subparsers.add_parser('extract', help='Extract text from HTML to TXT')
    parser_extract.add_argument('html_input', help='Input HTML file')
    parser_extract.add_argument('text_output', help='Output TXT file for translation')
    parser_extract.add_argument('--tags', default=DEFAULT_TAGS, 
                                help=f'Comma-separated HTML tags to extract (Default: {DEFAULT_TAGS})')
    
    # Merge Command
    parser_merge = subparsers.add_parser('merge', help='Merge translated TXT back into HTML')
    parser_merge.add_argument('html_input', help='Original Input HTML file')
    parser_merge.add_argument('text_input', help='Translated TXT file')
    parser_merge.add_argument('html_output', help='Output HTML file with translations')
    parser_merge.add_argument('--tags', default=DEFAULT_TAGS, 
                              help=f'Comma-separated HTML tags to match (Default: {DEFAULT_TAGS})')
    
    args = parser.parse_args()
    
    if not args.command:
        parser.print_help()
        sys.exit(1)
        
    if args.command == 'extract':
        extract_text(args.html_input, args.text_output, args.tags)
    elif args.command == 'merge':
        merge_text(args.html_input, args.text_input, args.html_output, args.tags)


if __name__ == '__main__':
    main()

How to Use It

1. Extract Phase

Run the script in extract mode. This scans your HTML, finds the specific tags, and dumps the text into a file.

bash
python html_translation_helper.py extract website.html to_translate.txt

Input: website.html
Output: to_translate.txt (Plain text, one line per HTML element found).
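Tags: By default, it looks for paragraphs, headings, list items, table cells, labels, figure captions, and definition-list terms and descriptions (p, h1-h6, li, td, th, label, figcaption, dt, dd). You can customize this with --tags "p,span", but pass the same --tags value to the merge command as well, and avoid listing a tag together with tags nested inside it (e.g. div and p), because the nested text would then be extracted twice.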
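
The extract step can be sketched on a small in-memory document. This is only an illustration: the sample HTML and the helper name extract_lines are made up here, and collapsing whitespace with split/join is one way to guarantee a single line per element, not necessarily what the script does character for character.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document; the <p> deliberately spans two source lines.
SAMPLE_HTML = """<h1>Title</h1>
<p>Hello <b>big</b>
   world</p>
<ul><li>One</li><li></li></ul>"""


def extract_lines(html, tags):
    """Return one whitespace-normalized string per matching element, in DOM order."""
    soup = BeautifulSoup(html, "html.parser")
    lines = []
    for el in soup.find_all(tags):
        # split() with no arguments breaks on any whitespace, so newlines and
        # indentation inside the element cannot leak into the text file.
        lines.append(" ".join(el.get_text().split()))
    return lines


lines = extract_lines(SAMPLE_HTML, ["h1", "p", "li"])
print("\n".join(lines))
```

The empty list item still produces an (empty) line, which is what keeps line numbers and elements in step.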

2. Translation Phase

Open to_translate.txt in your text editor or send it to a translator.

Important: Do not change the order of the lines.
Important: Do not add or remove lines. If a line was empty, keep it empty (or delete the text but keep the line break).
Save the file as translated.txt.
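
Before merging, it can help to check that the translator kept the file line-aligned. The following is a minimal sketch of such a check; the function names read_segments and check_alignment are hypothetical helpers, not part of the script above, and the file names in the usage comment are the ones used in this walkthrough.

```python
def read_segments(path):
    """Read a text file into a list of lines without their trailing newline."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


def check_alignment(original, translated):
    """Return a list of human-readable problems; an empty list means aligned."""
    problems = []
    if len(original) != len(translated):
        problems.append(
            f"line count differs: {len(original)} extracted vs {len(translated)} translated"
        )
    # Compare only the lines both files have.
    for number, (src, dst) in enumerate(zip(original, translated), start=1):
        # A line that is empty on one side only usually means a shifted line.
        if bool(src.strip()) != bool(dst.strip()):
            problems.append(f"line {number}: one side is empty, the other is not")
    return problems


# Usage (assumes both files exist):
# for problem in check_alignment(read_segments("to_translate.txt"),
#                                read_segments("translated.txt")):
#     print(problem)
```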

3. Merge Phase

Run the script in merge mode. This reads the original HTML and the translated text file, then injects the translation.

bash
python html_translation_helper.py merge website.html translated.txt website_translated.html

Logic: It finds the same tags in the same order as the extraction phase.
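Result: It appends the translated text inside the tag, wrapped in a <span class="translated-text"> for styling.
- Before: <p>Hello World</p>
- After: <p>Hello World <span class="translated-text" title="Translation">Hola Mundo</span></p>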
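
The append step can be sketched in isolation. This is an illustrative snippet under stated assumptions: the sample paragraph and the translation string are invented, and the span is built with soup.new_tag() and .string so that the translation is treated as plain text.

```python
from bs4 import BeautifulSoup

# Hypothetical input: a paragraph that already contains a child tag.
soup = BeautifulSoup('<p>Hello <a href="#">World</a></p>', "html.parser")
paragraph = soup.find("p")

# Build the wrapper element; assigning .string stores the translation as text,
# so the angle brackets below are escaped on output rather than parsed as a tag.
span = soup.new_tag("span")
span["class"] = "translated-text"
span.string = "Hola <Mundo>"

paragraph.append(" ")   # separator between original and translation
paragraph.append(span)  # added after the existing children, which stay untouched

result = str(soup)
print(result)
```

The existing <a> element is left as it was; only a text node and the new span are added at the end of the paragraph.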

Key Features & Safeguards

DOM Order: It relies on the Document Object Model order. As long as you don't change the HTML structure between Extraction and Merging, the lines will align perfectly.
Structure Integrity: It uses tag.append(), which adds content to the end of the tag without destroying existing child tags (like <a> or <img> inside a paragraph).
UTF-8 Encoding: All file operations enforce utf-8 to support all languages.
CSS Hook: The translated text is wrapped in <span class="translated-text">. You can add CSS to your site later to hide these (e.g., .translated-text { display: none; }) once you are confident in the translation.
Mismatch Warning: If the number of lines in the text file doesn't match the number of tags in the HTML (e.g., you edited the HTML accidentally), the script prints a warning and then merges only as many segments as the shorter of the two lists.
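
One thing the script does not guard against is running merge on an HTML file that was already merged, which would append a second translation. A possible safeguard, sketched here as a hypothetical helper named strip_previous_translations (not part of the script), is to remove earlier translated-text spans first:

```python
from bs4 import BeautifulSoup


def strip_previous_translations(html):
    """Remove every <span class="translated-text"> left by an earlier merge."""
    soup = BeautifulSoup(html, "html.parser")
    for span in soup.find_all("span", class_="translated-text"):
        # decompose() removes the span and its contents from the tree.
        span.decompose()
    return str(soup)


merged = '<p>Hello World <span class="translated-text">Hola Mundo</span></p>'
cleaned = strip_previous_translations(merged)
print(cleaned)
```

The separating space that was added before the span stays behind; it is harmless for rendering, but it could be trimmed if an exact round trip matters.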
