I want to extract all text from specific HTML tags into a file, so I can do translation externally and in context, then later re-insert the translated text into the original HTML, keeping all tags and structure intact. The intermediate file must be a plain text file in which extracted texts are separated by line feeds. In the final HTML, append the translated text after the original text for comparison. Can you show me how to do it in one standalone Python script?

Here is a complete, standalone Python script that handles both the extraction and the re-insertion (merge) processes.
It uses the BeautifulSoup library to parse the HTML. This adds one external dependency, but BeautifulSoup is a widely used choice for modifying HTML without breaking its structure.

Prerequisites

You will need to install beautifulsoup4 if you haven't already:
pip install beautifulsoup4

The Script (html_translation_helper.py)

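As a rough idea of how the command-line interface described in this answer could be wired up, here is a minimal argparse sketch. The positional argument order, the --output option, and the exact default tag list are assumptions for illustration; only the two modes and the --tags option are mentioned in the answer itself.

```python
import argparse

# Assumed default: paragraphs, headers, list items, and table cells.
DEFAULT_TAGS = "p,h1,h2,h3,h4,h5,h6,li,td,th"


def build_parser():
    """Build a parser with an extract/merge mode and a --tags option."""
    parser = argparse.ArgumentParser(
        description="Extract text for translation and merge it back into HTML."
    )
    parser.add_argument("mode", choices=["extract", "merge"])
    parser.add_argument("html_file", help="original HTML file")
    parser.add_argument("text_file", help="plain text file, one element per line")
    # Hypothetical option: where to write the merged HTML in merge mode.
    parser.add_argument("--output", help="merged HTML file (merge mode only)")
    parser.add_argument("--tags", default=DEFAULT_TAGS,
                        help="comma-separated tag names to process")
    return parser


def parse_tags(value):
    """Turn 'p, span,div' into ['p', 'span', 'div']."""
    return [name.strip().lower() for name in value.split(",") if name.strip()]
```

Both modes must receive the same --tags value, since the merge step has to find the same elements that the extract step found.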

How to Use It

1. Extract Phase

Run the script in extract mode. This scans your HTML, finds the specific tags, and dumps the text into a file.
  • Input: website.html
  • Output: to_translate.txt (Plain text, one line per HTML element found).
  • Tags: By default, it looks for paragraphs, headers, list items, and table cells. You can customize this with --tags "p,span,div".
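
The extraction step might look roughly like the sketch below. The function name extract_lines and the default tag list are assumptions; the BeautifulSoup calls (find_all with a list of names, get_text) are standard.

```python
from bs4 import BeautifulSoup

# Assumed default: paragraphs, headers, list items, and table cells.
DEFAULT_TAGS = ["p", "h1", "h2", "h3", "h4", "h5", "h6", "li", "td", "th"]


def extract_lines(html, tags=DEFAULT_TAGS):
    """Return one whitespace-normalized line of text per matching element."""
    soup = BeautifulSoup(html, "html.parser")
    lines = []
    # find_all() with a list of names returns matches in document order.
    for element in soup.find_all(list(tags)):
        # get_text() joins the text of all descendants; split() and join()
        # collapse newlines and runs of spaces so each element stays on
        # exactly one line of the output file.
        lines.append(" ".join(element.get_text().split()))
    # Caveat: if one listed tag is nested inside another (say <p> inside
    # <td>), both match, so the inner text is extracted twice.
    return lines
```

Writing each returned entry followed by a line feed, with encoding="utf-8", gives the plain text file described above.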

2. Translation Phase

Open to_translate.txt in your text editor or send it to a translator.
  • Important: Do not change the order of the lines.
  • Important: Do not add or remove lines. If a line is empty, leave it empty. If you do not want to translate a line, delete its text but keep the line break.
  • Save the file as translated.txt.
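
To show why empty lines matter, here is a small sketch of how the translated file could be read back. The name parse_translated is hypothetical, and the sketch assumes the extraction step wrote a line feed after every line, including the last one.

```python
def parse_translated(content):
    """Split file content into lines, keeping empty lines as empty strings."""
    # Normalize Windows line endings, then split on line feeds only, so that
    # other Unicode separators inside a translation cannot shift the count.
    lines = content.replace("\r\n", "\n").split("\n")
    # A final line feed leaves one empty trailing entry; drop only that one.
    if lines and lines[-1] == "":
        lines.pop()
    return [line.strip() for line in lines]
```

Because every position in the list corresponds to one HTML element, a deleted or added line shifts every translation after it onto the wrong element.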

3. Merge Phase

Run the script in merge mode. This reads the original HTML and the translated text file, then injects the translation.
  • Logic: It finds the same tags in the same order as the extraction phase.
  • Result: It appends the translated text inside the tag, wrapped in a <span> for styling.
    • Before: <p>Hello World</p>
    • After: <p>Hello World <span class="translated-text" title="Translation">Hola Mundo</span></p>
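
A minimal sketch of the merge logic, assuming a hypothetical function merge_translations that takes the HTML and the list of translated lines. Skipping empty lines is an assumption about the intended behavior, not something the answer states.

```python
from bs4 import BeautifulSoup


def merge_translations(html, translations, tags=("p", "li")):
    """Append each translation to its element inside a <span>."""
    soup = BeautifulSoup(html, "html.parser")
    # Collect the elements first, in the same order as during extraction.
    elements = soup.find_all(list(tags))
    for element, translated in zip(elements, translations):
        if not translated:
            continue  # assumed: an empty line means nothing to append
        span = soup.new_tag("span")
        span["class"] = "translated-text"
        span["title"] = "Translation"
        span.string = translated
        # append() adds to the end of the element, so existing children
        # such as <b> or <a> are left untouched.
        element.append(" ")
        element.append(span)
    return str(soup)
```

For example, merge_translations("<p>Hello World</p>", ["Hola Mundo"]) produces the "After" markup shown above.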

Key Features & Safeguards

  1. DOM Order: It relies on document order. As long as the HTML structure and the tag list stay the same between extraction and merging, line N of the text file maps to the Nth matching tag.
  2. Structure Integrity: It uses tag.append(), which adds content to the end of the tag without destroying existing child tags (like <b>, <a>, or <img> inside a paragraph).
  3. UTF-8 Encoding: All file operations enforce utf-8 to support all languages.
  4. CSS Hook: The translated text is wrapped in <span class="translated-text">. You can add CSS to your site later to hide these (e.g., .translated-text { display: none; }) once you are confident in the translation.
  5. Mismatch Warning: If the number of lines in the text file doesn't match the number of tags in the HTML (e.g., you edited the HTML accidentally), the script will warn you before proceeding.
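
The mismatch check in point 5 can be sketched as follows. The function name check_alignment and the wording of the message are assumptions.

```python
import sys


def check_alignment(element_count, line_count):
    """Return True when the counts match; otherwise warn on stderr."""
    if element_count == line_count:
        return True
    print(
        f"Warning: found {element_count} matching tags in the HTML "
        f"but {line_count} lines in the translation file.",
        file=sys.stderr,
    )
    return False
```

A caller could stop, or ask for confirmation, when this returns False, since merging misaligned lines would attach translations to the wrong elements.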