Python: Parse Docx To Html With Processing Styles

On September 21, 2023, in Python Scripts Examples by Neculai Fantanaru

You can view the entire code here: https://pastebin.com/T7iF3X3S

Suppose I have a file, bebe.docx, that contains 10 articles translated into English, each with a title and a body.
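
To make that title/body layout concrete, here is a minimal sketch of the splitting step on plain text, using the same regular expression as the script further down. The sample text and the helper name `split_articles` are made up for illustration; the real script builds the text from the DOCX paragraphs.

```python
import re

# Same title pattern as in the script: a line that starts with a capital
# letter and contains only word characters, whitespace, ’, -, ( and ).
TITLE_PATTERN = r'^([A-Z][\w\s’\-\(\)]+)$'


def split_articles(content):
    """Return a list of (title, body_lines) pairs (hypothetical helper)."""
    parts = re.split(TITLE_PATTERN, content, flags=re.MULTILINE)
    parts = [part.strip() for part in parts if part.strip()]
    # parts alternates: title, body, title, body, ...
    return [(parts[i], parts[i + 1].split("\n"))
            for i in range(0, len(parts) - 1, 2)]


# Made-up sample: two articles, each a title line followed by body lines.
sample = (
    "The Master Touch\n"
    "Leadership: Do you see the detail?\n"
    "It takes patience.\n"
    "Leadership Puzzle\n"
    "Every piece matters.\n"
)

for title, body in split_articles(sample):
    print(title, "->", body)
```

A body line is protected from being read as a title only by punctuation that the pattern does not allow, such as a period, a colon or a question mark. A body line made of plain capitalized words would be treated as a new title.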

The code opens bebe.docx, passes the data through a template, index.html, and saves each article as a separate HTML file named after the article's title. Along the way it makes a few other changes to the HTML tags: it maps bold and italic runs from the DOCX to <span> and <em> tags and fills in the meta description.
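
The post does not show index.html, so the following is only a guess at its minimal shape, reconstructed from the strings the script searches for: `XXX` for the title, the `<!-- ARTICOL START -->` and `<!-- ARTICOL FINAL -->` markers around the body, `zzz.html` for the file name, and a `<meta name="description">` tag. The `<title>` and `<link>` lines and the helper `missing_placeholders` are assumptions added for illustration.

```python
# Hypothetical minimal template; the real index.html is not shown in the post.
MINIMAL_TEMPLATE = """<!DOCTYPE html>
<html>
<head>
<title>XXX</title>
<meta name="description" content="">
<link rel="canonical" href="zzz.html">
</head>
<body>
<h1 class="den_articol" itemprop="name">XXX</h1>
<!-- ARTICOL START -->
<!-- ARTICOL FINAL -->
</body>
</html>
"""

# Strings the script looks for in the template.
REQUIRED = [
    '<h1 class="den_articol" itemprop="name">XXX</h1>',
    '<!-- ARTICOL START -->',
    '<!-- ARTICOL FINAL -->',
    '<meta name="description" content="',
    'zzz.html',
]


def missing_placeholders(template_text):
    """Return the required strings that do not occur in template_text."""
    return [item for item in REQUIRED if item not in template_text]


# An empty list means the template has everything the script expects.
print(missing_placeholders(MINIMAL_TEMPLATE))
```

Running a check like this before the conversion helps because the script silently leaves the template unchanged wherever a placeholder is missing.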

import os
import re
import unidecode
from docx import Document


# Replace the content of the <meta name="description"> tag
def add_content_to_meta(html_content, content_to_add):
    meta_pattern = r'<meta name="description" content="(.*?)">'
    match = re.search(meta_pattern, html_content)
    if match:
        old_meta_tag = match.group(0)
        new_content = re.sub(r'<em>(.*?)</em>', r'\1', content_to_add)  # Strip the <em> tags
        new_meta_tag = f'<meta name="description" content="{new_content}">'
        updated_html_content = html_content.replace(old_meta_tag, new_meta_tag)
        return updated_html_content
    else:
        return html_content


# Extract a (title, body) pair for every article in the DOCX file
def extract_data_from_docx(file_path):
    doc = Document(file_path)
    content = ""

    # Walk through all paragraphs in the document and append their text
    for paragraph in doc.paragraphs:
        content += paragraph.text + "\n"

    # Split on title lines: a line that starts with a capital letter and
    # contains only word characters, whitespace, ’, -, ( and )
    articles = re.split(r'^([A-Z][\w\s’\-\(\)]+)$', content, flags=re.MULTILINE)
    articles = [article.strip() for article in articles if article.strip()]

    data = []

    # The list alternates title, body, title, body, ...
    # Stop one short so a trailing title without a body does not raise IndexError
    for i in range(0, len(articles) - 1, 2):
        title = articles[i]
        body = articles[i + 1].strip().split("\n")
        data.append((title, body))

    return data



# Convert the run formatting (bold/italic) of one DOCX paragraph into HTML
def convert_docx_to_html_style(para):
    result = ""

    if para.runs:
        contains_bold = any(run.bold for run in para.runs)
        contains_italic = any(run.italic for run in para.runs)
        contains_regular = any(not run.bold and not run.italic for run in para.runs)

        if contains_bold and contains_italic and contains_regular:
            # Case 4: the paragraph mixes bold, italic and regular text
            html_para = '<p class="text_obisnuit">'
            is_bold = False
            is_italic = False
            for run in para.runs:
                bold, italic = bool(run.bold), bool(run.italic)
                if (bold, italic) != (is_bold, is_italic):
                    # Formatting changed: close open tags (innermost first),
                    # then open the tags for the new state
                    if is_italic:
                        html_para += '</em>'
                    if is_bold:
                        html_para += '</span>'
                    if bold:
                        html_para += '<span class="text_obisnuit2">'
                    if italic:
                        html_para += '<em>'
                    is_bold, is_italic = bold, italic

                html_para += run.text

            if is_italic:
                html_para += '</em>'
            if is_bold:
                html_para += '</span>'

            html_para += '</p>\n'

        elif contains_bold and contains_regular:
            # Case 3: the paragraph mixes bold and regular text
            html_para = '<p class="text_obisnuit">'
            is_bold = False
            for run in para.runs:
                if run.bold and not is_bold:
                    html_para += '<span class="text_obisnuit2">'
                    is_bold = True
                elif not run.bold and is_bold:
                    html_para += '</span>'
                    is_bold = False

                html_para += run.text

            if is_bold:
                html_para += '</span>'

            html_para += '</p>\n'

        elif contains_bold:
            # Case 2: bold text with no regular runs
            html_para = '<p class="text_obisnuit2">'
            for run in para.runs:
                html_para += run.text
            html_para += '</p>\n'

        elif contains_italic:
            # Case 5: italic text, no bold
            html_para = '<p class="text_obisnuit">'
            is_italic = False
            for run in para.runs:
                if run.italic and not is_italic:
                    html_para += '<em>'
                    is_italic = True
                elif not run.italic and is_italic:
                    html_para += '</em>'
                    is_italic = False

                html_para += run.text

            if is_italic:
                html_para += '</em>'

            html_para += '</p>\n'

        else:
            # Case 1: regular text (no bold or italic)
            html_para = '<p class="text_obisnuit">'
            for run in para.runs:
                html_para += run.text
            html_para += '</p>\n'

        result += html_para

    return result


def generate_filename(title):
    # Trim the title and drop possessive/contraction endings (’s, ’t)
    title = title.strip()
    title = re.sub(r'’s|’t', '', title)
    # Remove everything except ASCII letters, digits and whitespace
    title = re.sub(r'[^a-zA-Z0-9\s]', '', title)

    # Lowercase the title
    title = title.lower()

    # Use hyphens instead of spaces
    title = title.replace(' ', '-')
    # Append the .html extension
    filename = title + ".html"
    print("Filename:", filename)  # Show the generated file name
    return filename

def update_and_save_html(title, body, template_path, output_directory):
    with open(template_path, "r", encoding="utf-8") as f:
        html_content = f.read()

    # Remove the " | Neculai Fantanaru (en)" suffix from the title, if present
    title_without_suffix = title.replace(" | Neculai Fantanaru (en)", "")

    # Put the suffix-free title into the <h1> tag
    html_content = html_content.replace('<h1 class="den_articol" itemprop="name">XXX</h1>', f'<h1 class="den_articol" itemprop="name">{title_without_suffix}</h1>')

    # Replace the next two "XXX" placeholders with the suffixed title
    # (built from the suffix-free title so the suffix is never doubled)
    new_title = title_without_suffix + " | Neculai Fantanaru (en)"
    html_content = html_content.replace("XXX", new_title, 2)
    print("The title has been replaced.")

    # Find the start and end of the article block in the template
    start_marker = "<!-- ARTICOL START -->"
    end_marker = "<!-- ARTICOL FINAL -->"
    start = html_content.find(start_marker)
    end = html_content.find(end_marker)

    if start != -1 and end != -1:
        # Replace the block with the article body, using the required formatting
        article_content = ""
        first_sentence = True
        for line in body:
            line = line.strip()
            if not line:
                continue  # skip empty paragraphs
            # The first line becomes the italic lead paragraph
            if first_sentence:
                article_content += f'\n\t<p class="text_obisnuit2"><em>{line}</em></p>'
                first_sentence = False
            elif line.startswith("Leadership:"):
                article_content += f'\n\t<p class="text_obisnuit2">{line}</p>'
            else:
                article_content += f'\n\t<p class="text_obisnuit">{line}</p>'
        html_content = html_content[:start + len(start_marker)] + article_content + html_content[end:]
        print("The article body has been replaced as required.")

    # Apply the DOCX run formatting (bold/italic) to the matching HTML paragraphs
    # Note: the DOCX path is hard-coded here
    doc = Document("bebe.docx")
    for para in doc.paragraphs:
        html_style = convert_docx_to_html_style(para)
        html_content = html_content.replace("<p class=\"text_obisnuit\">{}</p>".format(para.text), html_style)

    # Transliterate non-ASCII characters to ASCII with unidecode
    html_content = unidecode.unidecode(html_content)

    # Build the file name from the title
    filename = generate_filename(title)

    # Replace the "zzz.html" placeholder with the generated file name
    html_content = html_content.replace("zzz.html", filename)

    # Save the result as UTF-8
    with open(os.path.join(output_directory, filename), "w", encoding="utf-8") as f:
        f.write(html_content)
    print(f"The file was saved as {filename}")

# Extract the articles from bebe.docx
articles_data = extract_data_from_docx("bebe.docx")

# Directory where the HTML files will be saved
output_directory = "output"
os.makedirs(output_directory, exist_ok=True)  # create it if it does not exist yet

# Update and save each article as a separate HTML file
for title, body in articles_data:
    update_and_save_html(title, body, "index.html", output_directory)

    # Reopen the generated HTML file to update it
    html_file_path = os.path.join(output_directory, generate_filename(title))
    with open(html_file_path, "r", encoding="utf-8") as html_file:
        html_content = html_file.read()

    # Collect the content of the paragraphs with class "text_obisnuit2"
    content_to_add = ""
    paragraphs = re.findall(r'<p class="text_obisnuit2">(.*?)</p>', html_content)
    for paragraph in paragraphs:
        content_to_add += paragraph.strip() + ' '

    # Strip the <em> tags from the collected content
    content_to_add_cleaned = re.sub(r'<em>(.*?)</em>', r'\1', content_to_add)

    # Put the cleaned content into the meta description tag
    updated_html_content = add_content_to_meta(html_content, content_to_add_cleaned)

    # Write the updated HTML back to the file
    with open(html_file_path, "w", encoding="utf-8") as updated_html_file:
        updated_html_file.write(updated_html_content)

print("The content has been added to the meta tag of each HTML file, with the <em> tags removed.")
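
The tag handling in `convert_docx_to_html_style` is easier to follow on a stand-in for python-docx runs. The sketch below uses a hypothetical `Run` named tuple with `text`, `bold` and `italic` fields instead of real `docx` objects, and a hypothetical helper `runs_to_html`. Whenever the formatting changes it closes the open tags in reverse order before opening new ones, so the tags stay properly nested. It also escapes the text with `html.escape`, which the script above does not do.

```python
import html
from collections import namedtuple

# Stand-in for a python-docx run: only the three attributes used here.
Run = namedtuple("Run", "text bold italic")


def runs_to_html(runs):
    """Render runs as one <p>, wrapping bold in <span> and italic in <em>."""
    out = ['<p class="text_obisnuit">']
    is_bold = is_italic = False
    for run in runs:
        bold, italic = bool(run.bold), bool(run.italic)
        if (bold, italic) != (is_bold, is_italic):
            # Close what is open (innermost first), then open the new state.
            if is_italic:
                out.append('</em>')
            if is_bold:
                out.append('</span>')
            if bold:
                out.append('<span class="text_obisnuit2">')
            if italic:
                out.append('<em>')
            is_bold, is_italic = bold, italic
        out.append(html.escape(run.text))
    if is_italic:
        out.append('</em>')
    if is_bold:
        out.append('</span>')
    out.append('</p>')
    return ''.join(out)


runs = [
    Run("Leadership: ", True, False),
    Run("know ", False, False),
    Run("yourself", False, True),
    Run(" first.", None, None),  # None means "inherited" in python-docx
]
print(runs_to_html(runs))
```

Treating `None` as "not bold" and "not italic" mirrors what the script does with `run.bold` and `run.italic`; formatting inherited from a paragraph or character style is therefore ignored.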

That's all folks.

Also, see my other Python Scripts ---HERE---
