Neculai Fântânaru

Everything Depends On The Leader

Python: Convert Multiple Html Pages Into One Pdf File with Libraria Fpdf And Fpdf2

On August 27, 2022
, in
Python Scripts Examples by Neculai Fantanaru

You can view the entire code here: https://pastebin.com/hnAyWW2Q

We have many html files in which there are the following tags:

TITLE OF THE ARTICLE (will take the title of the article from the tag <h1 class="den_articol" itemprop="name"></h1>)

<td><h1 class="den_articol" itemprop="name">Abracadabra, cine esti?</h1></td>

DATE OF THE ARTICLE (will take the date from the tag : <td class="text_dreapta"></td>)

<td class="text_dreapta">On Iulie 19, 2010, in <a href="https://neculaifantanaru.com/leadership-impact.html" title="Vezi toate articolele din Leadership Impact" class="external" rel="category tag">Leadership Impact</a>, by Neculai Fantanaru</td>

BODY OF THE ARTICLE (will take all the sentences from the tags <p class="text_obisnuit"> si <p class="text_obisnuit2">

With the mention that these tags are included in the section < ! -- ARTICOL START --> si < ! -- ARTICOL FINAL --> )

< ! -- ARTICOL START -->

<p class="text_obisnuit">My name is Costel</p>

<p class="text_obisnuit2"><em>Games are my passion</em></p>

< ! -- ARTICOL FINAL -->

LINK ARTICLE (at the end of each article, the link to the html file will be displayed, preceded by https://neculaifantanaru.com/ )

abracadabra-cine-esti.html

The Python code will take the information from these tags, from all the html pages, and convert them into a single PDF file. Also, I made a dictionary to transform Unicode Characters into normal letters, in UTF-8 format.

In the same folder where you have the .py file, you will have to have another folder called "fonts" in which you will have to copy the windows fonts you work with.

CODE:

from fpdf import fpdf, html
import os
import re

from PyPDF2 import PdfFileMerger

def read_text_from_file(file_path):
    """
    Aceasta functie returneaza continutul unui fisier.
    file_path: calea catre fisierul din care vrei sa citesti
    """
    with open(file_path, encoding='utf8', errors='ignore') as f:
        text = f.read()
        f.close()
        return text
    file_content = re.sub('<span class="text_obisnuit2">(.*)</span>', '<b>\g<1></b>', file_content)

def write_to_file(text, file_path):
    """
    Aceasta functie scrie un text intr-un fisier.
    text: textul pe care vrei sa il scrii
    file_path: calea catre fisierul in care vrei sa scrii
    """
    with open(file_path, 'wb') as f:
        f.write(text.encode('utf8', 'ignore'))
        f.close()

dict_simboluri = dict()
dict_simboluri['&#259;'] = 'ă'
dict_simboluri['&#226;'] = 'â'
dict_simboluri['&atilde;'] = 'ã'
dict_simboluri['&acirc;'] = 'â'
dict_simboluri['&#x103;'] = 'ă'
dict_simboluri['&#xE2;'] = 'a'

dict_simboluri['  '] = ' '

dict_simboluri['&icirc;'] = 'î'
dict_simboluri['&#206;'] = 'Î'
dict_simboluri['&#238;'] = 'î'
dict_simboluri['&#xEE;'] = 'î'
dict_simboluri['&#xCE;'] = 'Î'
dict_simboluri['&#206;'] = 'Î'
dict_simboluri['&#xEE;'] = 'î'
dict_simboluri['&#xCE;'] = 'i'
dict_simboluri['&Icirc;'] = 'Î'

dict_simboluri['&nbsp;'] = ' '

dict_simboluri['&#537;'] = 'ș'
dict_simboluri['&#536;'] = 'Ș'
dict_simboluri['&#350;'] = 'Ş'
dict_simboluri['&#x219;'] = 'ș'
dict_simboluri['&#351;'] = 'ș'

dict_simboluri['&amp;'] = ''

dict_simboluri['&#539;'] = 'ț'
dict_simboluri['&#355;'] = 'ț'
dict_simboluri['&#354;'] = 'Ţ'
dict_simboluri['&#x21B;'] = 'ț'



def save_to_pdf(directory_path):
    for root, dirs, files in os.walk(directory_path):
        for file_name in files:
            if file_name.endswith(".html"):
                file_path = root + os.sep + file_name
                file_content = read_text_from_file(file_path)

                # creare fisier PDF
                class PDF(fpdf.FPDF, html.HTMLMixin):
                    pass

                pdf = PDF()
                pdf.add_page()
                pdf.add_font("Kanit", fname="fonts/Kanit-Regular.ttf")
                pdf.add_font("Kanit", style="B", fname="fonts/Kanit-Bold.ttf")
                pdf.add_font("Kanit", style="I", fname="fonts/Kanit-Italic.ttf")
                pdf.add_font("Kanit", style="BI", fname="fonts/Kanit-BoldItalic.ttf")
                pdf.set_font("Kanit", size=24)

                # extras denumire articol
                den_articol = re.search('<td><h1 class="den_articol" itemprop="name">(.*?)</h1></td>', file_content)
                if (den_articol == None):
                    print("Nu am gasit --- denumire articol --- in fisierul --- {} ---.".format(file_path))
                else:
                    den_articol = den_articol.group(1)
                    for simbol in dict_simboluri.keys():
                        den_articol = den_articol.replace(simbol, dict_simboluri[simbol])

                pdf.set_text_color(204, 0, 0) # rosu
                pdf.set_font('Kanit', size=14, style="B")
                pdf.multi_cell(w=190, txt=den_articol)
                pdf.ln()
                pdf.set_font('Kanit', size=12)

                # extras data
                date = re.search('<td class="text_dreapta">(.*?), in <a', file_content)
                if (date == None):
                    print("Nu am gasit --- date --- in fisierul --- {} ---.".format(file_path))
                else:
                    date = date.group(1)

                pdf.set_text_color(0, 102, 204) # albastru
                pdf.set_font('Kanit', size=8, style="B")
                pdf.cell(txt=date)
                pdf.ln()
                pdf.ln()
                pdf.ln()
                pdf.ln()
                pdf.set_text_color(0, 0, 0) # negru (default)
                pdf.set_font('Kanit', size=12)


                # extras text
                articol = re.search('<!-- ARTICOL START -->([\s\S]*?)<!-- ARTICOL FINAL -->', file_content)
                if (articol == None):
                    print("Nu am gasit --- ARTICOL START/FINAL --- in fisierul --- {} ---.".format(file_path))
                else:
                    articol = articol.group(1)
                    articol = articol.replace("&quot;", "\"")
                    articol = articol.replace("&rsquo;", "'")

                    # paragraphs
                    par_regex = re.compile('<p class="text_obisnuit.*?">.*?</p>')
                    pars = re.findall(par_regex, articol)
                    pars_text = list()

                    if (len(pars) == 0):
                        print("Nu am gasit -- paragrafe text_obisnuit -- in fisierul --- {} ---.".format(file_path))
                    else:
                        for i in range(0, len(pars)):
                            if ('<p class="text_obisnuit">' in pars[i]):

                                # identificam clasa text_obisnuit si preluam textul
                                content = re.findall('<p class="text_obisnuit">(.*?)</p>', pars[i])
                                if (len(content) == 0):
                                    print("Nu am gasit text in paragraful {}, fisierul {}.".format(pars[i], file_path))
                                else:

                                    # punem textul intr-o celula multi_cell
                                    for simbol in dict_simboluri.keys():
                                        content[0] = content[0].replace(simbol, dict_simboluri[simbol])
                                    pars_text.append(content[0])
                                    # pdf.multi_cell(w=190, txt = content[0])
                                    pdf.write_html(text=f'<p class="text_obisnuit">{content[0]}</p>')

                                    # adaugam linie goala intre paragrafe
                                    pdf.ln();
                            elif ('<p class="text_obisnuit2">' in pars[i]):

                                # identificam clasa text_obisnuit2 si preluam textul
                                content = re.findall('<p class="text_obisnuit2">(.*?)</p>', pars[i])
                                if (len(content) == 0):
                                    print("Nu am gasit text in paragraful {}, fisierul {}.".format(pars[i], file_path))
                                else:

                                    # setam fontul cu bold
                                    pdf.set_font('Kanit', size=12, style="B")


                                    # punem textul intr-o celula multi_cell
                                    for simbol in dict_simboluri.keys():
                                        content[0] = content[0].replace(simbol, dict_simboluri[simbol])
                                    pars_text.append(content[0])
                                    # pdf.multi_cell(w=190, txt = content[0])
                                    pdf.write_html(text=f'<p class="text_obisnuit2"><b>{content[0]}</b></p>')

                                    # adaugam linie goala intre paragrafe
                                    pdf.ln();

                                    # resetam fontul
                                    pdf.set_font('Kanit', size=12)
                            else:
                                continue

                    # adaugare link
                    pdf.ln()
                    pdf.ln()
                    pdf.set_font('Kanit', size=12, style="B")
                    pdf.cell(txt="Source:")
                    pdf.set_font('Kanit', size=12)
                    pdf.set_text_color(0, 102, 204) # albastru
                    pdf.cell(w=40, txt="https://neculaifantanaru.com/{}".format(file_name), link="https://neculaifantanaru.com/{}".format(file_name))

                    den_fisier = file_path.split('.')[0] + '.pdf'
                    pdf.output(den_fisier)
                    # break;

# functie care face merge la mai multe fisiere pdf
def merge_pdf_files(directory_path):
    merger = PdfFileMerger()
    for root, dirs, files in os.walk(directory_path):
        for file_name in files:
            if file_name.endswith(".pdf"):
                print("PDF: ", file_name)
                file_path = root + os.sep + file_name
                merger.append(file_path)
        merger.write(root + os.sep + "articles.pdf")
        merger.close()
        break;

save_to_pdf("c:\\Folder5\\")
merge_pdf_files("c:\\Folder5\\")

That's all folks.


If you like my code, then make me a favor: translate your website into Romanian, "ro".

Also, there is a VERSION 2 of this code or VERSION 3 or VERSION 4 or VERSION 5 or VERSION 6

Alatura-te Comunitatii Neculai Fantanaru
The 63 Greatest Qualities of a Leader
Cele 63 de calităţi ale liderului

Why read this book? Because it is critical to optimizing your performance. Because it reveals the main coordinates after that are build the character and skills of the leaders, highlighting what it is important for them to increase their influence.

Leadership - Magic of Mastery
Atingerea maestrului

The essential characteristic of this book in comparison with others on the market in the same domain is that it describes through examples the ideal competences of a leader. I never claimed that it's easy to become a good leader, but if people will...

The Master Touch
Leadership - Magia măiestriei

For some leaders, "leading" resembles more to a chess game, a game of cleverness and perspicacity; for others it means a game of chance, a game they think they can win every time risking and betting everything on a single card.

Leadership Puzzle
Leadership Puzzle

I wrote this book that conjoins in a simple way personal development with leadership, just like a puzzle, where you have to match all the given pieces in order to recompose the general image.

Performance in Leading
Leadership - Pe înţelesul tuturor

The aim of this book is to offer you information through concrete examples and to show you how to obtain the capacity to make others see things from the same angle as you.

Leadership for Dummies
Leadership - Pe înţelesul tuturor

Without considering it a concord, the book is representing the try of an ordinary man - the author - who through simple words, facts and usual examples instills to the ordinary man courage and optimism in his own quest to be his own master and who knows... maybe even a leader.