Python: Creating Multiple HTML Files from Text Files and Optimizing Tags

On January 22, 2022, in Python Scripts Examples by Neculai Fantanaru

The full code can be viewed here: HERE.

Install Python. Then install the following two libraries from the command prompt (cmd) on Windows 10:

py -m pip install unidecode
py -m pip install nltk

You will need the following:

1. Create a folder named fisiere_html (the text files will be saved here in HTML format).

2. Create a folder named LINKS. In it, create a file links.txt listing, one per line, the HTML links that will be inserted as keyword links into the body of the articles on the new HTML pages.

3. You will need one HTML file named oana.html. It will have this structure:

<title>Blah Blah Blah</title>
<meta name="description" content="Blah Blah Blah.">
<h3 class="font-weight-normal">TITLE OF THE ARTICLE</h3>
   <!-- ARTICOL START -->
<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit, 
sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. 
Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris
nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in 
reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla
pariatur. Excepteur sint occaecat cupidatat non proident, sunt in 
culpa qui officia deserunt mollit anim id est laborum.</p>
   <!-- ARTICOL FINAL -->

4. Put all the text files and the oana.html file in the main folder.
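
As a rough illustration of how the entries in links.txt become keywords, the sketch below splits each link's file name on hyphens, drops linking words, and maps every remaining word to the full URL. The sample link names, the shortened stop-word set, and the function name build_keyword_index are assumptions made for this example only; the script further down uses a much longer list of linking words.

```python
# Minimal sketch: turn link file names into a keyword -> URLs dictionary.
# SAMPLE_LINKS, STOP_WORDS and build_keyword_index are illustrative names,
# not part of the original script.
SITE = 'https://neculaifantanaru.com/'
STOP_WORDS = {'the', 'of', 'and', 'a', 'in', 'to'}
SAMPLE_LINKS = [
    'the-power-of-vision.html',
    'vision-and-leadership.html',
]


def build_keyword_index(links, site=SITE, stop_words=STOP_WORDS):
    index = {}
    for link in links:
        link = link.strip()
        if not link:
            continue  # skip blank lines
        slug = link.split('.')[0]  # keep the part before the first dot
        for word in slug.split('-'):
            if word in stop_words:
                continue  # linking words are not keywords
            urls = index.setdefault(word, [])
            if site + link not in urls:
                urls.append(site + link)
    return index


index = build_keyword_index(SAMPLE_LINKS)
for word in sorted(index):
    print(word, index[word])
```

A word that appears in several link names, such as "vision" here, ends up with several candidate URLs, which is why the script later picks one of them at random.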

WHAT DOES THE CODE DO?

1. Takes the first 10 words of each text file and builds from them the link (file name) under which the file is saved in HTML format.

2. Takes the first 10 words of each text file and copies them into the <title> tag and the <h3 class> tag.

3. Extracts the first 20 words of each text file and copies them into the <meta description> tag.

4. Copies the entire content of the text file into the section between <!-- ARTICOL START --> and <!-- ARTICOL FINAL --> (replacing the existing text in the HTML file).

5. Names the new HTML file after the first 10 words of the text file.

6. Checks whether the text contains keywords taken from the links in links.txt. If it does, the code randomly picks one matching word from the body of the new HTML page and turns it into a LINK. (Linking words such as "and, who, what, when" are excluded because they are not keywords.)
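
Steps 1 to 5 can be sketched in a few lines, as a simplified stand-in for the full script. The template text, the article text, and the function name fill_template are hypothetical; the sketch uses re.sub with function replacements instead of the script's findall-plus-replace approach, and it skips the unidecode transliteration because the sample text is plain ASCII.

```python
import re

# Hypothetical template and article text, used only for this illustration.
TEMPLATE = """<title>Blah Blah Blah</title>
<meta name="description" content="Old description.">
<!-- ARTICOL START -->
<p>Old text.</p>
<!-- ARTICOL FINAL -->"""

ARTICLE = ("A leader learns to look at every problem from more than one angle "
           "before making a decision that affects the whole team. "
           "The rest of the article follows here.")


def fill_template(template, article):
    words = re.findall(r'\w+', article)
    title = ' '.join(words[:10])          # steps 1 and 2: first 10 words
    description = ' '.join(words[:20])    # step 3: first 20 words
    file_name = '-'.join(words[:10]).lower() + '.html'  # steps 1 and 5

    # Function replacements insert the new text literally, so backslashes
    # in the article cannot be misread as group references.
    html = re.sub(r'<title>.*?</title>',
                  lambda m: '<title>' + title + '</title>',
                  template, count=1)
    html = re.sub(r'(<meta name="description" content=").*?(">)',
                  lambda m: m.group(1) + description + m.group(2),
                  html, count=1)
    # Step 4: swap everything between the two markers for the article text.
    html = re.sub(r'(<!-- ARTICOL START -->)[\s\S]*?(<!-- ARTICOL FINAL -->)',
                  lambda m: m.group(1) + '\n<p>' + article + '</p>\n' + m.group(2),
                  html, count=1)
    return file_name, html


file_name, html = fill_template(TEMPLATE, ARTICLE)
print(file_name)
print(html)
```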

CODE: Copy the code below and run it in any Python IDE (I use PyScripter).

#-------------------------------------------------------------------------------
# Name:        Create html files from text files
# Purpose:
#
# Author:      Neculai Fantanaru
#
# Created:     22/01/2022
# Copyright:   (c) Neculai Fantanaru 2022
#-------------------------------------------------------------------------------
import os
import re
import random
import unidecode
import nltk
from nltk import tokenize
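# Run this once to download the sentence tokenizer data used by sent_tokenize
# (newer NLTK releases may ask for 'punkt_tab' instead of 'punkt'):
# nltk.download('punkt')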
SITE = 'https://neculaifantanaru.com/'
LISTA_CUVINTE_LEGATURA = [
   'in', 'la', 'unei', 'si', 'sa', 'se', 'de', 'prin', 'unde', 'care', 'a',
   'al', 'prea', 'lui', 'din', 'ai', 'unui', 'acei', 'un', 'doar', 'tine',
   'ale', 'sau', 'dintre', 'intre', 'cu', 'ce', 'va', 'fi', 'este', 'cand', 'o',
   'cine', 'aceasta', 'ca', 'dar', 'II', 'III', 'IV', 'V', 'VI', 'VII', 'VIII',
   'to', 'was', 'your', 'you', 'is', 'are', 'iar', 'fara', 'asta', 'pe', 'tu',
   'nu', 'mai', 'ne', 'le', 'intr', 'cum', 'e', 'for', 'she', 'it', 'esti',
   'this', 'that', 'how', 'can', 't', 'must', 'be', 'the', 'and', 'do', 'so', 'or', 'ori',
   'who', 'what', 'if', 'of', 'on', 'i', 'we', 'they', 'them', 'but', 'where', 'by', 'an',
   'mi', '1', '2', '3', '4', '5', '6', '7', '8', '9', '0', 'made', 'my', 'me', '-',
   'vom', 'voi', 'ei', 'cat', 'ar', 'putea', 'poti', 'sunteti', 'inca', 'still', 'noi', 'l',
   'ma', 's', 'dupa', 'after', 'under', 'sub', 'niste', 'some', 'those', 'he', 'no', 'too',
   'fac', 'made', 'make', 'cei', 'most', 'face', 'pentru', 'cat', 'cate', 'much', 'more', 'many',
   'sale', 'tale', 'tau', 'has', 'sunt', 'his', 'yours', 'only', 'as', 'toate', 'all', 'tot', 'incat',
   'which', 'ti', 'asa', 'like', 'these', 'because', 'unor', 'caci', 'ele', 'have', 'haven', 'te',
   'cea', 'else', 'imi', 'iti', 'should', 'could', 'not', 'even', 'chiar', 'when', 'ci', 'ne', 'ni',
   'her', 'our', 'alta', 'another', 'other', 'decat', 'acelasi', 'same', 'au', 'had', 'haven', 'hasn',
   'alte', 'alt', 'others', 'ceea', 'cel', 'cele', 'alte', 'despre', 'about', 'acele', 'acel', 'acea',
   'decit', 'with', '_', 'fata', 'towards', 'against', 'cind', 'dinspre', 'fost', 'been', 'era'
]
PATTERN_LINK = "<a href=\"{}\" target=\"_new\">{}</a>"
'''
structure of the word dictionary
{
    "word1": [list_of_links1],
    "word2": [list_of_links2]
}
'''
CALE_FISIER_LINKURI = "C:\\Folder1\\LINKS\\links.txt"
# DEF is the Python keyword used to define a function
# RULE: def function_name(argument_list)
def preia_cuvinte_link(link):
   cuvinte = link.split('.')[0] # [0] takes the first element; [1] would take the second
   cuvinte = cuvinte.split('-')
   cuvinte_ok = list()
   for cuv in cuvinte:
       if cuv not in LISTA_CUVINTE_LEGATURA:
           cuvinte_ok.append(cuv)
   return cuvinte_ok  # return is used because the result of this function is needed later
def preia_cuvinte_lista_linkuri(cale_fisier_linkuri):
   lista_cuvinte_linkuri = list()
   dictionar_cuvinte_linkuri = dict()
   with open(cale_fisier_linkuri, encoding='utf8') as fp:
       lines = fp.readlines()
       for line in lines:
           # preia_cuvinte_link returns a result that is stored in the variable cuvinte_link
           cuvinte_link = preia_cuvinte_link(line)
           for cuv in cuvinte_link:
               if cuv in dictionar_cuvinte_linkuri.keys():
                   if not SITE + line.strip() in dictionar_cuvinte_linkuri[cuv]:
                       dictionar_cuvinte_linkuri[cuv].append(SITE + line.strip())
               else:
                   dictionar_cuvinte_linkuri[cuv] = [SITE + line.strip()]
           lista_cuvinte_linkuri.extend(cuvinte_link)
   lista_cuvinte_linkuri = list(set(lista_cuvinte_linkuri))
   return lista_cuvinte_linkuri, dictionar_cuvinte_linkuri
def citeste_fisier_linie_cu_linie(cale_fisier):
   with open(cale_fisier, encoding='utf8') as fp:
       lines = fp.readlines()
       count = 0
       for line in lines:
           print(count, line.strip())
           count += 1
def read_text_from_file(file_path):
   """
    Returns the content of a file.
    file_path: path of the file you want to read from
    """
   with open(file_path, encoding='utf8') as f:
       text = f.read()
       return text
def write_to_file(text, file_path):
   """
    Writes a text to a file.
    text: the text you want to write
    file_path: path of the file you want to write to
    """
   with open(file_path, 'wb') as f:
       f.write(text.encode('utf8', 'ignore'))
def split_propozitii(text):
   # 01.02.2022: use a library to extract the sentences
   propozitii = tokenize.sent_tokenize(text)
   # 01.02.2022: strip extra spaces at the start/end of each sentence and capitalize the first letter
   # (note: str.capitalize() also lowercases the rest of the sentence)
   propozitii = [prop.strip().capitalize() for prop in propozitii]
   # 01.02.2022: remove extra spaces before the final punctuation mark. Example: "ana are mere  ?" => "ana are mere?"
   propozitii = [prop[:-1].strip() + prop[-1] for prop in propozitii]
   # 31.01.2022: changed the p tag and added a CSS class (4)
   tag = "<p class=\"mb-40px\">{}</p>"
   text_start_final = ""
   # print(len(propozitii))
   numar_propozitii_grup = 7
   numar_grupuri = len(propozitii) // numar_propozitii_grup
   start = 0
   LINK_INTRODUS = 0
   for numar_grup in range(numar_grupuri):
       # print("Iteratia: ", numar_grup)
       lista_cuvinte_gasite = list()
       if numar_grup != 0 and numar_grup != numar_grupuri - 1:
           # 31.01.2022: fixed bug (1)
           text_tag = " ".join(propozitii[start:(start + numar_propozitii_grup)])
           if LINK_INTRODUS == 0:
               cuvinte = re.findall(r' (?:\w|-*\!)+[ ,]', text_tag)
               cuvinte_linkuri, dictionar_linkuri = preia_cuvinte_lista_linkuri(CALE_FISIER_LINKURI)
               for cuv in cuvinte:
                   cuv_fara_semne = cuv.replace(' ', '')
                   cuv_fara_semne = cuv_fara_semne.replace(',', '')
                   if cuv_fara_semne in dictionar_linkuri.keys():
                       lista_cuvinte_gasite.append(cuv)
               lista_cuvinte_gasite = list(set(lista_cuvinte_gasite))
               # insert a link only if this group contains at least one keyword
               # (picking from an empty list would raise an exception)
               if lista_cuvinte_gasite:
                   cuvant_random = random.choice(lista_cuvinte_gasite)
                   cuvant_random_fara_semne = cuvant_random.replace(' ', '')
                   cuvant_random_fara_semne = cuvant_random_fara_semne.replace(',', '')
                   link_random = random.choice(dictionar_linkuri[cuvant_random_fara_semne])
                   # a single linked word; the cleaned word keeps the comma out of the anchor text
                   pattern = PATTERN_LINK.format(link_random, cuvant_random_fara_semne)
                   # replace the first whole-word occurrence only
                   text_tag = re.sub(r'\b' + re.escape(cuvant_random_fara_semne) + r'\b',
                                     lambda m: pattern, text_tag, count=1)
                   LINK_INTRODUS = 1
               # two linked words (alternative, disabled)
               '''
                expresie_regulata = cuvant_random.strip() + r' *\w+'
                urmatorul_cuvant = re.findall(expresie_regulata, text_tag)[0]
                pattern = PATTERN_LINK.format(link_random, urmatorul_cuvant)
                text_tag = text_tag.replace(urmatorul_cuvant, pattern, 1)
                LINK_INTRODUS = 1
                '''
           text_tag = tag.format(text_tag)
           text_start_final = text_start_final + '\n' + text_tag
           start = start + numar_propozitii_grup
       else:
           # 31.01.2022: fixed bug (1)
           text_tag = " ".join(propozitii[start:(start + numar_propozitii_grup)])
           text_tag = tag.format(text_tag)
           text_start_final = text_start_final + '\n' + text_tag
           start = start + numar_propozitii_grup
   # the remaining sentences, if any, form the last paragraph
   if start < len(propozitii):
       text_tag = " ".join(propozitii[start:len(propozitii)])
       text_tag = tag.format(text_tag)
       text_start_final = text_start_final + '\n' + text_tag
   # print(text_start_final)
   # 31.01.2022: checked, the paragraphs are displayed nicely one below the other (5)
   return text_start_final
def copiaza_continut_txt_html(cale_fisier_txt, cale_fisier_html): # these are the function's arguments, supplied when the function is called
   # read the text from the file
   text_txt = read_text_from_file(cale_fisier_txt)
   # split on '\n'
   lines = text_txt.splitlines()
   ok_lines = list()
   for line in lines:
       if line == '' or line == '\ufeff':
           continue
       else:
           ok_lines.append(line)
   # 02.02.2022: the title is made of the first 10 words of the text
   # title_words = re.findall(r'(?:\w|-*\!)+', ok_lines[0])
   title_words = re.findall(r'(?:\w|-*\!)+', ok_lines[0])[:10]
   description_words = re.findall(r'(?:\w|-*\!)+', ok_lines[0])
   description_words = u' '.join(description_words[:20])
   # print("title: ", title_words)
   # print("description: ", description_words)
   text_html = read_text_from_file(cale_fisier_html)
   # regex pattern for the article body; ([\s\S]*?) captures everything between the two markers
   # adjust the regular expression to match whichever tag you want to replace
   articol_pattern = re.compile(r'<!-- ARTICOL START -->([\s\S]*?)<!-- ARTICOL FINAL -->')
   text_articol = re.findall(articol_pattern, text_html)
   if len(text_articol) != 0:
       text_articol = text_articol[0]
       text_txt = split_propozitii(text_txt)
       text_txt = '\n\n' + text_txt + '\n\n'
       text_html = text_html.replace(text_articol, text_txt)
   else:
       print("HTML file has no ARTICOL START/FINAL markers.")
   title_pattern = re.compile('<title>(.*?)</title>')
   text_title = re.findall(title_pattern, text_html)
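   # 01.02.2022: replace the h3 text with the title (2)
   # note: this pattern expects the heading text wrapped in an <a> tag; a bare
   # <h3 class="font-weight-normal">...</h3> will not match
   h3_pattern = re.compile(r'<h3 class="font-weight-normal"><a href="javascript:void\(0\)" class="color-black">(.*?)</a></h3>')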
   text_h3 = re.findall(h3_pattern, text_html)
   if len(text_title) != 0:
       text_title = text_title[0]
       # replace punctuation marks
       expresii_regex = [r'\.', r'\,', r'\?', r'\!', r'\:', r'\;', r'\"']
       for exp_reg in expresii_regex:
           title_words = [re.sub(exp_reg, '-', word) for word in title_words]
       # build the new file name (link)
       new_filename = u'-'.join(title_words).lower()
       new_file_name_fara_spatiu = unidecode.unidecode(new_filename)
       new_file_name_fara_spatiu = new_file_name_fara_spatiu + '.html'
       # replace the title text with the first 10 words
       text_html = text_html.replace(text_title,  u' '.join(title_words))
       # 01.02.2022: replace the h3 text with the title (2)
       if len(text_h3) != 0:
           text_h3 = text_h3[0]
           text_html = text_html.replace(text_h3, u' '.join(title_words))
       else:
           print("The file has no h3 tag.")
       # 07.02.2022: replace the URL in the canonical tag
       canonical_tag_pattern = re.compile('<link rel="canonical" href="(.*?)" />')
       canonical_tag = re.findall(canonical_tag_pattern, text_html)
       if len(canonical_tag) != 0:
           canonical_tag = canonical_tag[0]
           # to write only the file name, without the site prefix, use this line instead of the one below
           #text_html = text_html.replace(canonical_tag, new_file_name_fara_spatiu)
           text_html = text_html.replace(canonical_tag, SITE + new_file_name_fara_spatiu)
       else:
           print("File has no canonical tag.")
   else:
       print("HTML file has no title.")
       # without a <title> no output file name was built, so stop here
       return
   description_pattern = re.compile('<meta name="description" content="(.*?)">')
   text_description = re.findall(description_pattern, text_html)
   if len(text_description) != 0:
       text_description = text_description[0]
       # print("text description: ", text_description)
       text_html = text_html.replace(text_description, description_words)
   else:
       print("HTML file has no meta description.")
   # the fisiere_html folder must already exist next to the text files
   file_path = os.path.join(os.path.dirname(cale_fisier_txt), "fisiere_html", new_file_name_fara_spatiu)
   write_to_file(text_html, file_path)
   # print("File: ", new_file_name_fara_spatiu)
   print("File written successfully.")
def creare_fisiere_html(cale_folder_txt, cale_fisier_html):
   """
    Iterates over a folder that contains txt files and creates the corresponding html files.
    """
   count = 0
   for f in os.listdir(cale_folder_txt):
       if f.endswith('.txt'):
           cale_fisier_txt = cale_folder_txt + "\\" + f
           copiaza_continut_txt_html(cale_fisier_txt, cale_fisier_html)
           count += 1
   print("Number of files processed: ", count)
def main():
   # the second argument is the path to the HTML template (oana.html in the steps above)
   creare_fisiere_html("C:\\Folder1", "C:\\Folder1\\oana.html")
   # lista_cuvinte, dictionar_cuvinte = preia_cuvinte_lista_linkuri(CALE_FISIER_LINKURI)
   # print(len(lista_cuvinte)) # len returns the size
   # print(dictionar_cuvinte)
if __name__ == '__main__':
   main()
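
The paragraph-building idea in split_propozitii can be tried in isolation with the sketch below. It is an approximation under stated assumptions: the regex split is a naive stand-in for nltk's sent_tokenize, the function name group_into_paragraphs and the sample text are made up for the example, and link insertion is left out.

```python
import re


def group_into_paragraphs(text, sentences_per_group=7,
                          tag='<p class="mb-40px">{}</p>'):
    # Naive split after ., ! or ? followed by whitespace; a stand-in for
    # nltk.tokenize.sent_tokenize, which handles abbreviations far better.
    sentences = [s.strip()
                 for s in re.split(r'(?<=[.!?])\s+', text)
                 if s.strip()]
    paragraphs = []
    # Step through the sentences in chunks; the last chunk may be shorter.
    for start in range(0, len(sentences), sentences_per_group):
        group = sentences[start:start + sentences_per_group]
        paragraphs.append(tag.format(' '.join(group)))
    return '\n'.join(paragraphs)


# Hypothetical sample: 16 short sentences.
sample = ' '.join('Sentence number {}.'.format(i) for i in range(1, 17))
result = group_into_paragraphs(sample)
print(result)
```

With 16 sentences and groups of 7, this yields three paragraphs holding 7, 7 and 2 sentences.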

That's all folks.

If you like my code, then do me a favor: translate your website into Romanian, "ro".

Also see VERSION 2, VERSION 3, VERSION 4, VERSION 5, VERSION 6, or VERSION 7.
