You can view the full code here: https://pastebin.com/6gNGpbV6
Install Python.
You need the following files, which you can put in the same folder:
- Create an empty file called rss.xml (this is the output file, where the script writes one sitemap entry for each of your html pages)
- Create a file called final_xml.txt and copy this closing tag onto its first line:
</urlset>
- Create a file called start_xml.txt that begins with the following code:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xhtml="http://www.w3.org/1999/xhtml" xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd" xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<!-- www.check-domains.com sitemap generator -->
- Create a file called model_xml.txt that begins with the following code:
<url>
  <loc>https://YOUR-WEBSITE.com/example-page.html</loc>
  <lastmod>2021-11-30T17:19:37+00:00</lastmod>
  <changefreq>weekly</changefreq>
  <priority>0.6400</priority>
</url>
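If you would rather not create the four files by hand, the setup above can be sketched in a few lines of Python. This is only a convenience sketch: the function name create_template_files and the constants are my own illustrative names, not part of the original script, and the file contents are simply the snippets listed above.

```python
from pathlib import Path

# Contents copied from the setup steps above.
START_XML = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<urlset xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" '
    'xmlns:xhtml="http://www.w3.org/1999/xhtml" '
    'xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 '
    'http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd" '
    'xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
    '<!-- www.check-domains.com sitemap generator -->'
)

MODEL_XML = (
    '<url>\n'
    '  <loc>https://YOUR-WEBSITE.com/example-page.html</loc>\n'
    '  <lastmod>2021-11-30T17:19:37+00:00</lastmod>\n'
    '  <changefreq>weekly</changefreq>\n'
    '  <priority>0.6400</priority>\n'
    '</url>'
)

FINAL_XML = '</urlset>'


def create_template_files(folder='.'):
    """Hypothetical helper: write the four setup files into `folder`."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    contents = {
        'rss.xml': '',               # empty output file
        'start_xml.txt': START_XML,  # sitemap header
        'model_xml.txt': MODEL_XML,  # template for one <url> entry
        'final_xml.txt': FINAL_XML,  # closing tag
    }
    for name, text in contents.items():
        (folder / name).write_text(text, encoding='utf8')
    return sorted(contents)


if __name__ == '__main__':
    print(create_template_files('.'))
```

Run it once in the folder where you keep the sitemap script. Note that it overwrites any files with the same names.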
PYTHON CODE (save it anywhere, just change the path to point to the folder that holds your html files, and replace YOUR-WEBSITE.com with your own domain)
import os
import re


def read_text_from_file(file_path):
    """
    Return the contents of a file.
    file_path: path to the file you want to read
    """
    with open(file_path, encoding='utf8') as f:
        text = f.read()
    return text


def write_to_file(text, file_path):
    """
    Write a text to a file.
    text: the text you want to write
    file_path: path to the file you want to write to
    """
    with open(file_path, 'wb') as f:
        f.write(text.encode('utf8', 'ignore'))


def creeaza_fisier_xml():
    # Collect the .html file names from your folder, minus the ignored ones
    lista_nume_fisiere = preia_nume_fisiere_html(
        'e:\\YOUR-FOLDER-WITH-HTML-FILES',
        # files to ignore
        ['404-1.html', '404-2.html', '404-3.html']
    )
    start_xml = read_text_from_file('start_xml.txt')
    final_xml = read_text_from_file('final_xml.txt')
    model_xml = read_text_from_file('model_xml.txt')
    xml_text = start_xml + '\n'
    # Find the placeholder URL inside the <loc> tag of the model
    link_pattern = re.compile('<loc>(.*?)</loc>')
    link = re.findall(link_pattern, model_xml)
    if len(link) == 0:
        raise ValueError('model_xml.txt must contain a <loc>...</loc> tag')
    link = link[0]
    for nume_fisier in lista_nume_fisiere:
        # Swap the placeholder URL for the URL of the current page
        model = model_xml.replace(link, 'https://YOUR-WEBSITE.com/' + nume_fisier)
        xml_text = xml_text + model + '\n'
    xml_text = xml_text + final_xml
    write_to_file(xml_text, 'rss.xml')
    print("File written successfully.")


def preia_nume_fisiere_html(folder_fisiere_html, lista_fisiere_de_ignorat):
    # Keep only the .html files from the folder
    lista_nume_fisiere_html = list()
    for f in os.listdir(folder_fisiere_html):
        if f.endswith('.html'):
            lista_nume_fisiere_html.append(f)
    # Drop the files that must not appear in the sitemap
    lista_finala = list()
    for f in lista_nume_fisiere_html:
        if f not in lista_fisiere_de_ignorat:
            lista_finala.append(f)
    return lista_finala


def main():
    creeaza_fisier_xml()


if __name__ == '__main__':
    main()
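To see what the script does with the model before you run it on a real folder, here is a minimal sketch of the substitution step on its own, working on strings in memory. The function name build_entries and the two page names are made up for illustration. The regular expression and the replace call are the same ones used in the script above.

```python
import re

# Same content as model_xml.txt.
MODEL_XML = (
    '<url>\n'
    '  <loc>https://YOUR-WEBSITE.com/example-page.html</loc>\n'
    '  <lastmod>2021-11-30T17:19:37+00:00</lastmod>\n'
    '  <changefreq>weekly</changefreq>\n'
    '  <priority>0.6400</priority>\n'
    '</url>'
)


def build_entries(model_xml, base_url, file_names):
    """Hypothetical helper: return one <url> block per file name."""
    match = re.search(r'<loc>(.*?)</loc>', model_xml)
    if match is None:
        raise ValueError('the model has no <loc>...</loc> tag')
    placeholder = match.group(1)  # the example URL inside the model
    # Swap the example URL for the real URL of each page.
    return [model_xml.replace(placeholder, base_url + name)
            for name in file_names]


# Illustrative page names, not taken from a real site.
entries = build_entries(
    MODEL_XML,
    'https://YOUR-WEBSITE.com/',
    ['index.html', 'about.html'],
)
print(entries[0])
```

Every entry keeps the lastmod, changefreq and priority values from the model, so all pages in the generated file share the same date and priority. Only the URL inside the loc tag changes.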
That's all folks.
If you like my code, then do me a favor: translate your website into Romanian, "ro".
Also, see this VERSION 2 or VERSION 3 or VERSION 4 or VERSION 5 or VERSION 6 or VERSION 7