ro  fr  en  es  pt  ar  zh  hi  de  ru
ART 2.0 ART 3.0 ART 4.0 ART 5.0 ART 6.0 Pinterest

Ahoana ny fomba hamoronana processor batch miaraka amin'ny python sy regex hanoloana ny tags html (parsing)

On Iunie 16, 2021, in Leadership and Attitude, by Neculai Fantanaru

Azonao atao ny mijery ny kaody feno:Https: // passin.com / i num5q dia 5

Ohatra iray momba ny pejy HTML izay hovana amin'ny kaody python. Adikao amin'ny rakitra .html ny lahatsoratra etsy ambony, afatory amin'ny toeranaC: \ Folder1

   

 xmlns="http://www.w3.org/1999/xhtml" dir="ltr" lang="ro">

Ahoana ny fomba hamoronana processor batch miaraka amin'ny python sy regex hanoloana ny tags html (parsing)
 rel="canonical" href="https://MY-WEBSITE.COM" />
 name="description" content="I LOVE HTML and CSS"/>

 name="keywords" content="abordarea frontala a lucrurilor neelucidate"/>
 name="abstract" content="My laptop works just fine"/>
 name="Subject" content="I think I need a new car."/>
 property="og:url" content="https://otherwebsite.com"/>
 property="og:title" content="Nobody is here?" />
 property="og:description" content="Dance is my passion."/>





Ny kaody popershell eto ambany dia handika ny votoatin'ny Tags HTML amin'ny marika hafa amin'ny alàlan'ny fanoratana ny angon-drakitra. Mila mameno ny tags fotsiny ianaoSY

import requests
import re

# Path to english folder 1
english_folder1 = r"c:\Folder1"

# Path to english folder 2
english_folder2 = r"c:\Folder1"

extension_file = ".html"

use_parse_folder = True #Face folder nou daca pui True, iar daca pui False redenumeste fisierele in acelasi folder

import os

en1_directory = os.fsencode(english_folder1)
en2_directory = os.fsencode(english_folder2)

print('Going through english folder')
for file in os.listdir(en1_directory):
    filename = os.fsdecode(file)
    print(filename)
    if filename == 'y_key_e479323ce281e459.html' or filename == 'TS_4fg4_tr78.html':
        continue
    if filename.endswith(extension_file):
        with open(os.path.join(english_folder1, filename), encoding='utf-8') as html:
            html = html.read()

            try:
                with open(os.path.join(english_folder2, filename), encoding='utf-8') as en_html:
                    en_html = en_html.read()

                    if False:  # if True: will Parse also the content that starts from  to  and so on
                        try:
                            comment_body = re.search('.+', html, flags=re.DOTALL)[0]
                            en_html = re.sub('.+', comment_body, en_html, flags=re.DOTALL)
                        except:
                            pass

                        try:
                            comment_body2 = re.search('.+', html, flags=re.DOTALL)[0]
                            en_html = re.sub('.+', comment_body2, en_html, flags=re.DOTALL)
                        except:
                            pass

                        try:
                            comment_body3 = re.search('.+', html, flags=re.DOTALL)[0]
                            en_html = re.sub('.+', comment_body3, en_html, flags=re.DOTALL)
                        except:
                            pass

                    # title to meta
                    try:
                        title = re.search('Ahoana ny fomba hamoronana processor batch miaraka amin'ny python sy regex hanoloana ny tags html (parsing)', html)[0]
                        title_content = re.search('>(.+)<', title)[1]
                    except:
                        pass

                    try:
                        meta_og_title = re.search('', en_html)[0]
                        new_meta_og_title = re.sub(r'content=".+"', f'content="{title_content}"', meta_og_title)
                        en_html = en_html.replace(meta_og_title, new_meta_og_title)
                    except:
                        pass

                    try:
                        meta_keywords = re.search('', en_html)[0]
                        new_meta_keywords = re.sub(r'content=".+"', f'content="{title_content}"', meta_keywords)
                        en_html = en_html.replace(meta_keywords, new_meta_keywords)
                    except:
                        pass

                    try:
                        meta_abstract = re.search('', en_html)[0]
                        new_meta_abstract = re.sub(r'content=".+"', f'content="{title_content}"', meta_abstract)
                        en_html = en_html.replace(meta_abstract, new_meta_abstract)
                    except:
                        pass

                    try:
                        meta_Subject = re.search('', en_html)[0]
                        new_meta_Subject = re.sub(r'content=".+"', f'content="{title_content}"', meta_Subject)
                        en_html = en_html.replace(meta_Subject, new_meta_Subject)
                    except:
                        pass

                    try:
                        headline = re.search('"headline":.+', en_html)[0]
                        new_headline = re.sub(r':.+', f': "{title_content}",', headline)
                        en_html = en_html.replace(headline, new_headline)
                    except:
                        pass

                    try:
                        keywords = re.search('"keywords": "Ahoana ny fomba hamoronana processor batch miaraka amin'ny python sy regex hanoloana ny tags html (parsing)",
                        new_keywords = re.sub(r':.+', f': "{title_content}",', keywords)
                        en_html = en_html.replace(keywords, new_keywords)
                    except:
                        pass

                    # canonical to meta og:url and @id
                    try:
                        canonical_content = re.search('', html)[1]
                    except:
                        pass

                    try:
                        og_url = re.search('', en_html)[0]
                        new_og_url = re.sub(r'content=".+"', f'content="{canonical_content}"', og_url)
                        en_html = en_html.replace(og_url, new_og_url)
                    except:
                        pass

                    try:
                        id = re.search('"@id":.+', en_html)[0]
                        new_id = re.sub(r':.+', f': "{canonical_content}"', id)
                        en_html = en_html.replace(id, new_id)
                    except:
                        pass

                    # meta description to og:description and description
                    try:
                        meta = re.search(']
                        meta_description = re.search(']
                    except:
                        pass

                    try:
                        og_description = re.search('', en_html)[0]
                        new_og_description = re.sub(r'content=".+"', f'content="{meta_description}"', og_description)
                        en_html = en_html.replace(og_description, new_og_description)
                    except:
                        pass

                    try:
                        description = re.search('"description": "Ahoana ny fomba hamoronana processor batch miaraka amin'ny python sy regex hanoloana ny tags html (parsing) |  Neculai Fantanaru",
                        new_description = re.sub(r':.+', f': "{meta_description}",', description)
                        en_html = en_html.replace(description, new_description)
                    except:
                        pass

                    try:
                        en_html = re.sub(', meta, en_html)
                    except:
                        pass

                    try:
                        en_html = re.sub('Ahoana ny fomba hamoronana processor batch miaraka amin'ny python sy regex hanoloana ny tags html (parsing)', title, en_html)
                    except:
                        pass
            except FileNotFoundError:
                continue

        print(f'{filename} parsed')
        if use_parse_folder:
            try:
                with open(os.path.join(english_folder2+r'\parsed', 'parsed_'+filename), 'w', encoding='utf-8') as new_html:
                    new_html.write(en_html)
            except:
                os.mkdir(english_folder2+r'\parsed')
                with open(os.path.join(english_folder2+r'\parsed', 'parsed_'+filename), 'w', encoding='utf-8') as new_html:
                    new_html.write(en_html)
        else:
            with open(os.path.join(english_folder2, 'parsed_'+filename), 'w', encoding='utf-8') as html:
                html.write(en_html)
  

Safidy. Ity misy fanehoana regex izay hanova ny tag "Keywords" ao amin'ny pejin'ny HTML, manampy koma aorian'ny teny tsirairay.

Ampiasao amin'ny Netepad ++ -> CTR + F -> Zahao: fanamarihana mahazatra

SEARCH: (?s)<title>.*?<\/title>.*?<meta\x20name="keywords"\x20content="\K(\w+)|\G[^\w\r\n]+(\w+)  
REPLACE BY:  ?1\l\1:,\x20\l\2

Azonao atao ny manandrana ity karapanondro sarotra ity, izay misimisy kokoa: Raiso ny angon-drakitra avy amin'ny tag ary kopy amin'ny tag Azonao jerena eto ny kaody eto:

HTTPS: // passatin.com / JM5 Governemanta 2Qs

import requests
import re

# Path to english folder 1

english_folder2 = r"c:\Folder1"

extension_file = ".html"

use_parse_folder = True

import os

en1_directory = os.fsencode(english_folder2)
en2_directory = os.fsencode(english_folder2)

# These connection words will be ignore when parsing data from  tag to <meta keywords> tag</span>
LISTA_CUVINTE_LEGATURA <span style="color: #333333">=</span> [
    <span style="background-color: #fff0f0">'in'</span>, <span style="background-color: #fff0f0">'la'</span>, <span style="background-color: #fff0f0">'unei'</span>, <span style="background-color: #fff0f0">'si'</span>, <span style="background-color: #fff0f0">'sa'</span>, <span style="background-color: #fff0f0">'se'</span>, <span style="background-color: #fff0f0">'de'</span>, <span style="background-color: #fff0f0">'prin'</span>, <span style="background-color: #fff0f0">'unde'</span>, <span style="background-color: #fff0f0">'care'</span>, <span style="background-color: #fff0f0">'a'</span>,
    <span style="background-color: #fff0f0">'al'</span>, <span style="background-color: #fff0f0">'prea'</span>, <span style="background-color: #fff0f0">'lui'</span>, <span style="background-color: #fff0f0">'din'</span>, <span style="background-color: #fff0f0">'ai'</span>, <span style="background-color: #fff0f0">'unui'</span>, <span style="background-color: #fff0f0">'acei'</span>, <span style="background-color: #fff0f0">'un'</span>, <span style="background-color: #fff0f0">'doar'</span>, <span style="background-color: #fff0f0">'tine'</span>,
    <span style="background-color: #fff0f0">'ale'</span>, <span style="background-color: #fff0f0">'sau'</span>, <span style="background-color: #fff0f0">'dintre'</span>, <span style="background-color: #fff0f0">'intre'</span>, <span style="background-color: #fff0f0">'cu'</span>,<span style="background-color: #fff0f0">'ce'</span>, <span style="background-color: #fff0f0">'va'</span>, <span style="background-color: #fff0f0">'fi'</span>, <span style="background-color: #fff0f0">'este'</span>, <span style="background-color: #fff0f0">'cand'</span>, <span style="background-color: #fff0f0">'o'</span>,
    <span style="background-color: #fff0f0">'cine'</span>, <span style="background-color: #fff0f0">'aceasta'</span>, <span style="background-color: #fff0f0">'ca'</span>, <span style="background-color: #fff0f0">'dar'</span>, <span style="background-color: #fff0f0">'II'</span>, <span style="background-color: #fff0f0">'III'</span>, <span style="background-color: #fff0f0">'IV'</span>, <span style="background-color: #fff0f0">'V'</span>, <span style="background-color: #fff0f0">'VI'</span>, <span style="background-color: #fff0f0">'VII'</span>, <span style="background-color: #fff0f0">'VIII'</span>,
    <span style="background-color: #fff0f0">'to'</span>, <span style="background-color: #fff0f0">'was'</span>, <span style="background-color: #fff0f0">'your'</span>, <span style="background-color: #fff0f0">'you'</span>, <span style="background-color: #fff0f0">'is'</span>, <span style="background-color: #fff0f0">'are'</span>, <span style="background-color: #fff0f0">'iar'</span>, <span style="background-color: #fff0f0">'fara'</span>, <span style="background-color: #fff0f0">'aceasta'</span>, <span style="background-color: #fff0f0">'pe'</span>, <span style="background-color: #fff0f0">'tu'</span>,
    <span style="background-color: #fff0f0">'nu'</span>, <span style="background-color: #fff0f0">'mai'</span>, <span style="background-color: #fff0f0">'ne'</span>, <span style="background-color: #fff0f0">'le'</span>, <span style="background-color: #fff0f0">'intr'</span>, <span style="background-color: #fff0f0">'cum'</span>, <span style="background-color: #fff0f0">'e'</span>, <span style="background-color: #fff0f0">'for'</span>, <span style="background-color: #fff0f0">'she'</span>, <span style="background-color: #fff0f0">'it'</span>, <span style="background-color: #fff0f0">'esti'</span>,
	<span style="background-color: #fff0f0">'this'</span>, <span style="background-color: #fff0f0">'that'</span>, <span style="background-color: #fff0f0">'how'</span>, <span style="background-color: #fff0f0">'can'</span>, <span style="background-color: #fff0f0">'t'</span>, <span style="background-color: #fff0f0">'must'</span>, <span style="background-color: #fff0f0">'be'</span>, <span style="background-color: #fff0f0">'the'</span>, <span style="background-color: #fff0f0">'and'</span>, <span style="background-color: #fff0f0">'do'</span>, <span style="background-color: #fff0f0">'so'</span>, <span style="background-color: #fff0f0">'or'</span>, <span style="background-color: #fff0f0">'ori'</span>,
	<span style="background-color: #fff0f0">'who'</span>, <span style="background-color: #fff0f0">'what'</span>, <span style="background-color: #fff0f0">'if'</span>, <span style="background-color: #fff0f0">'of'</span>, <span style="background-color: #fff0f0">'on'</span>, <span style="background-color: #fff0f0">'i'</span>, <span style="background-color: #fff0f0">'we'</span>, <span style="background-color: #fff0f0">'they'</span>, <span style="background-color: #fff0f0">'them'</span>, <span style="background-color: #fff0f0">'but'</span>, <span style="background-color: #fff0f0">'where'</span>, <span style="background-color: #fff0f0">'by'</span>, <span style="background-color: #fff0f0">'an'</span>,
	<span style="background-color: #fff0f0">'on'</span>, <span style="background-color: #fff0f0">'1'</span>, <span style="background-color: #fff0f0">'2'</span>, <span style="background-color: #fff0f0">'3'</span>, <span style="background-color: #fff0f0">'4'</span>, <span style="background-color: #fff0f0">'5'</span>, <span style="background-color: #fff0f0">'6'</span>, <span style="background-color: #fff0f0">'7'</span>, <span style="background-color: #fff0f0">'8'</span>, <span style="background-color: #fff0f0">'9'</span>, <span style="background-color: #fff0f0">'0'</span>, <span style="background-color: #fff0f0">'made'</span>, <span style="background-color: #fff0f0">'make'</span>, <span style="background-color: #fff0f0">'my'</span>, <span style="background-color: #fff0f0">'me'</span>, <span style="background-color: #fff0f0">'-'</span>,
	<span style="background-color: #fff0f0">'vom'</span>, <span style="background-color: #fff0f0">'voi'</span>, <span style="background-color: #fff0f0">'ei'</span>, <span style="background-color: #fff0f0">'cat'</span>, <span style="background-color: #fff0f0">'ar'</span>, <span style="background-color: #fff0f0">'putea'</span>, <span style="background-color: #fff0f0">'poti'</span>, <span style="background-color: #fff0f0">'sunteti'</span>, <span style="background-color: #fff0f0">'inca'</span>, <span style="background-color: #fff0f0">'still'</span>, <span style="background-color: #fff0f0">'noi'</span>, <span style="background-color: #fff0f0">'l'</span>,
	<span style="background-color: #fff0f0">'ma'</span>, <span style="background-color: #fff0f0">'s'</span>, <span style="background-color: #fff0f0">'dupa'</span>, <span style="background-color: #fff0f0">'after'</span>, <span style="background-color: #fff0f0">'under'</span>, <span style="background-color: #fff0f0">'sub'</span>, <span style="background-color: #fff0f0">'niste'</span>, <span style="background-color: #fff0f0">'some'</span>, <span style="background-color: #fff0f0">'those'</span>, <span style="background-color: #fff0f0">'he'</span>
]

<span style="color: #008800; font-weight: bold">def</span> <span style="color: #0066BB; font-weight: bold">creeaza_lista_keywords</span>(titlu):
    <span style="color: #888888"># imparte titlul in 2 in functie de bara verticala |</span>
    prima_parte_titlu <span style="color: #333333">=</span> titlu<span style="color: #333333">.</span>split(<span style="background-color: #fff0f0">'|'</span>)[<span style="color: #0000DD; font-weight: bold">0</span>]
    <span style="color: #888888"># extrage toate cuvintele din prima parte a titlului</span>
    keywords <span style="color: #333333">=</span> re<span style="color: #333333">.</span>findall(<span style="background-color: #fff0f0">r'(?:\w|-*\!)+'</span>, prima_parte_titlu)
    <span style="color: #888888"># extrage keyword-urile care nu se gasesc in lista de cuvinte de legatura</span>
    keywords_OK <span style="color: #333333">=</span> <span style="color: #007020">list</span>()
    <span style="color: #008800; font-weight: bold">for</span> keyword <span style="color: #000000; font-weight: bold">in</span> keywords:
        <span style="color: #008800; font-weight: bold">if</span> keyword <span style="color: #000000; font-weight: bold">not</span> <span style="color: #000000; font-weight: bold">in</span> LISTA_CUVINTE_LEGATURA:
            <span style="color: #888888"># adauga keyword-ul cu litere mici</span>
            keywords_OK<span style="color: #333333">.</span>append(keyword<span style="color: #333333">.</span>lower())
    <span style="color: #888888"># returneaza un string in care toate keyword-urile sunt alaturate prin ', '</span>
    <span style="color: #008800; font-weight: bold">return</span> <span style="background-color: #fff0f0">", "</span><span style="color: #333333">.</span>join(keywords_OK)


<span style="color: #008800; font-weight: bold">print</span>(<span style="background-color: #fff0f0">'Going through english folder'</span>)
amount <span style="color: #333333">=</span> <span style="color: #0000DD; font-weight: bold">1</span>
<span style="color: #008800; font-weight: bold">for</span> <span style="color: #007020">file</span> <span style="color: #000000; font-weight: bold">in</span> os<span style="color: #333333">.</span>listdir(en1_directory):
    filename <span style="color: #333333">=</span> os<span style="color: #333333">.</span>fsdecode(<span style="color: #007020">file</span>)
    <span style="color: #008800; font-weight: bold">print</span>(filename)
    <span style="color: #008800; font-weight: bold">if</span> filename <span style="color: #333333">==</span> <span style="background-color: #fff0f0">'y_key_e479323ce281e459.html'</span> <span style="color: #000000; font-weight: bold">or</span> filename <span style="color: #333333">==</span> <span style="background-color: #fff0f0">'directory.html'</span>:
        <span style="color: #008800; font-weight: bold">continue</span>
    <span style="color: #008800; font-weight: bold">if</span> filename<span style="color: #333333">.</span>endswith(extension_file):
        <span style="color: #008800; font-weight: bold">with</span> <span style="color: #007020">open</span>(os<span style="color: #333333">.</span>path<span style="color: #333333">.</span>join(english_folder2, filename), encoding<span style="color: #333333">=</span><span style="background-color: #fff0f0">'utf-8'</span>) <span style="color: #008800; font-weight: bold">as</span> html:
            html <span style="color: #333333">=</span> html<span style="color: #333333">.</span>read()

            <span style="color: #008800; font-weight: bold">try</span>:
                <span style="color: #008800; font-weight: bold">with</span> <span style="color: #007020">open</span>(os<span style="color: #333333">.</span>path<span style="color: #333333">.</span>join(english_folder2, filename), encoding<span style="color: #333333">=</span><span style="background-color: #fff0f0">'utf-8'</span>) <span style="color: #008800; font-weight: bold">as</span> en_html:
                    en_html <span style="color: #333333">=</span> en_html<span style="color: #333333">.</span>read()

                    <span style="color: #888888"># title to meta</span>
                    <span style="color: #008800; font-weight: bold">try</span>:
                        title <span style="color: #333333">=</span> re<span style="color: #333333">.</span>search(<span style="background-color: #fff0f0">'<title>Ahoana ny fomba hamoronana processor batch miaraka amin'ny python sy regex hanoloana ny tags html (parsing)', html)[0]
                        title_content = re.search('>(.+)<', title)[1]
                    except:
                        pass

                    try:
                        meta_og_title = re.search('', en_html)[0]
                        new_meta_og_title = re.sub(r'content=".+"', f'content="{title_content}"', meta_og_title)
                        en_html = en_html.replace(meta_og_title, new_meta_og_title)
                    except:
                        pass

                    try:
                        meta_keywords = re.search('', en_html)[0]
                        keywords = creeaza_lista_keywords(title_content)
                        new_meta_keywords = re.sub(r'content=".+"', f'content="{keywords}"', meta_keywords)
                        en_html = en_html.replace(meta_keywords, new_meta_keywords)
                    except:
                        pass

                    try:
                        meta_abstract = re.search('', en_html)[0]
                        new_meta_abstract = re.sub(r'content=".+"', f'content="{title_content}"', meta_abstract)
                        en_html = en_html.replace(meta_abstract, new_meta_abstract)
                    except:
                        pass

                    try:
                        meta_Subject = re.search('', en_html)[0]
                        new_meta_Subject = re.sub(r'content=".+"', f'content="{title_content}"', meta_Subject)
                        en_html = en_html.replace(meta_Subject, new_meta_Subject)
                    except:
                        pass

                    try:
                        headline = re.search('"headline":.+', en_html)[0]
                        new_headline = re.sub(r':.+', f': "{title_content}",', headline)
                        en_html = en_html.replace(headline, new_headline)
                    except:
                        pass

                    try:
                        keywords = re.search('"keywords": "Ahoana ny fomba hamoronana processor batch miaraka amin'ny python sy regex hanoloana ny tags html (parsing)",
                        new_keywords = re.sub(r':.+', f': "{title_content}",', keywords)
                        en_html = en_html.replace(keywords, new_keywords)
                    except:
                        pass

                    # canonical to meta og:url and @id
                    try:
                        canonical_content = re.search('', html)[1]
                    except:
                        pass

                    try:
                        og_url = re.search('', en_html)[0]
                        new_og_url = re.sub(r'content=".+"', f'content="{canonical_content}"', og_url)
                        en_html = en_html.replace(og_url, new_og_url)
                    except:
                        pass

                    try:
                        id = re.search('"@id":.+', en_html)[0]
                        new_id = re.sub(r':.+', f': "{canonical_content}"', id)
                        en_html = en_html.replace(id, new_id)
                    except:
                        pass

                    # meta description to og:description and description
                    try:
                        meta = re.search(']
                        meta_description = re.search(']
                    except:
                        pass

                    try:
                        og_description = re.search('', en_html)[0]
                        new_og_description = re.sub(r'content=".+"', f'content="{meta_description}"', og_description)
                        en_html = en_html.replace(og_description, new_og_description)
                    except:
                        pass

                    try:
                        description = re.search('"description": "Ahoana ny fomba hamoronana processor batch miaraka amin'ny python sy regex hanoloana ny tags html (parsing) |  Neculai Fantanaru",
                        new_description = re.sub(r':.+', f': "{meta_description}",', description)
                        en_html = en_html.replace(description, new_description)
                    except:
                        pass

                    try:
                        en_html = re.sub(', meta, en_html)
                    except:
                        pass

                    try:
                        en_html = re.sub('Ahoana ny fomba hamoronana processor batch miaraka amin'ny python sy regex hanoloana ny tags html (parsing)', title, en_html)
                    except:
                        pass
            except FileNotFoundError:
                continue

        print(f'{filename} parsed ({amount})')
        amount += 1
        if use_parse_folder:
            try:
                with open(os.path.join(english_folder2+r'', ''+filename), 'w', encoding='utf-8') as new_html:
                    new_html.write(en_html)
            except:
                os.mkdir(english_folder2+r'')
                with open(os.path.join(english_folder2+r'', ''+filename), 'w', encoding='utf-8') as new_html:
                    new_html.write(en_html)
        else:
            with open(os.path.join(english_folder2, 'parsed_'+filename), 'w', encoding='utf-8') as html:
                html.write(en_html)
  

That's all folks.

If you like my code, please SHARE IT

Azonao atao ihany koa ny mijery ny dikan-kaodyPowershellna python hafaVersion 3naVersion 4naVersion 5


Latest articles accessed by readers:

  1. An Eye To See And A Mind To Understand
  2. Turn Towards Me With An Eye Full Of Your Own Gaze
  3. The Snapshot Of Magic In God's Universe
  4. Rhythm Of My Heart

Donate via Paypal

Alternate Text

RECURRENT DONATION

Donate monthly to support
the NeculaiFantanaru.com project

SINGLE DONATION

Donate the desired amount to support
the NeculaiFantanaru.com project

Donate by Bank Transfer

Account Ron: RO34INGB0000999900448439

Open account at ING Bank

Join The Neculai Fantanaru Community



* Note: If you want to read all my articles in real time, please check the romanian version !

decoration
About | Site Map | Partners | Feedback | Terms & Conditions | Privacy | RSS Feeds
© Neculai Fântânaru - All rights reserved