[Solved] Regex problems

Question

PapyJP

31 Oct 2020 at 11:21

Hello everyone,
I ran into the following problem:
I have a text that contains web addresses, something like:


text = "Vous trouverez des informations à  https://adresse1,   http://adresse2  ou  https://adresse3."

My goal: replace the addresses with links, as the Alsacreations forum and many other websites do.
I tried, without success, to write a regex that would give me the array


["Vous trouverez des informations à ", "https://adresse1", ", ", "http://adresse2", " ou ", "https://adresse3", "."]

I ended up writing:


text = text.replace(/\s+/g, ' ');
words = text.split(' ');

By analyzing the "words" I got the desired result, since the request was urgent (end of month), but I would still like to solve this regex problem.
Thanks for your help

parsimonhi

Moderator

31 Oct 2020 at 15:38

Hello,

I am not sure I understood the question.

Assuming that all the strings resembling text have already been retrieved by another regex, it seems to me that you can do:


$text = "Vous trouverez des informations à   https://adresse1,     http://adresse2   ou   https://adresse3.";
 
if(preg_match_all("/(Vous trouverez des informations à )|(https?[^\s]+[^\s,.])|(, )|( ou )/",$text,$m))
{
	foreach($m[0] as $a)
		echo "\"".$a."\"<br>";
}
else echo "Pas de résultat !<br>";

EDIT: the same thing in JavaScript, with the result written to the console


var m, text = "Vous trouverez des informations à   https://adresse1,   https://adresse2   ou    https://adresse3.";
 
if(m=text.match(/(Vous trouverez des informations à )|(https?[^\s]+[^\s,.])|(, )|( ou )/g))
{
	m.forEach(e =>  console.log(e));
}
else console.log("Pas de résultat !");
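The pattern above lists the literal text fragments it expects, so it only fits this one sentence. A more general variant can be sketched with split() and a capturing group, which keeps the matched URLs in the resulting array. The names urlPattern and parts are illustrative, and the URL pattern is a simplifying assumption (anything starting with http:// or https://, minus a trailing comma or period), not a full URL grammar.

```javascript
// Sample text from the question, with runs of whitespace collapsed first,
// as in the original workaround.
const text = "Vous trouverez des informations à  https://adresse1,   http://adresse2  ou  https://adresse3."
  .replace(/\s+/g, " ");

// The capturing group makes split() keep the separators (the URLs) in the result.
// The final character class keeps a trailing comma or period out of the URL.
const urlPattern = /(https?:\/\/\S*[^\s,.])/;

const parts = text.split(urlPattern);
console.log(parts);
// [ 'Vous trouverez des informations à ', 'https://adresse1', ', ',
//   'http://adresse2', ' ou ', 'https://adresse3', '.' ]
```

With this approach the URLs always sit at the odd indexes of the array and the surrounding text at the even ones.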

Regards,
Edited by parsimonhi (31 Oct 2020 - 16:03)

PapyJP

31 Oct 2020 at 18:33

Thanks for your answer.
I think I explained the problem badly, so let me try again.
The site owner writes the content using a simplified HTML.
Here is the actual code (https://tests.osirisnet.net/news/@n_10_20.htm lines 83 to 98)


 <section lang="en">
             <p class="top10"><strong>(1) An Android app provides reliable and user-friendly information on more than 30000 Egyptian words:</strong><br>
        https://play.google.com/store/apps/details?id=com.aed_ancientegyptiandictionary<br>
 
You can search for transcription, for German or English translation, or for hieroglyphs used for the spelling of the word forms. Please check the tutorials:  https://youtu.be/_s58Ud5rB7c  and  https://youtu.be/bp7MYCjavOs</p>
 

<p class="top10"><strong>(2) The lexical entries are collected on the website</strong>  https://simondschweitzer.github.io/aed/<br>
 
or in a three-volume PDF (don't miss them!):  https://doi.org/10.5281/zenodo.4073311,  <br>  https://doi.org/10.5281/zenodo.4073317,  and  https://doi.org/10.5281/zenodo.4073321<br>
 
The entries offer selected references, collocation partners, hieroglyphic spellings, information on roots, on grammatical usage, on chronological and geographical distribution. For an overview:  https://youtu.be/O357NG49LyQ</p>
 

<p class="top10"><strong>(3) The approximately 11000 texts containing the references are digitally published</strong> in TEI-XML and thus guarantee verifiability:  http://doi.org/10.5281/zenodo.3580939<br>
 
There is a schema representing Egyptian texts in TEI. You can use it for your own text encodings:<br>  https://github.com/simondschweitzer/aed-tei/blob/v1.0/files/aed_schema.xsd<br>
 
Furthermore, a thesaurus for object type terms, for date terms and more is available:<br>  https://github.com/simondschweitzer/aed-tei/blob/v1.0/files/thesaurus.xml<br>
 
I am currently developing a digital tool in ANNIS (https://corpus-tools.org/annis/) that makes the data accessible for corpus linguistic research. A beta version will be presented soon.<br>
This data and the data of the AED project are published under the free license CC-BY-SA 4.0, which allows to use them for further research. </p>

    </section>

He is an expert in Egyptology, not in writing HTML.
My job is not to turn his HTML into something more correct, but to write JavaScript functions that make it look better.

To produce this text, I suppose he copied and pasted from a rendered HTML page (not from its source).

Since this <section> contains many web addresses, I thought it would be good to replace those addresses with links to the pages in question.

The transformation script runs once the page has finished loading. It already performs quite a few transformations; this is just one more.

What the script does:
1) find the text nodes containing the string "https://..."
2) replace each of these text nodes with a documentFragment that holds the "non-address" parts as text nodes and the "address" parts as <a href="https://...">https://...</a>
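The two steps above can be sketched roughly as follows. This is not the code from common.js: the names URL_SPLITTER, splitOnUrls, buildFragment and linkifyTextNodes are hypothetical, and the URL pattern is a simplifying assumption. The DOM part only runs where a document exists, so the splitting helper can be tried on its own.

```javascript
// Assumed URL pattern: starts with http:// or https://, stops at whitespace or "<",
// and does not end with common trailing punctuation.
const URL_SPLITTER = /(https?:\/\/[^\s<]*[^\s<,;:.)])/;

// Split a string so that odd indexes hold URLs and even indexes hold plain text.
function splitOnUrls(text) {
  return text.split(URL_SPLITTER);
}

// Build a DocumentFragment in which every URL becomes an <a> element.
function buildFragment(text, doc) {
  const fragment = doc.createDocumentFragment();
  splitOnUrls(text).forEach((part, index) => {
    if (part === "") return; // split() can yield empty strings at the edges
    if (index % 2 === 1) {
      const link = doc.createElement("a");
      link.href = part;
      link.textContent = part;
      fragment.appendChild(link);
    } else {
      fragment.appendChild(doc.createTextNode(part));
    }
  });
  return fragment;
}

// Replace every text node under `root` that contains a URL.
function linkifyTextNodes(root) {
  const doc = root.ownerDocument;
  const walker = doc.createTreeWalker(root, NodeFilter.SHOW_TEXT);
  const targets = [];
  // Collect first, then replace, so the walker is not disturbed by DOM changes.
  while (walker.nextNode()) {
    const node = walker.currentNode;
    const parent = node.parentElement;
    const skip = parent && parent.closest("a, script, style");
    if (!skip && URL_SPLITTER.test(node.nodeValue)) targets.push(node);
  }
  for (const node of targets) {
    node.replaceWith(buildFragment(node.nodeValue, doc));
  }
}

// Only touch the DOM when running in a browser.
if (typeof document !== "undefined") {
  linkifyTextNodes(document.body);
}
```

Working on text nodes rather than on markup means that addresses already sitting inside an attribute or an existing link are left alone.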

The code is in https://tests.osirisnet.net/common.js lines 1499 to 1549

As I explained above, the solution I ended up using to handle the problem is
1) to break the content of the initial text node into "words" with a split(" ")
2) to look at each "word", check whether or not it starts with https://, and fill the documentFragment word by word.
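For comparison, the word-by-word idea can be sketched like this. The helper name classifyWords and its return shape are hypothetical, and the actual code in common.js may differ.

```javascript
// Hypothetical helper: classify each space-separated "word" as a URL or plain text.
function classifyWords(text) {
  return text
    .replace(/\s+/g, " ") // collapse runs of whitespace, as in the original workaround
    .trim()
    .split(" ")
    .map((word) => ({ word, isUrl: /^https?:\/\//.test(word) }));
}

const words = classifyWords(
  "Please check the tutorials:  https://youtu.be/_s58Ud5rB7c  and  https://youtu.be/bp7MYCjavOs"
);
console.log(words.filter((w) => w.isUrl).map((w) => w.word));
// [ 'https://youtu.be/_s58Ud5rB7c', 'https://youtu.be/bp7MYCjavOs' ]
```

Note that under this scheme a word such as "https://doi.org/10.5281/zenodo.4073311," keeps its trailing comma, so some extra handling of punctuation is presumably needed.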

It works, no problem, but I would have preferred to cut the string into longer segments, rather than word by word, with a

text.match(/.../g)

The question is how to write the corresponding regex.
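One possible shape for such a regex, as a sketch only: an alternation whose first branch matches a URL and whose second branch matches everything up to the next URL. The names segmentPattern and segment are illustrative, and the URL branch is a simplifying assumption about where an address ends.

```javascript
// First alternative: a URL that does not end with trailing punctuation.
// Second alternative: one character, then everything up to the next "http://" or "https://".
const segmentPattern = /https?:\/\/[^\s<]*[^\s<,;:.)]|[\s\S](?:(?!https?:\/\/)[\s\S])*/g;

// Cut a string into alternating text and URL segments.
function segment(text) {
  return text.match(segmentPattern) || [];
}

const sample =
  "or in a three-volume PDF (don't miss them!): https://doi.org/10.5281/zenodo.4073311, and https://doi.org/10.5281/zenodo.4073321";
const segments = segment(sample);
console.log(segments);
// [ "or in a three-volume PDF (don't miss them!): ",
//   'https://doi.org/10.5281/zenodo.4073311',
//   ', and ',
//   'https://doi.org/10.5281/zenodo.4073321' ]
```

Because the second branch always consumes at least one character, joining the segments gives back the original string, so nothing is lost when the pieces are written into a documentFragment.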
Edited by PapyJP (31 Oct 2020 - 18:34)

parsimonhi

Moderator

31 Oct 2020 at 21:02

Hello,

This seems to more or less work (to be adapted to your context, of course):

<!DOCTYPE html>
<html lang="fr">
<head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
</head>
<body>

<p class="top10"><strong>(1) An Android app provides reliable and user-friendly information on more than 30000 Egyptian words:</strong><br>
 https://play.google.com/store/apps/details?id=com.aed_ancientegyptiandictionary<br>
 
 
You can search for transcription, for German or English translation, or for hieroglyphs used for the spelling of the word forms. Please check the tutorials:   https://youtu.be/_s58Ud5rB7c   and   https://youtu.be/bp7MYCjavOs</p>
 
 
<p class="top10"><strong>(2) The lexical entries are collected on the website</strong>   https://simondschweitzer.github.io/aed/<br>
 
 
or in a three-volume PDF (don't miss them!):   https://doi.org/10.5281/zenodo.4073311,   <br>   https://doi.org/10.5281/zenodo.4073317,   and   https://doi.org/10.5281/zenodo.4073321<br>
 
 
The entries offer selected references, collocation partners, hieroglyphic spellings, information on roots, on grammatical usage, on chronological and geographical distribution. For an overview:   https://youtu.be/O357NG49LyQ</p>
 

<p class="top10"><strong>(3) The approximately 11000 texts containing the references are digitally published</strong> in TEI-XML and thus guarantee verifiability:   http://doi.org/10.5281/zenodo.3580939<br>
 
 
There is a schema representing Egyptian texts in TEI. You can use it for your own text encodings:<br>   https://github.com/simondschweitzer/aed-tei/blob/v1.0/files/aed_schema.xsd<br>
 
 
Furthermore, a thesaurus for object type terms, for date terms and more is available:<br>   https://github.com/simondschweitzer/aed-tei/blob/v1.0/files/thesaurus.xml<br>
 
 
I am currently developing a digital tool in ANNIS (https://corpus-tools.org/annis/) that makes the data accessible for corpus linguistic research. A beta version will be presented soon.<br>
This data and the data of the AED project are published under the free license CC-BY-SA 4.0, which allows to use them for further research. </p>

<script>
function setLinksForOneArticle(a)
{
	var r=new RegExp('(https?[^\\s<]+[^\\s<,;:.\\)])','g');
	a.innerHTML=a.innerHTML.replace(r,"<a href=\"$1\">$1</a>");
}
function setLinks() {
	var articles = document.querySelectorAll("p");
	for(var i = 0; i < articles.length; i++) setLinksForOneArticle(articles[i]);
}
setLinks();
</script>
</body>
</html>

Regards,
Edited by parsimonhi (31 Oct 2020 - 21:28)

PapyJP

1 Nov 2020 at 08:46

Thanks, it works!

Topic closed