2010-07-02
3

Scraping a table using BeautifulSoup

I have a question that I suspect is fairly simple. I have the following type of page, from which I want to collect the information contained in the last table (if you scroll all the way down, it is the one in the box labeled "PROCEDURE"):

http://www.europarl.europa.eu/sides/getDoc.do?type=REPORT&mode=XML&reference=A7-2010-2&language=EN

The HTML for the table I want to scrape looks like this:

<tbody><tr class="doc_title"> 
<td style="background-image: url(&quot;/img/struct/navigation/gradient_blue.gif&quot;);" align="left" valign="top"><img src="/img/struct/functional/arrow_title_doc.gif" alt="" align="absmiddle" border="0" height="14" width="8"> <span style="font-weight: bold;">PROCEDURE</span></td><td style="background-image: url(&quot;/img/struct/navigation/gradient_blue.gif&quot;);" align="right" valign="top"> 
<table border="0" cellpadding="3" cellspacing="0" width="50"> 
<tbody><tr><td align="center"><a href="#top"><img src="/img/struct/functional/top_doc.gif" alt="" border="0" height="16" width="16"></a></td><td align="center"><img src="/img/struct/navigation/spacer.gif" alt="" border="0" height="10" width="15"></td><td align="center"><a href="#title2"><img src="/img/struct/functional/sort_up.gif" alt="" border="0" height="10" width="15"></a></td></tr></tbody></table></td></tr> 

<tr class="contents" valign="top"><td colspan="2"> 
<p></p><table style="border-collapse: collapse; width: 481.85pt;" align="center" cellspacing="0"> 
<tbody><tr style=""> 
<td style="border-width: 0.75pt 1pt 0.75pt 0.75pt; border-style: solid; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 38.24%;" rowspan="1" colspan="1"> 
<p style=""><span style="font-weight: bold;">Title</span></p> 
</td> 
<td style="border-width: 0.75pt 1pt 0.75pt 0.75pt; border-style: solid; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 61.76%;" rowspan="1" colspan="7"> 
<p style="">Mutual assistance for the recovery of claims relating to taxes, duties and other measures</p> 
</td> 
<td style="" rowspan="1" colspan="1"></td></tr> 

<tr style=""> 
<td style="border-width: 0.75pt 1pt 0.75pt 0.75pt; border-style: solid; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 38.24%;" rowspan="1" colspan="1"> 
<p style=""><span style="font-weight: bold;">References</span></p> 
</td> 
<td style="border-width: 0.75pt 1pt 0.75pt 0.75pt; border-style: solid; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 61.76%;" rowspan="1" colspan="7"> 
<p style=""><a href="http://ec.europa.eu/prelex/liste_resultats.cfm?CL=en&amp;ReqId=0&amp;DocType=COM&amp;DocYear=2009&amp;DocNum=0028">COM(2009)0028</a> – C6-0061/2009 – <a href="/oeil/FindByProcnum.do?lang=en&amp;procnum=CNS/2009/0007">2009/0007(CNS)</a></p> 
</td> 
<td style="" rowspan="1" colspan="1"></td></tr> 

<tr style=""> 
<td style="border-width: 0.75pt 1pt 0.75pt 0.75pt; border-style: solid; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 38.24%;" rowspan="1" colspan="1"> 
<p style=""><span style="font-weight: bold;">Date of consulting Parliament</span></p> 
</td> 
<td style="border-width: 0.75pt 1pt 0.75pt 0.75pt; border-style: solid; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 61.76%;" rowspan="1" colspan="7"> 
<p style="">16.2.2009</p> 
</td> 
<td style="" rowspan="1" colspan="1"></td></tr> 

<tr style=""> 
<td style="border-width: 0.75pt 1pt 0.75pt 0.75pt; border-style: solid; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 38.24%;" rowspan="1" colspan="1"> 
<p style=""><span style="font-weight: bold;">Committee responsible</span></p> 

<p style="">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Date announced in plenary</p> 
</td> 
<td style="border-width: 0.75pt 1pt 0.75pt 0.75pt; border-style: solid; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 61.76%;" rowspan="1" colspan="7"> 
<p style="">ECON</p> 

<p style="">19.10.2009</p> 
</td> 
<td style="" rowspan="1" colspan="1"></td></tr> 

<tr style=""> 
<td style="border-width: 0.75pt 1pt 0pt 0.75pt; border-style: solid solid none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 38.24%;" rowspan="1" colspan="1"> 
<p style=""><span style="font-weight: bold;">Committee(s) asked for opinion(s)</span></p> 

<p style="">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Date announced in plenary</p> 
</td> 
<td style="border-width: 1pt 0pt 0pt; border-style: solid none none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.88%;" rowspan="1" colspan="2"> 
<p style="">CONT</p> 

<p style="">19.10.2009</p> 
</td> 
<td style="border-width: 1pt 0pt 0pt; border-style: solid none none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.3%;" rowspan="1" colspan="2"> 
<p style="">JURI</p> 

<p style="">19.10.2009</p> 
</td> 
<td style="border-width: 1pt 0pt 0pt; border-style: solid none none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.3%;" rowspan="1" colspan="2"> 
<p style="">&nbsp;</p> 
</td> 
<td style="border-width: 0.75pt 1pt 0pt 0pt; border-style: solid solid none none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.3%;" rowspan="1" colspan="1"> 
<p style="">&nbsp;</p> 
</td> 
<td style="" rowspan="1" colspan="1"></td></tr> 

<tr style=""> 
<td style="border-width: 0.75pt 1pt 0.75pt 0.75pt; border-style: solid; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 38.24%;" rowspan="1" colspan="1"> 
<p style=""><span style="font-weight: bold;">Not delivering opinions</span></p> 

<p style="">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Date of decision</p> 
</td> 
<td style="border-width: 0.75pt 0pt 1pt; border-style: solid none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.88%;" rowspan="1" colspan="2"> 
<p style="">CONT</p> 

<p style="">1.10.2009</p> 
</td> 
<td style="border-width: 0.75pt 0pt 1pt; border-style: solid none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.3%;" rowspan="1" colspan="2"> 
<p style="">JURI</p> 

<p style="">5.10.2009</p> 
</td> 
<td style="border-width: 0.75pt 0pt 1pt; border-style: solid none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.3%;" rowspan="1" colspan="2"> 
<p style="">&nbsp;</p> 
</td> 
<td style="border-width: 0.75pt 1pt 0.75pt 0pt; border-style: solid solid solid none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.3%;" rowspan="1" colspan="1"> 
<p style="">&nbsp;</p> 
</td> 
<td style="" rowspan="1" colspan="1"></td></tr> 

<tr style=""> 
<td style="border-width: 0.75pt 1pt 0.75pt 0.75pt; border-style: solid; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 38.24%;" rowspan="1" colspan="1"> 
<p style=""><span style="font-weight: bold;">Rapporteur(s)</span></p> 

<p style="">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Date appointed</p> 
</td> 
<td style="border-width: 0.75pt 0pt 1pt; border-style: solid none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 20.59%;" rowspan="1" colspan="3"> 
<p style="">Theodor Dumitru Stolojan</p> 

<p style="">21.7.2009</p> 
</td> 
<td style="border-width: 0.75pt 0pt 1pt; border-style: solid none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 20.59%;" rowspan="1" colspan="2"> 
<p style="">&nbsp;</p> 
</td> 
<td style="border-width: 0.75pt 1pt 0.75pt 0pt; border-style: solid solid solid none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 20.59%;" rowspan="1" colspan="2"> 
<p style="">&nbsp;</p> 
</td> 
<td style="" rowspan="1" colspan="1"></td></tr> 

<tr style=""> 
<td style="border-width: 0.75pt 1pt 0pt 0.75pt; border-style: solid solid none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 38.24%;" rowspan="1" colspan="1"> 
<p style=""><span style="font-weight: bold;">Discussed in committee</span></p> 
</td> 
<td style="border-width: 1pt 0pt 0pt; border-style: solid none none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.88%;" rowspan="1" colspan="2"> 
<p style="">10.11.2009</p> 
</td> 
<td style="border-width: 1pt 0pt 0pt; border-style: solid none none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.3%;" rowspan="1" colspan="2"> 
<p style="">1.12.2009</p> 
</td> 
<td style="border-width: 1pt 0pt 0pt; border-style: solid none none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.3%;" rowspan="1" colspan="2"> 
<p style="">21.1.2010</p> 
</td> 
<td style="border-width: 0.75pt 1pt 0pt 0pt; border-style: solid solid none none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.3%;" rowspan="1" colspan="1"> 
<p style="">&nbsp;</p> 
</td> 
<td style="" rowspan="1" colspan="1"></td></tr> 

<tr style=""> 
<td style="border-width: 0.75pt 1pt 0.75pt 0.75pt; border-style: solid; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 38.24%;" rowspan="1" colspan="1"> 
<p style=""><span style="font-weight: bold;">Date adopted</span></p> 
</td> 
<td style="border-width: 0.75pt 0pt 1pt; border-style: solid none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.88%;" rowspan="1" colspan="2"> 
<p style="">27.1.2010</p> 
</td> 
<td style="border-width: 0.75pt 0pt 1pt; border-style: solid none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.3%;" rowspan="1" colspan="2"> 
<p style="">&nbsp;</p> 
</td> 
<td style="border-width: 0.75pt 0pt 1pt; border-style: solid none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.3%;" rowspan="1" colspan="2"> 
<p style="">&nbsp;</p> 
</td> 
<td style="border-width: 0.75pt 1pt 0.75pt 0pt; border-style: solid solid solid none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.3%;" rowspan="1" colspan="1"> 
<p style="">&nbsp;</p> 
</td> 
<td style="" rowspan="1" colspan="1"></td></tr> 

<tr style=""> 
<td style="border-width: 0.75pt 1pt 0.75pt 0.75pt; border-style: solid; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 38.24%;" rowspan="1" colspan="1"> 
<p style=""><span style="font-weight: bold;">Result of final vote</span></p> 
</td> 
<td style="border-width: 0.75pt 0pt 1pt; border-style: solid none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 12.94%;" rowspan="1" colspan="1"> 
<p style="">+:</p> 

<p style="">–:</p> 

<p style="">0:</p> 
</td> 
<td style="border-width: 0.75pt 1pt 0.75pt 0pt; border-style: solid solid solid none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 48.82%;" rowspan="1" colspan="6"> 
<p style="">39</p> 

<p style="">0</p> 

<p style="">1</p> 
</td> 
<td style="" rowspan="1" colspan="1"></td></tr> 

<tr style=""> 
<td style="border-width: 0.75pt 1pt 0.75pt 0.75pt; border-style: solid; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 38.24%;" rowspan="1" colspan="1"> 
<p style=""><span style="font-weight: bold;">Members present for the final vote</span></p> 
</td> 
<td style="border-width: 0.75pt 1pt 0.75pt 0.75pt; border-style: solid; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 61.76%;" rowspan="1" colspan="7"> 
<p style="">Burkhard Balz, Sharon Bowles, Udo Bullmann, Pascal Canfin, Nikolaos Chountis, George Sabin Cutaş, Leonardo Domenici, Derk Jan Eppink, Markus Ferber, Elisa Ferreira, Vicky Ford, José Manuel García-Margallo y Marfil, Jean-Paul Gauzès, Sylvie Goulard, Enikő Győri, Liem Hoang Ngoc, Eva Joly, Othmar Karas, Wolf Klinz, Jürgen Klute, Werner Langen, Astrid Lulling, Arlene McCarthy, Ivari Padar, Alfredo Pallone, Anni Podimata, Antolín Sánchez Presedo, Olle Schmidt, Edward Scicluna, Peter Simon, Peter Skinner, Theodor Dumitru Stolojan, Ivo Strejček, Kay Swinburne, Marianne Thyssen, Ramon Tremosa i Balcells</p> 
</td> 
<td style="" rowspan="1" colspan="1"></td></tr> 

<tr style=""> 
<td style="border-left: 0.75pt solid rgb(0, 0, 0); border-right: 1pt solid rgb(0, 0, 0); border-top: 0.75pt solid rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 38.24%;" rowspan="1" colspan="1"> 
<p style=""><span style="font-weight: bold;">Substitute(s) present for the final vote</span></p> 
</td> 
<td style="border-left: 0.75pt solid rgb(0, 0, 0); border-right: 1pt solid rgb(0, 0, 0); border-top: 0.75pt solid rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 61.76%;" rowspan="1" colspan="7"> 
<p style="">Marta Andreasen, Sophie Briard Auconie, David Casa, Danuta Jazłowiecka, Arturs Krišjānis Kariņš, Philippe Lamberts, Andreas Schwab</p> 
</td> 
<td style="" rowspan="1" colspan="1"></td></tr> 

<tr style=""> 
<td style="border: medium none; margin: 0pt; padding: 0pt; width: 38.24%;" rowspan="1" colspan="1"></td> 
<td style="border: medium none; margin: 0pt; padding: 0pt; width: 12.94%;" rowspan="1" colspan="1"></td> 
<td style="border: medium none; margin: 0pt; padding: 0pt; width: 2.94%;" rowspan="1" colspan="1"></td> 
<td style="border: medium none; margin: 0pt; padding: 0pt; width: 4.71%;" rowspan="1" colspan="1"></td> 
<td style="border: medium none; margin: 0pt; padding: 0pt; width: 10.58%;" rowspan="1" colspan="1"></td> 
<td style="border: medium none; margin: 0pt; padding: 0pt; width: 10%;" rowspan="1" colspan="1"></td> 
<td style="border: medium none; margin: 0pt; padding: 0pt; width: 5.29%;" rowspan="1" colspan="1"></td> 
<td style="border: medium none; margin: 0pt; padding: 0pt; width: 15.3%;" rowspan="1" colspan="1"></td> 
<td style="" rowspan="1" colspan="1"></td></tr> 
</tbody></table> 
</td></tr> 
</tbody> 

The problem I am facing is that the table tags have no identifiers (as far as I can tell), so I do not know how to select this table and write out its information. I have used BeautifulSoup so far to get other information from the website, but I do not know how to scrape this table.

If anyone can show me how to proceed, I would be very grateful!

Best regards,

Thomas

Answer

3

You can find elements by other attributes if you are a bit clever. I took a shot at scraping your data, and it is probably not the best approach, but it gets you closer.

The first thing I noticed was that you definitely want the data after the second occurrence of the word "PROCEDURE" (the first being the link, the second being the header). So I split on that:

data = html.split("PROCEDURE", 2)[2] 
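To make the indexing in that split concrete, here is a minimal sketch on a made-up stand-in for the page source (the `html` string below is an assumption for illustration, not the real page). With `maxsplit=2`, `str.split` cuts at the first two occurrences only, so index 2 holds everything after the second "PROCEDURE".

```python
# Tiny stand-in for the page source (assumed); the real page is far longer.
html = (
    '<a href="#procedure">PROCEDURE</a>'      # first occurrence: the link
    '<span>PROCEDURE</span>'                   # second occurrence: the header
    '<table><tr><td>Title</td></tr></table>'   # the part we actually want
)

# maxsplit=2 cuts at the first two occurrences only, giving three pieces;
# index 2 is everything after the second "PROCEDURE".
parts = html.split("PROCEDURE", 2)
data = parts[2]

print(len(parts))  # 3
print(data)        # </span><table><tr><td>Title</td></tr></table>
```

If the word appears fewer than twice, `parts[2]` raises an `IndexError`, so checking `len(parts)` first is a cheap safeguard.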

Then I looked for <td> tags with rowspan=1:

import BeautifulSoup  # BeautifulSoup 3 module (Python 2)

bs = BeautifulSoup.BeautifulSoup(data) 
tds = bs.findAll("td", { "rowspan": 1 }) 
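The two lines above are written against the old BeautifulSoup 3 module. For anyone on Python 3, a rough equivalent with the `bs4` package might look like the sketch below. It assumes `bs4` is installed, and the `data` fragment is a simplified stand-in for one row of the table, not the real page.

```python
from bs4 import BeautifulSoup  # assumption: bs4 (BeautifulSoup 4) is installed

# Assumed sample: a fragment shaped like one row of the PROCEDURE table.
data = """
<table><tbody>
<tr><td rowspan="1" colspan="1"><p><span>Title</span></p></td>
<td rowspan="1" colspan="7"><p>Mutual assistance for the recovery of claims</p></td>
<td rowspan="1" colspan="1"></td></tr>
</tbody></table>
"""

soup = BeautifulSoup(data, "html.parser")

# Attribute values are strings in the parsed tree, so match on "1".
tds = soup.find_all("td", attrs={"rowspan": "1"})

# get_text(strip=True) trims surrounding whitespace from each cell.
texts = [td.get_text(strip=True) for td in tds]
print(texts)
```

The third cell comes back as an empty string; that is the empty spacer cell at the end of each row.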

Getting closer...

>>> tds[0].text 
u'Title' 
>>> tds[1].text 
u'Mutual assistance for the recovery of claims relating to taxes, duties and other measures' 
>>> tds[3].text 
u'References' 
>>> tds[4].text 
u'COM(2009)00282009/0007(CNS)2009 a>' 

Note that I skipped index 2 in tds, because they use a spacer cell or something (it is empty). Anyway, it is a start. The real trick I have found with BeautifulSoup is to feed it only the data in the area you are looking for, since there is less to sift through. It also prides itself on accepting bad input, so do not be afraid to feed it garbage. I went a little further down the list of elements, and it is not perfect. You will need to refine the search, because they have <td> elements nested inside other <td>s for the values.
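One possible way to refine the search, sketched here with `bs4` on Python 3 rather than the BeautifulSoup 3 calls above: walk the table row by row, look only at each row's own cells, and drop the empty spacers. The helper name `extract_rows`, the " / " separator, and the simplified `data` fragment are all assumptions made for illustration.

```python
from bs4 import BeautifulSoup  # assumption: bs4 (BeautifulSoup 4) is installed

# Assumed sample: two rows copied in simplified form from the table above.
data = """
<table><tbody>
<tr>
<td rowspan="1" colspan="1"><p><span>Title</span></p></td>
<td rowspan="1" colspan="7"><p>Mutual assistance for the recovery of claims</p></td>
<td rowspan="1" colspan="1"></td>
</tr>
<tr>
<td rowspan="1" colspan="1"><p><span>Committee responsible</span></p>
<p>&nbsp;&nbsp;Date announced in plenary</p></td>
<td rowspan="1" colspan="7"><p>ECON</p><p>19.10.2009</p></td>
<td rowspan="1" colspan="1"></td>
</tr>
</tbody></table>
"""


def extract_rows(fragment):
    """Return one list per <tr>; each cell is its <p> texts joined by ' / '."""
    soup = BeautifulSoup(fragment, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        cells = []
        # recursive=False keeps only the row's own cells, not nested ones.
        for td in tr.find_all("td", recursive=False):
            parts = [p.get_text(" ", strip=True) for p in td.find_all("p")]
            text = " / ".join(part for part in parts if part)
            if text:  # drop the empty spacer cells
                cells.append(text)
        if cells:
            rows.append(cells)
    return rows


rows = extract_rows(data)
for row in rows:
    print(row)
```

Dropping empty cells keeps the output tidy but loses column alignment in rows where a middle column is blank; if positions matter, keep the empty strings instead.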

+0

Hi Jed, thanks a lot for the help, that is a nice illustration of how to proceed. I do not want to take up too much of your time, but could you show me how to put the table elements into a spreadsheet format? I know it is probably a lot of work, so if you do not have the time that is perfectly fine. All the best, Thomas.
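For the spreadsheet question, one option is CSV, which most spreadsheet programs open directly. The sketch below uses only Python's standard `csv` module. The `rows` list is hand-copied sample data from the table in the question, and `rows_to_csv` is a hypothetical helper name, so treat this as an outline rather than a finished script.

```python
import csv
import io

# Assumed input: one list per table row, label first, then its values.
rows = [
    ["Title", "Mutual assistance for the recovery of claims relating to taxes, duties and other measures"],
    ["Date of consulting Parliament", "16.2.2009"],
    ["Discussed in committee", "10.11.2009", "1.12.2009", "21.1.2010"],
]


def rows_to_csv(rows):
    """Serialize rows to CSV text; the csv module quotes commas as needed."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = rows_to_csv(rows)
print(csv_text)
```

To write a file instead of a string, the same writer can wrap a file opened with `open(path, "w", newline="", encoding="utf-8")`, where `path` is whatever file name you choose.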