2010-03-18 13 views
-2

I want to find the span tag inside an LI tag and read its attributes. I tried with BeautifulSoup but had no luck. The details of my code are below. Can someone point me to the right methodology?

In this code, my getId function should return id = "0_False-2".

Does anyone know the right way to do this?


from BeautifulSoup import BeautifulSoup as bs
import re

html = '<ul>\
<li class="line">&nbsp;</li>\
<li class="folder-open-last" id="0">\
<img style="float: left;" class="trigger" src="/media/images/spacer.gif" border="0">\
<span class="text" id="0_False">NOC</span><ul style="display: block;"><li class="line">&nbsp;</li><li class="doc" id="1"><span class="active text" id="0_False-1">PNQAIPMS1</span></li><li class="line">&nbsp;</li><li class="doc-last" id="2"><span class="text" id="0_False-2">PNQAIPMS2</span></li><li class="line-last"></li></ul></li><li class="line-last"></li>\
</ul>'


def getId(html, txt):
    soup = bs(html)
    soup.findAll('ul',recursive=False)
    head = soup.contents[0]
    temp = head
    elements = {}
    while True:
        # If temp is None, there are no more nodes to visit
        if temp is None:
            break
        #print temp
        if re.search('li', str(temp)) is not None:
            attr = str(temp.attrs).encode('ascii','ignore')
            attr = attr.replace(' ', '')
            attr = attr.replace('[', '')
            attr = attr.replace(']', '')
            attr = attr.replace(')', '')
            attr = attr.replace('(', '')
            attr = attr.replace('u\'', '')
            attr = attr.replace('\'', '')
            attr = attr.split(',')
            span = str(temp.text)

            if span == txt:
                return attr[3]

            temp = temp.next
        else:
            temp = temp.next


id = getId(html,"PNQAIPMS2") 
print "ID = " + id 

Answer

0

I'm sure someone can show you the BeautifulSoup way, but here is my approach: plain old Python string manipulation.

html = '<ul>\
<li class="line">&nbsp;</li>\
<li class="folder-open-last" id="0">\
<img style="float: left;" class="trigger" src="/media/images/spacer.gif" border="0">\
<span class="text" id="0_False">NOC</span><ul style="display: block;"><li class="line">&nbsp;</li><li class="doc" id="1"><span class="active text" id="0_False-1">PNQAIPMS1</span></li><li class="line">&nbsp;</li><li class="doc-last" id="2"><span class="text" id="0_False-2">PNQAIPMS2</span></li><li class="line-last"></li></ul></li><li class="line-last"></li>\
</ul>'


def getId(html, txt):
    # Cut the markup into chunks, one per closing </li>
    for LI in html.split("</li>"):
        if "span" in LI:
            # Splitting on "span" isolates the attributes and text of each span
            for CL in LI.split("span"):
                if "class" in CL and "id" in CL and "text" in CL and txt in CL:
                    # Keep what follows the last id= up to the closing quote
                    return CL.split("id=")[-1].split('">')[0].replace('"', "")

print("id for PNQAIPMS2:", getId(html, "PNQAIPMS2"))
print("id for NOC:", getId(html, "NOC"))
print("id for PNQAIPMS1:", getId(html, "PNQAIPMS1"))

output

$ ./python.py 
id for PNQAIPMS2: 0_False-2 
id for NOC: 0_False 
id for PNQAIPMS1: 0_False-1 
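
Splitting on tag names works for this exact markup, but it can break if the attribute order or quoting changes. As a rough sketch of the same lookup with a real parser, here is a version built on Python 3's standard-library html.parser (not BeautifulSoup). The names SpanIdFinder and get_id and the shortened HTML sample are my own illustrations, not part of the original code, and the sketch assumes span tags are not nested inside each other.

```python
from html.parser import HTMLParser

# Shortened copy of the markup from the question (illustrative sample).
HTML = (
    '<ul>'
    '<li class="line">&nbsp;</li>'
    '<li class="folder-open-last" id="0">'
    '<img class="trigger" src="/media/images/spacer.gif" border="0">'
    '<span class="text" id="0_False">NOC</span>'
    '<ul style="display: block;">'
    '<li class="doc" id="1"><span class="active text" id="0_False-1">PNQAIPMS1</span></li>'
    '<li class="doc-last" id="2"><span class="text" id="0_False-2">PNQAIPMS2</span></li>'
    '</ul></li>'
    '</ul>'
)


class SpanIdFinder(HTMLParser):
    """Map the text of every <span> to that span's id attribute."""

    def __init__(self):
        super().__init__()
        self._in_span = False     # True while we are between <span> and </span>
        self._current_id = None   # id attribute of the span being read
        self._buffer = []         # text pieces seen inside the current span
        self.ids_by_text = {}

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self._in_span = True
            self._current_id = dict(attrs).get("id")
            self._buffer = []

    def handle_data(self, data):
        if self._in_span:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._in_span:
            text = "".join(self._buffer).strip()
            # Keep the first span seen for a given text
            self.ids_by_text.setdefault(text, self._current_id)
            self._in_span = False


def get_id(html, txt):
    """Return the id of the first <span> whose text equals txt, else None."""
    parser = SpanIdFinder()
    parser.feed(html)
    parser.close()
    return parser.ids_by_text.get(txt)


print("id for PNQAIPMS2:", get_id(HTML, "PNQAIPMS2"))
```

Unlike the string-splitting version, this one reads attributes through the parser, so it does not depend on id coming last or on double quotes being used.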
+0

I was looking for a BeautifulSoup way, but this really helped me a lot. Thanks for your help. – Mahesh