How to remove an element in lxml

Question 1

I need to completely remove elements based on the contents of an attribute, using Python's lxml. Example:

import lxml.etree as et

xml="""
<groceries>
  <fruit state="rotten">apple</fruit>
  <fruit state="fresh">pear</fruit>
  <fruit state="fresh">starfruit</fruit>
  <fruit state="rotten">mango</fruit>
  <fruit state="fresh">peach</fruit>
</groceries>
"""

tree=et.fromstring(xml)

for bad in tree.xpath("//fruit[@state='rotten']"):
    pass  # remove this element from the tree

print(et.tostring(tree, pretty_print=True).decode())

I would like this to print:

<groceries>
  <fruit state="fresh">pear</fruit>
  <fruit state="fresh">starfruit</fruit>
  <fruit state="fresh">peach</fruit>
</groceries>

Is there a way to do this without keeping a temporary variable and manually printing into it, like this:

newxml = "<groceries>\n"
for elt in tree.xpath("//fruit[@state='fresh']"):
    newxml += et.tostring(elt, encoding="unicode")

newxml += "</groceries>"

Question 2

Use the remove method of the XML element:

tree=et.fromstring(xml)

for bad in tree.xpath("//fruit[@state='rotten']"):
    bad.getparent().remove(bad)     # grab the element's parent and call remove directly on it

print(et.tostring(tree, pretty_print=True, xml_declaration=True).decode())

Compared with @Acorn's version, mine works even if the elements to remove are not directly under the root node of your XML.
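To illustrate the difference, here is a minimal sketch. It assumes the version being compared calls remove on the root element; the nested punnet data is only an illustration. In lxml, remove raises ValueError when the element is not a direct child of the node it is called on, while going through getparent() works at any depth.

```python
import lxml.etree as et

xml = """
<groceries>
  <fruit state="rotten">apple</fruit>
  <punnet>
    <fruit state="rotten">strawberry</fruit>
    <fruit state="fresh">blueberry</fruit>
  </punnet>
</groceries>
"""

tree = et.fromstring(xml)
nested = tree.xpath("//punnet/fruit[@state='rotten']")[0]

# Calling remove() on the root fails for an element that is not its direct child.
try:
    tree.remove(nested)
    root_remove_failed = False
except ValueError:
    root_remove_failed = True

# Going through each element's own parent works regardless of nesting depth.
for bad in tree.xpath("//fruit[@state='rotten']"):
    bad.getparent().remove(bad)

remaining = [fruit.text for fruit in tree.iter("fruit")]
print(root_remove_failed, remaining)
```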

Question 3

You're looking for the remove function. Call remove on the element's parent and pass it the child element to remove.

import lxml.etree as et

xml="""
<groceries>
  <fruit state="rotten">apple</fruit>
  <fruit state="fresh">pear</fruit>
  <punnet>
    <fruit state="rotten">strawberry</fruit>
    <fruit state="fresh">blueberry</fruit>
  </punnet>
  <fruit state="fresh">starfruit</fruit>
  <fruit state="rotten">mango</fruit>
  <fruit state="fresh">peach</fruit>
</groceries>
"""

tree=et.fromstring(xml)

for bad in tree.xpath("//fruit[@state='rotten']"):
    bad.getparent().remove(bad)

print(et.tostring(tree, pretty_print=True).decode())

Result:

<groceries>
  <fruit state="fresh">pear</fruit>
  <punnet>
    <fruit state="fresh">blueberry</fruit>
  </punnet>
  <fruit state="fresh">starfruit</fruit>
  <fruit state="fresh">peach</fruit>
</groceries>

Question 4

I ran into one situation:

<div>
    <script>
        some code
    </script>
    text here
</div>

div.remove(script) also removes the "text here" part, which I did not want.

Following the answer here, I found that etree.strip_elements is a better solution for me, because you can control whether the tail text is removed with the with_tail=(bool) parameter.

I still don't know whether it can take an XPath filter for the tag, though. I'm just posting this for information.

Here is the documentation:

strip_elements(tree_or_element, *tag_names, with_tail=True)

Delete all elements with the provided tag names from a tree or subtree. This will remove the elements and their entire subtree, including all their attributes, text content and descendants. It will also remove the tail text of the element unless you explicitly set the with_tail keyword argument option to False.

Tag names can contain wildcards as in _Element.iter.

Note that this will not delete the element (or ElementTree root element) that you passed even if it matches. It will only treat its descendants. If you want to include the root element, check its tag name directly before even calling this function.

Example usage:
   strip_elements(some_element,
       'simpletagname',             # non-namespaced tag
       '{http://some/ns}tagname',   # namespaced tag
       '{http://some/other/ns}*',   # any tag from a namespace
       lxml.etree.Comment           # comments
       )
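To make the with_tail behaviour concrete, here is a minimal sketch on a compact version of the div/script case above. The variable names are only illustrative. It runs strip_elements once with the default and once with with_tail=False.

```python
from lxml import etree

source = "<div><script>some code</script>text here</div>"

# Default (with_tail=True): the tail text goes away together with the element.
div_default = etree.fromstring(source)
etree.strip_elements(div_default, "script")
without_tail = etree.tostring(div_default, encoding="unicode")

# with_tail=False: only the element is removed, its tail text stays in the div.
div_keep = etree.fromstring(source)
etree.strip_elements(div_keep, "script", with_tail=False)
with_tail_kept = etree.tostring(div_keep, encoding="unicode")

print(without_tail)
print(with_tail_kept)
```

Note that strip_elements selects by tag name only, so filtering on an attribute value still needs an XPath query combined with a removal that handles the tail itself.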

Question 5

As already mentioned, you can use the remove() method to delete (sub)elements from the tree:

for bad in tree.xpath("//fruit[@state='rotten']"):
  bad.getparent().remove(bad)

But this removes the element together with its tail, which is a problem if you are processing mixed-content documents such as HTML:

<div><fruit state="rotten">avocado</fruit> Hello!</div>

becomes

<div></div>

That is probably not always what you want :) I created a helper function that removes only the element and keeps its tail:

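def remove_element(el):
    parent = el.getparent()
    # el.tail is None when no text follows the element
    if el.tail and el.tail.strip():
        prev = el.getprevious()
        # compare with None: an lxml element without children is falsy
        if prev is not None:
            prev.tail = (prev.tail or '') + el.tail
        else:
            parent.text = (parent.text or '') + el.tail
    parent.remove(el)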

for bad in tree.xpath("//fruit[@state='rotten']"):
    remove_element(bad)

This way the tail text is preserved:

<div> Hello!</div>
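As a quick check of the idea, the following sketch runs a tail-preserving removal on two inputs: the div above, and a variant where the removed element has a previous sibling. The name remove_keep_tail and the `<b>Hi</b>` sibling are made up for illustration. The sketch tests the previous sibling with `is not None`, because an lxml element without children evaluates as false, and it guards against el.tail being None.

```python
import lxml.etree as et

def remove_keep_tail(el):
    """Remove el from its parent but keep non-blank tail text in the document."""
    parent = el.getparent()
    if el.tail and el.tail.strip():
        prev = el.getprevious()
        if prev is not None:
            # Append the tail to the previous sibling's tail.
            prev.tail = (prev.tail or "") + el.tail
        else:
            # No previous sibling: the tail becomes part of the parent's text.
            parent.text = (parent.text or "") + el.tail
    parent.remove(el)

first = et.fromstring('<div><fruit state="rotten">avocado</fruit> Hello!</div>')
second = et.fromstring('<div><b>Hi</b><fruit state="rotten">avocado</fruit> Hello!</div>')

for doc in (first, second):
    for bad in doc.xpath("//fruit[@state='rotten']"):
        remove_keep_tail(bad)

first_out = et.tostring(first, encoding="unicode")
second_out = et.tostring(second, encoding="unicode")
print(first_out)
print(second_out)
```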

Question 6

You can also use the html module from lxml to solve this:

from lxml import html

xml="""
<groceries>
  <fruit state="rotten">apple</fruit>
  <fruit state="fresh">pear</fruit>
  <fruit state="fresh">starfruit</fruit>
  <fruit state="rotten">mango</fruit>
  <fruit state="fresh">peach</fruit>
</groceries>
"""

tree = html.fromstring(xml)

print("//BEFORE")
print(html.tostring(tree, pretty_print=True).decode("utf-8"))

for i in tree.xpath("//fruit[@state='rotten']"):
    i.drop_tree()

print("//AFTER")
print(html.tostring(tree, pretty_print=True).decode("utf-8"))

It should print this:

//BEFORE
<groceries>
  <fruit state="rotten">apple</fruit>
  <fruit state="fresh">pear</fruit>
  <fruit state="fresh">starfruit</fruit>
  <fruit state="rotten">mango</fruit>
  <fruit state="fresh">peach</fruit>
</groceries>


//AFTER
<groceries>

  <fruit state="fresh">pear</fruit>
  <fruit state="fresh">starfruit</fruit>

  <fruit state="fresh">peach</fruit>
</groceries>
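The blank lines in the output are there because drop_tree() joins the removed element's tail text to the previous sibling or to the parent instead of discarding it. A minimal sketch of the same call on mixed content, reusing the avocado snippet from the earlier answer as an assumed input:

```python
from lxml import html

div = html.fromstring('<div><fruit state="rotten">avocado</fruit> Hello!</div>')

for bad in div.xpath(".//fruit[@state='rotten']"):
    # drop_tree() removes the element and its children but keeps the tail text.
    bad.drop_tree()

result = html.tostring(div, encoding="unicode")
print(result)
```

This makes drop_tree() a convenient choice for HTML-like input, where text following a removed element usually has to stay.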