TL;DR
from xml.dom import minidom
parsedXML = minidom.parseString("[your xml here]")
elements_by_tag = parsedXML.getElementsByTagName('[tagname here]') # get list of elms with tagname
print(elements_by_tag[0].firstChild.nodeValue) # print inner text value of an element
Introduction
I recently spent a few hours refactoring some of the backend code on my personal website. In changing the backend, I wanted to make sure the refactored code worked the same way as my old code.
To do this, I wrote a unit test in Python that sent a request to every URL on my site running the old backend code and the corresponding URL with my local server running the new code, to make sure they worked exactly the same.
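A comparison like this could be sketched roughly as follows. The `OLD_BASE` and `NEW_BASE` addresses and the `fetch` and `find_mismatches` helpers are hypothetical names chosen for illustration, not the actual test from my site, and the sketch assumes that comparing raw response bodies is a good enough definition of "the same".

```python
from urllib.request import urlopen

# Hypothetical base addresses: substitute the live site and the local dev server.
OLD_BASE = "https://example.com"
NEW_BASE = "http://localhost:8000"


def fetch(url):
    """Return the raw response body for a URL as bytes."""
    with urlopen(url) as response:
        return response.read()


def find_mismatches(paths, fetch_old, fetch_new):
    """Return the paths whose old and new responses differ."""
    return [path for path in paths if fetch_old(path) != fetch_new(path)]
```

Passing the two fetchers in as arguments keeps the comparison separate from the networking, so something like `find_mismatches(["/", "/about/"], lambda p: fetch(OLD_BASE + p), lambda p: fetch(NEW_BASE + p))` would return an empty list when both backends agree.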
As with many other sites, I have a sitemap which lists all of the URLs on the site with their titles and other information for search engines and other bots. Since this sitemap is programmed to output all of the URLs on my site (excluding error pages like the 404 page), this seemed like a perfect way to get a list of URLs on my site to test the new backend code.
Sitemaps are written in XML, a markup language similar to HTML that uses the same tag-and-element syntax.
So, to get all the URLs from my sitemap, I had to fetch the raw XML from the sitemap and then parse it to get the elements with a particular tag name (loc in this case).
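Fetching the raw XML is not covered in detail in this post, so here is a minimal sketch of one way to do it with the standard library's `urllib.request`. The `fetch_sitemap` name is an assumption made for this example, and so is any sitemap address you pass to it.

```python
from urllib.request import urlopen
from xml.dom import minidom


def fetch_sitemap(url):
    """Download a sitemap and return it as a parsed minidom Document."""
    with urlopen(url) as response:
        raw_xml = response.read()  # bytes; parseString accepts bytes or str
    return minidom.parseString(raw_xml)
```

A call such as `fetch_sitemap("https://example.com/sitemap.xml")` (a placeholder address) would return the same kind of object that the parsing step below produces from a string.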
1. Parse XML with xml.dom.minidom
To parse the XML, we'll use xml.dom.minidom, a module in Python's standard library.
The minidom module has a handy parseString function that parses XML from a Python string. It returns a Document object with methods for reading information from the parsed XML and selecting elements.
from xml.dom import minidom
parsedXML = minidom.parseString("""
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>https://gabrielromualdo.com/</loc>
<changefreq>weekly</changefreq>
<priority>0.9</priority>
</url>
<url>
<loc>https://gabrielromualdo.com/about/</loc>
<changefreq>monthly</changefreq>
<priority>0.6</priority>
</url>
</urlset>
""")
Now we can call various methods on the parsedXML variable to get information from elements in the XML.
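As a small illustration of what the parsed object offers, the sketch below reads the root element's tag name and wraps parsing in a helper that tolerates malformed input. The `try_parse` helper and the shortened `SAMPLE` string are assumptions for this example; the `ExpatError` exception is what minidom's underlying parser raises when the XML is not well-formed.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url><loc>https://gabrielromualdo.com/</loc></url>
<url><loc>https://gabrielromualdo.com/about/</loc></url>
</urlset>"""

document = minidom.parseString(SAMPLE)
root = document.documentElement  # the top-level <urlset> element
root_tag = root.tagName          # "urlset"


def try_parse(xml_text):
    """Return a parsed Document, or None if the XML is not well-formed."""
    try:
        return minidom.parseString(xml_text)
    except ExpatError:
        return None
```

Returning None is just one option; in a test suite it may be more useful to let the exception propagate so a broken sitemap fails loudly.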
2. Get Elements by Tag Name
Getting elements by tag name from parsed XML in xml.dom.minidom is pretty simple: use the getElementsByTagName method.
For example, in my case I want a list of elements with the tag name loc:
elements = parsedXML.getElementsByTagName('loc') # any tag name in place of 'loc' is valid
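The same method also exists on individual elements, which is handy when a lookup should be scoped to one part of the document. The sketch below uses a trimmed copy of the sitemap from earlier; the variable names are only for illustration.

```python
from xml.dom import minidom

document = minidom.parseString("""
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>https://gabrielromualdo.com/</loc>
<priority>0.9</priority>
</url>
<url>
<loc>https://gabrielromualdo.com/about/</loc>
<priority>0.6</priority>
</url>
</urlset>
""")

url_elements = document.getElementsByTagName("url")
url_count = len(url_elements)  # 2

# Calling the method on an element searches only that element's descendants.
first_url = url_elements[0]
first_priority = first_url.getElementsByTagName("priority")[0]

# A tag name that never appears gives an empty list rather than an error.
missing = document.getElementsByTagName("lastmod")
```

Because a missing tag yields an empty list, indexing the result with `[0]` is only safe after checking its length.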
3. Get the Inner Text of a Particular Element
Now that I had a list of loc elements, I needed to get the text of each element and add it to a separate list.
To get text from an element in xml.dom.minidom, read the nodeValue of the element's firstChild, like this: element.firstChild.nodeValue.
In my case, here's how I looped through each element and added the text content to a variable:
elements = parsedXML.getElementsByTagName('loc')
locations = []
for element in elements: # loop through 'loc' elements
    locations.append(element.firstChild.nodeValue) # get inner text of each element and add to list
print(locations) # ['https://gabrielromualdo.com/', 'https://gabrielromualdo.com/about/']
In getting the text of an element, the firstChild property gets the text node from the element, and the nodeValue property gets the value of that particular text node.
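One caveat worth knowing: an empty element has no child nodes, so its firstChild is None and reading nodeValue from it raises an AttributeError. A more defensive variant, sketched below, joins every text node directly inside the element. The `get_text` helper is a name made up for this example, not part of minidom.

```python
from xml.dom import minidom


def get_text(element):
    """Join the values of all text nodes directly inside an element."""
    return "".join(
        child.nodeValue
        for child in element.childNodes
        if child.nodeType in (child.TEXT_NODE, child.CDATA_SECTION_NODE)
    )


document = minidom.parseString(
    "<urlset><loc>https://example.com/</loc><loc></loc></urlset>"
)
first, empty = document.getElementsByTagName("loc")

first_text = get_text(first)  # "https://example.com/"
empty_text = get_text(empty)  # "" (empty.firstChild is None here)
```

For a sitemap, where every loc element is expected to hold a single URL, the shorter firstChild.nodeValue form is usually enough.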
Conclusion
You can see the final source code of this article here at my tutorials repo.
I hope you enjoyed this post and found it useful for parsing XML in Python. I spent some time reading documentation to get this working, so I thought I'd write this post to help anyone else doing the same.
Thanks for scrolling.
— Gabriel Romualdo