Parsing the Published Date of a Medium Article with Beautiful Soup in Python

When scraping web pages in Python, a common task is extracting one specific piece of information from a page. In this case, we want to read the published date of an article on Medium using the Beautiful Soup library.

Option 1: Using CSS Selectors

from bs4 import BeautifulSoup
import requests

# Make a request to the webpage
url = "https://medium.com/article-url"
response = requests.get(url)

# Create a BeautifulSoup object
soup = BeautifulSoup(response.text, "html.parser")

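# Use a CSS selector to find the <time> element; select_one() returns None if nothing matches
time_tag = soup.select_one(".postMetaInline time")
published_date = time_tag["datetime"] if time_tag is not None else None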

print(published_date)

In this option, we use Beautiful Soup's select_one() method to find the first element that matches the CSS selector “.postMetaInline time”. This selector targets a <time> element nested anywhere inside an element with the class postMetaInline (here a <div>), which usually contains the published date of an article on Medium. The date itself is read from the element's “datetime” attribute.
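To try the selector without making a network request, here is a minimal sketch that runs it against a hand-written HTML fragment. The markup in SAMPLE_HTML and the helper name extract_published_date are assumptions for illustration, modeled on the structure described above; real Medium pages may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modeled on the structure described above;
# it is not copied from a real Medium page.
SAMPLE_HTML = """
<div class="postMetaInline">
  <a href="#"><time datetime="2019-03-14T09:26:53.000Z">Mar 14, 2019</time></a>
</div>
"""


def extract_published_date(html):
    """Return the datetime attribute of the first matching <time>, or None."""
    soup = BeautifulSoup(html, "html.parser")
    time_tag = soup.select_one(".postMetaInline time")
    if time_tag is None:
        # select_one() returns None when nothing matches the selector
        return None
    # get() returns None instead of raising if the attribute is missing
    return time_tag.get("datetime")


print(extract_published_date(SAMPLE_HTML))
print(extract_published_date("<p>No date here</p>"))
```

Returning None instead of indexing directly keeps the script from crashing when the page layout does not match the selector.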

Option 2: Using Class Names

from bs4 import BeautifulSoup
import requests

# Make a request to the webpage
url = "https://medium.com/article-url"
response = requests.get(url)

# Create a BeautifulSoup object
soup = BeautifulSoup(response.text, "html.parser")

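# Find the container by class name, then the <time> element inside it;
# find() returns None when nothing matches
meta = soup.find(class_="postMetaInline")
time_tag = meta.find("time") if meta is not None else None
published_date = time_tag["datetime"] if time_tag is not None else None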

print(published_date)

In this option, we use Beautiful Soup's find() method to find the first element that has the class name “postMetaInline”. Then, we call find() again on that element to locate the <time> element inside it, and finally, we extract the value of its “datetime” attribute.
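The value of the “datetime” attribute is a plain string. If you need a real date object, a sketch like the following may help. It assumes the attribute holds an ISO 8601 timestamp such as the one in the hypothetical SAMPLE_HTML below; parse_iso_timestamp is an illustrative helper, not part of Beautiful Soup.

```python
from datetime import datetime

from bs4 import BeautifulSoup

# Hypothetical markup; the timestamp format is an assumption.
SAMPLE_HTML = """
<div class="postMetaInline">
  <a href="#"><time datetime="2019-03-14T09:26:53.000Z">Mar 14, 2019</time></a>
</div>
"""


def parse_iso_timestamp(value):
    """Convert an ISO 8601 string to a timezone-aware datetime."""
    # Before Python 3.11, datetime.fromisoformat() rejects a trailing "Z",
    # so rewrite it as an explicit UTC offset first.
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    return datetime.fromisoformat(value)


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
meta = soup.find(class_="postMetaInline")
time_tag = meta.find("time") if meta is not None else None
published = (
    parse_iso_timestamp(time_tag["datetime"]) if time_tag is not None else None
)

print(published)
```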

Option 3: Using XPath

from bs4 import BeautifulSoup
import requests
from lxml import etree

# Make a request to the webpage
url = "https://medium.com/article-url"
response = requests.get(url)

# Create a BeautifulSoup object
soup = BeautifulSoup(response.text, "html.parser")

# Re-parse the serialized soup with lxml, since Beautiful Soup has no XPath support
lxml_object = etree.HTML(str(soup))

# Use XPath to find the published date; xpath() returns a list of matches
matches = lxml_object.xpath('//div[@class="postMetaInline"]/a/time/@datetime')
published_date = matches[0] if matches else None

print(published_date)

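In this option, we first serialize the BeautifulSoup object back to a string and re-parse it with the etree.HTML() function from the lxml library, because Beautiful Soup itself does not support XPath. Then, we use an XPath expression to find the <time> element directly inside an <a> element, which is directly inside the <div class="postMetaInline"> element, and extract the value of its “datetime” attribute. Note that this expression is stricter than the selectors in Options 1 and 2: it matches only when the class attribute is exactly “postMetaInline” and the elements are nested exactly as written.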
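One detail worth checking with XPath is how the class attribute is compared. The sketch below parses a hand-written fragment directly with lxml and contrasts an exact class comparison with a class-token match. The markup, the extra class name u-floatLeft, and the helper xpath_first are assumptions made up for this illustration.

```python
from lxml import etree

# Hypothetical markup: the container carries a second, made-up class name.
SAMPLE_HTML = """
<div class="postMetaInline u-floatLeft">
  <a href="#"><time datetime="2019-03-14T09:26:53.000Z">Mar 14, 2019</time></a>
</div>
"""

# Exact comparison: matches only if the whole class attribute is "postMetaInline".
EXACT_XPATH = '//div[@class="postMetaInline"]/a/time/@datetime'

# Token match: matches any element whose class list contains "postMetaInline",
# and any <time> descendant of it, similar to the CSS selector in Option 1.
CLASS_TOKEN_XPATH = (
    '//*[contains(concat(" ", normalize-space(@class), " "), " postMetaInline ")]'
    '//time/@datetime'
)


def xpath_first(html, expression):
    """Return the first XPath result for the given HTML string, or None."""
    tree = etree.HTML(html)
    matches = tree.xpath(expression)
    return matches[0] if matches else None


print(xpath_first(SAMPLE_HTML, EXACT_XPATH))
print(xpath_first(SAMPLE_HTML, CLASS_TOKEN_XPATH))
```

With this fragment, the exact comparison finds nothing because the class attribute contains two names, while the token match still finds the timestamp.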

Which of these three options is best depends on the structure of the page you are scraping. Options 1 and 2 are the most straightforward and do the same thing here: both look for a <time> element inside an element with the class postMetaInline. Option 3 gives you the full expressiveness of XPath, but it needs lxml as an extra dependency and an extra step to re-parse the document. All three depend on Medium's class names and markup, which can change, so check the selector against the actual page and choose the option that works best for your specific case.
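If you are unsure which selector will hold up, one possible approach is to try several in order and take the first hit. This is only a sketch: the selector list, including the generic “time[datetime]” fallback, and the helper name find_published_date are assumptions, and a generic fallback may pick up an unrelated <time> element on some pages.

```python
from bs4 import BeautifulSoup

# Candidate selectors, tried from most specific to most generic (assumed list).
CANDIDATE_SELECTORS = (
    ".postMetaInline time[datetime]",
    "time[datetime]",
)


def find_published_date(html, selectors=CANDIDATE_SELECTORS):
    """Return the datetime attribute from the first selector that matches."""
    soup = BeautifulSoup(html, "html.parser")
    for selector in selectors:
        tag = soup.select_one(selector)
        if tag is not None:
            # The [datetime] part of each selector guarantees the attribute exists
            return tag["datetime"]
    return None


# Hypothetical page without the postMetaInline container
html = '<article><time datetime="2021-05-01T00:00:00Z">May 1</time></article>'
print(find_published_date(html))
```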
