---
theme: gaia
_class: lead
paginate: true
footer: Computational Thinking and Social Science | :copyright: Matti Nelimarkka | 2023 | Sage Publishing
marp: true
---
- Connect to the Internet via libraries and extract meaningful content from it for your work.
- Use libraries to collect data from hypertext documents.
- Read application programming interface (API) documentation.
- Collect data from online APIs that do not require authentication.
- Read and store data in .json format.
- Compare the .csv and .json forms of data storage.
- Increasingly, we repurpose existing data for our research.
- We now discuss how computers can help collect, store, and manipulate such data.
- Remember: no quantity of data can rectify a dull or, worse, irrelevant research question.
Connecting to web resources with the help of the `requests` library.
import requests
## collect the Web site example.com
response = requests.get('http://www.example.com')
website_content = response.text
print( website_content )
<!doctype html>
<html>
<head>
<title>Example Domain</title>
<meta charset="utf-8" />
</head>
<body>
<div>
<h1>Example Domain</h1>
<p>This domain is for use in illustrative examples in documents. You may use this domain in literature
without prior coordination or asking for permission.</p>
<p><a href="https://www.iana.org/domains/example">More information...</a></p>
</div>
</body>
</html>
- `<p>` separates paragraphs
- `<a>` indicates a link
Knowing the semantic meaning of HTML tags, websites can be analysed further using dedicated libraries.
In Python, use BeautifulSoup; in more difficult cases (e.g. pages rendered with JavaScript), use Selenium.
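The example.com page collected above can be parsed with BeautifulSoup; a minimal sketch, keeping the page content in a string for illustration (assumes `beautifulsoup4` is installed):

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

# The HTML collected earlier, stored as a string for illustration
website_content = """<!doctype html>
<html>
  <head><title>Example Domain</title></head>
  <body>
    <p>This domain is for use in illustrative examples in documents.</p>
    <p><a href="https://www.iana.org/domains/example">More information...</a></p>
  </body>
</html>"""

# Parse the HTML and use the semantic meaning of its tags
soup = BeautifulSoup(website_content, 'html.parser')
print(soup.title.string)              # the page title
for paragraph in soup.find_all('p'):  # the text of each paragraph
    print(paragraph.get_text())
for link in soup.find_all('a'):       # the target of each link
    print(link.get('href'))
```

In a real scraper, `website_content` would come from `requests.get(...).text` as shown above.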
Web services define APIs: in essence these are grammars
- defining what data can be collected from the service
- how the data are to be requested properly
- the format in which the data are returned to the requester
Requests:
https://data.police.uk/api/crimes-street/all-crime?lat=51.5073&lng=-0.171505
Response (partial):
{"category":"anti-social-behaviour",
 "location_type":"Force",
 "location":{"latitude":"51.517535",
             "street":{"id":1670905,"name":"On or near A4206"},
             "longitude":"-0.182180"},
 "context":"",
 "outcome_status":null,
 "persistent_id":"","id":104301433,"location_subtype":"","month":"2022-08"}
- Comma-separated values (CSV) compared to the dictionary-style data format JSON.
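The police.uk request shown above can be issued with the `requests` library; a sketch assuming a working network connection (this API needs no authentication):

```python
import requests

# Street-level crimes from the police.uk API
url = 'https://data.police.uk/api/crimes-street/all-crime'
response = requests.get(url, params={'lat': 51.5073, 'lng': -0.171505})

# Parse the JSON response into Python lists and dictionaries
crimes = response.json()
for crime in crimes[:3]:
    print(crime['category'], crime['month'])
```

Passing the coordinates via `params` lets `requests` build and encode the query string for us.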
- CSV requires that all rows have the same number of columns.
- JSON allows more complex structures, such as lists, to be used.
import csv
for row in csv.reader( open("emperors.csv") ):
    name = row[0]
    birth_year = float( row[1] )
    death_year = float( row[2] )
    start_of_reign = float( row[3] )
    end_of_reign = float( row[4] )
[
{
"id": 1,
"text": "This post has no Likes",
"likes": []
}, {
"id": 2,
"text": "This post has two Likes",
"likes": [
"John Smith",
"Jane Smith"
]
}
]
import json
data = json.load( open('data.json') )
for post in data:
    print( post['id'], post['text'] )
    liked_by = post['likes']
    for user in liked_by:
        print('  Liked by', user )
- Always archive the original 'raw' data before any cleaning or processing.
- Document each processing and wrangling step.
- Finally, rerun all processing steps to ensure nothing has been forgotten.
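A minimal sketch of this workflow, using a hypothetical raw record and hypothetical file names (`crimes_raw.json`, `crimes_cleaned.json`):

```python
import json

# A raw record as it arrived from a (hypothetical) collection step
raw_text = '[{"category": "anti-social-behaviour", "month": "2022-08", "context": ""}]'

# Step 1: archive the untouched raw data before doing anything else
with open('crimes_raw.json', 'w') as f:
    f.write(raw_text)

# Step 2: do all cleaning in code that reads the archived copy,
# so every step is documented and the whole process can be rerun
raw = json.load( open('crimes_raw.json') )
cleaned = [ {'category': r['category'], 'month': r['month']} for r in raw ]

with open('crimes_cleaned.json', 'w') as f:
    json.dump(cleaned, f)
```

Because the raw file is never modified, deleting `crimes_cleaned.json` and rerunning the script reproduces the cleaned data exactly.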
- What kinds of steps are carried out in a Web-scraping process?
- Why are APIs preferable to Web scraping for collection of data?
- How can scholars access and use data not found online?
- Why might the .json data format be preferred over .csv files and binary dumps?
- What should you remember when working with data?

