Homework no. 8
Daily Dispatch to .tsv
After the same imports as last week, we define a small function that strips XML tags from any string we pass to it. Then we loop over every file in the directory. We split each file's content at the article markers, <div3 type=, take the first of the resulting pieces (everything before the first article), and split it at the marker <date value=" to isolate the date. The ten characters that follow this marker are the date we are interested in, so we save them to the variable date. Then we loop over the individual articles. From each one we split off the small fragment left over from the <div3 tag and keep it, as it denotes the article type. In a similar fashion we get the content of the head and of the article, to which we apply the remove_xml function. Line breaks in both are replaced with the placeholder %%%%%, which keeps each record on a single line.
Then all variables of interest are added to a dictionary we defined earlier, keyed by a running counter. When the loop has concluded, we convert the dictionary to a dataframe, check whether everything looks right, and then save it as a tab-separated file.
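Before running this over the whole directory, the splitting logic can be tried on a toy string. The following is only a minimal sketch: sample is a made-up miniature file (the real Daily Dispatch markup is richer), and rows is a hypothetical stand-in for the dictionary. Note that this sketch prepends '<' before stripping tags for both the header and the body, so that the leftover fragment of the opening tag is removed as well.

```python
import re


def remove_xml(x):
    # Drop anything that looks like a tag: '<', any non-'>' characters, '>'
    return re.sub(r'<[^>]*>', '', x)


# Hypothetical miniature file, invented for illustration only
sample = (
    '<text><body><div1><date value="1860-11-01">November 1, 1860</date>'
    '<div3 type="article" n="1"><head>Local Matters</head>'
    '<p>The weather was fine.</p></div3>'
    '<div3 type="advertisement" n="2"><head>For Sale</head>'
    '<p>A fine horse.</p></div3></div1></body></text>'
)

# Piece 0 is everything before the first article; the rest are articles
article_list = re.split('<div3 type=', sample)
# The ten characters after the date marker are the date
date = re.split('<date value="', article_list[0])[1][:10]

rows = []
for a in article_list[1:]:
    # Each piece starts with the quoted type left over from the <div3 tag
    article_type = re.split('"', a)[1]
    # Prepending '<' turns that leftover into a tag that remove_xml strips
    header = remove_xml('<' + re.split('</head>', a)[0])
    body = remove_xml('<' + re.split('</div3>', a)[0])
    rows.append((date, article_type, header, body))

for row in rows:
    print(row)
```

Because the body is everything up to the closing </div3>, it still starts with the header text; in this toy input there is no line break between the two, so they run together.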
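# Imports assumed to match last week's (they are not shown in this section)
import os
import re

import pandas as pd
from tqdm import tqdm_notebook


def remove_xml(x):
    """Strip anything that looks like an XML tag from the string x."""
    return re.sub(r'<[^>]*>', '', x)


directory = 'richmond'
results = {}  # running counter -> [date, article_type, article_header, article]
count = 0

for filename in tqdm_notebook(os.listdir(directory)):
    if filename.endswith('.xml'):
        with open(os.path.join(directory, filename), 'r', encoding='utf8') as f:
            # Piece 0 is everything before the first article; the rest are articles
            article_list = re.split('<div3 type=', f.read())
        # The ten characters after the date marker are the date
        date = re.split('<date value="', article_list[0])[1][0:10]
        for a in article_list[1:]:
            # Each piece starts with the quoted type left over from the <div3 tag
            article_type = re.split('"', a)[1]
            # Prepending '<' turns that leftover into a tag, so remove_xml drops it too
            article_header = remove_xml('<' + re.split('</head>', a)[0]).replace('\n', '%%%%%')
            article = remove_xml('<' + re.split('</div3>', a)[0]).replace('\n', '%%%%%')
            results[count] = [date, article_type, article_header, article]
            count += 1

results_df = pd.DataFrame.from_dict(results, orient='index', columns=['date', 'article_type', 'article_header', 'article'])
results_df.to_csv('results.tsv', sep='\t', escapechar='\\')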
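When we read results.tsv back in, the %%%%% placeholders have to be turned into line breaks again. The following is a minimal sketch of that step, using only the standard library csv module rather than pandas. Here sample_tsv is a made-up excerpt in the layout that to_csv writes (the unnamed first column is the dataframe index), and PLACEHOLDER and restore_newlines are hypothetical names introduced for illustration.

```python
import csv
import io

PLACEHOLDER = '%%%%%'  # the newline stand-in used when the file was written


def restore_newlines(text):
    """Turn the placeholder back into real line breaks."""
    return text.replace(PLACEHOLDER, '\n')


# Hypothetical two-row excerpt, invented for illustration only
sample_tsv = (
    "\tdate\tarticle_type\tarticle_header\tarticle\n"
    "0\t1860-11-01\tarticle\tLocal Matters\t"
    "Local Matters%%%%%First line.%%%%%Second line.\n"
    "1\t1860-11-01\tadvertisement\tFor Sale\tFor Sale%%%%%A fine horse.\n"
)

# Each row becomes a dict keyed by the column names in the header line
reader = csv.DictReader(io.StringIO(sample_tsv), delimiter='\t')
rows = [
    {key: restore_newlines(value) for key, value in row.items()}
    for row in reader
]

print(rows[0]['article'])
```

For the real file, the io.StringIO object would be replaced by an open file handle on results.tsv, opened with newline='' as the csv module recommends.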