
Wrangling the BOJ Data

October 31, 2024

This post will center around the data wrangling required to create a comprehensive database of BOJ JGB holding reports (2001 – present). I don’t do this professionally, though maybe(?) I would enjoy it, so don’t expect best practices here. The methods used to download, clean, combine, and check data are described below.

To start, let’s introduce the data reporting styles. Over the 25 years of data reporting, the BOJ has provided five distinct styles of report, which can be seen in the table below.

Date                   | Zipped Folder? | Download Type | Data display
2001 – Nov. 2010       | No             | N/A           | HTML table
Dec. 2010 – Mar. 2013  | No             | PDF           | PDF
Apr. 2013 – Apr. 2014  | Yes            | Excel         | 3 columns of data
May 2014 – Dec. 2016   | Yes            | Excel         | 1 column of data
2017 – Present         | No             | Excel         | 1 column of data
Different ways the Bank of Japan has reported JGB holdings over the years.

I took screenshots of the four distinct reporting styles – HTML, PDF, Excel (3 column), Excel (1 column). Each style of data reporting required a unique download solution.

2001 – Nov. 2010 reporting style. HTML tables on the BOJ website.
Dec. 2010 – Mar. 2013 reporting style. PDFs…bleh.
Apr. 2013 – Apr. 2014 reporting style. Excel sheet with 3 data columns.
May 2014 – Present reporting style. Excel sheet with a single data column.

Priority number one was to get the data onto my computer in the best format possible. I didn’t always succeed at the formatting part. For example, when I downloaded the HTML tables into CSV files, it was easiest to end up with two columns of data, which added another spreadsheet style to deal with later when combining. See below.

New spreadsheet style just dropped. Result of turning the HTML tables into CSV files.
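For reference, here is a minimal sketch of how an HTML report table can be pulled down into a two-column CSV with pandas. The URL, column positions, and file name are placeholders, not the actual BOJ page layout or my exact download script.

# Rough sketch: pull an old-style HTML report table into a two-column CSV.
# The URL and column positions below are placeholders, not the real BOJ layout.
import pandas as pd

report_url = "https://www.boj.or.jp/path/to/old/report.htm"  # placeholder URL

tables = pd.read_html(report_url)      # every table found on the page
holdings = tables[0]                   # assume the holdings table is the first one

two_col = holdings.iloc[:, [0, 1]]     # keep the issue label and face value columns
two_col.columns = ["issue", "face_value"]
two_col.to_csv("CSVs/2005-03-31.csv", index=False, header=False)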

As I started to download the data piecemeal and in different formats, I had decisions to make. Work now, or work later? What I mean by that is, should I get all the data in the exact same format now, or worry about it later when I’m reading into a database?

I chose work later, mostly. Besides a little bit of focus to keep the variety of spreadsheet styles from growing too large, I accepted variations in spreadsheets.

After my first pass I had a gap: 2010 – 2014 was ugly. The PDFs and 3-column Excel sheets were killing the vibe, so I left them for later.

Aside from that four year stretch, though, the data was looking good! The ‘modern reporting style’ from 2014 on, CSVs with one data column, was its own group, and the 2001 – 2010 HTML tables converted to CSVs with two data columns were a separate group.

Reading CSVs into a Database

Currently I have a really crude ‘database’ on my computer. The data is stored in folders labeled ‘originals’, ‘CSVs’, and ‘backups’. The originals are exactly as I downloaded them, the CSVs are what the analysis scripts read (often reformatted from the originals), and the backups folder is just a copy of the CSVs folder.

Remember all that work I said I chose to put off for later? Well now is later.

A Python script called ‘create-master-dicts.py’ creates dictionaries from the CSV folder’s cleaned data. Because I didn’t download and format all the data the same way, this script has to do some heavy lifting.

As I started writing ‘create-master-dicts.py’ there were questions to answer: how is the dictionary structured? Do I include totals? What do I do with the gibberish text entries? Do I reformat the dates?

In the end, the master dictionary structure looked like this:

{
    reportdate:  {cusip: holdingvalue, cusip2: holdingvalue, …},
    reportdate2: {cusip: holdingvalue, cusip2: holdingvalue, …},
    reportdate3: {cusip: holdingvalue, cusip2: holdingvalue, …},
    …
}

Yes, it is a dictionary of dictionaries… but it made the most sense! Since reports are named by date, lookup is easy, and the dictionary value for each report simply contains all the CUSIPs and face value data.

I ended up finding it useful to add totals to the dictionaries as well: the total face value for each bond type, and the total face value of all bond types.
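To make that concrete, here is a minimal sketch of the dictionary build, assuming every cleaned CSV is named by its report date and holds two columns (CUSIP, face value). The paths and the ‘TOTAL’ key are illustrative, not the exact contents of create-master-dicts.py.

# Minimal sketch of the master dictionary build. Assumes each cleaned CSV is
# named by its report date and has two columns: CUSIP and face value.
import csv
from pathlib import Path

master_dict = {}

for csv_path in sorted(Path("CSVs").glob("*.csv")):
    report_date = csv_path.stem                      # e.g. "2015-06-30"
    holdings = {}
    with csv_path.open(newline="") as f:
        for cusip, face_value in csv.reader(f):
            holdings[cusip] = float(face_value)
    holdings["TOTAL"] = sum(holdings.values())       # grand total across all issues
    master_dict[report_date] = holdings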

Dealing with 2010 – 2014

The problem era. They really must have had something to hide in here because this shit is messy for no reason.

Converting the PDFs was a huge challenge – there is no great free software for turning a PDF into a spreadsheet. I eventually found a pretty good one (shoutout Zamzar), but it only allowed 2 free conversions in a 24hr period. I had 28 PDFs to convert and two weeks at the beach… every day that I remembered, I converted two PDFs to spreadsheets.

While the conversion was pretty good, I did have to go in and reformat the CSV file and make sure everything was in the right place. Interestingly, the same spots would almost always be off. See an example below.

The final version of these CSVs ended up being a two-column format that matched the CSVs created from the HTML tables (2001 – 2010).

To fix the three column Excel files… I had to do more manual work. A couple hours of moving things around and we had more two-column formatted CSVs, ready to go!

Dealing with the 2010 – 2014 data was a relief to have done, as it meant the entire historical dataset was complete! I archived the datadownloader Python file, made sure the ‘CSVs’, ‘backups’, and ‘originals’ folders were flawless, and moved on.

This isn’t to say there weren’t loose ends. The Inflation Indexed and Floating Rate bonds were problematic due to the text becoming gibberish. The Green Bonds, which started getting issued in 2023, were left for later as well.

Continually Updating Data

Once the historical data set was complete and reading into the master dictionaries cleanly, I had to worry about new data.

I took a chunk of the datadownloader Python script and created ‘dataupdater.py’, which checks for new reports when run by comparing the latest reports to existing ones in the local CSV folder. Since all the new reports are the same format (single column Excel), it is easy to download, convert to CSV, and save the backups and originals.
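Roughly, the check works like the sketch below, with hypothetical helper functions standing in for the real scraping and download logic.

# Sketch of the update check: compare report dates already in the local CSV
# folder against the dates listed on the BOJ site. The two helpers below are
# hypothetical stand-ins, not the actual dataupdater.py code.
from pathlib import Path

def fetch_report_dates():
    """Hypothetical: scrape the BOJ release page and return the listed report dates."""
    return []   # placeholder

def download_report(report_date):
    """Hypothetical: download the Excel file, convert it to CSV, copy to originals/backups."""
    pass        # placeholder

def find_new_reports(csv_dir="CSVs"):
    on_disk = {p.stem for p in Path(csv_dir).glob("*.csv")}   # report dates already saved
    return sorted(set(fetch_report_dates()) - on_disk)

for report_date in find_new_reports():
    download_report(report_date)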

For a while I used a Raspberry Pi cron job to continually check for new reports, but given how quick it is to run dataupdater.py manually, the Pi has not been a priority. It works, it’s cool, I wrote about it here, it just isn’t that useful at the moment.

Expanding Create Master Dicts

The original version of ‘create-master-dicts.py’ created one master dictionary in the format shown above. It became apparent to me that it would be useful to have other master dictionaries – namely ones that made exploring change in holdings between reports easy.

To do this I added a ‘master-purchase-dict’ and a ‘master-difference-dict’. These are formatted the exact same way as the master dictionary, but the CUSIP and total values are either the change in value from the last report (difference dict) or the purchase amount since the last report, ignoring roll-offs and sales (purchase dict).
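Continuing the earlier sketch, the difference dictionary can be built by walking consecutive report dates in the master dictionary; the purchase dictionary would be the same loop, keeping only the positive changes.

# Sketch: build the difference dictionary from consecutive reports.
# Assumes the report-date keys sort chronologically (e.g. "YYYY-MM-DD" strings).
master_difference_dict = {}

report_dates = sorted(master_dict)
for prev_date, curr_date in zip(report_dates, report_dates[1:]):
    prev, curr = master_dict[prev_date], master_dict[curr_date]
    master_difference_dict[curr_date] = {
        cusip: value - prev.get(cusip, 0)   # change in face value since the last report
        for cusip, value in curr.items()
    }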

Was creating additional master dictionaries necessary? No.

What it did was save me a lot of work later on. With the difference dictionary, I can easily chart the change in holdings by report date, a key factor in seeing whether the BOJ is increasing or decreasing its total holdings. Instead of writing lines of dictionary math to compute differences in the same place I’m setting up data frames and charts, I just access the master difference dictionary!
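As an example of what that looks like in practice, a quick chart of the change in total holdings can be as simple as the sketch below, assuming the ‘TOTAL’ key from the earlier sketches and matplotlib.

# Example use: chart the change in total holdings per report date.
# Assumes the "TOTAL" key from the sketches above and matplotlib installed.
import pandas as pd
import matplotlib.pyplot as plt

total_changes = pd.Series(
    {date: report["TOTAL"] for date, report in master_difference_dict.items()}
)
total_changes.plot(kind="bar", title="Change in BOJ JGB holdings by report date")
plt.tight_layout()
plt.show()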

The master dictionaries are used in every aspect of the data analysis and exploration. The standardized structure makes it easy for me to get the needed data, and provides a good foundation for writing more in-depth analysis programs.

This project has shown me that data wrangling is often a huge piece of the work. It took me months of working on this hobby project to get all the data together, cleaned, and checked. While this piece has covered the majority of the work and the big challenges, there were dozens of other small roadblocks, anomalies, and hiccups to overcome.
