Any suggestions on extracting information from Companies House XBRLs?

Clinton · Dec 30, 2019

Any suggestions on extracting information from iXBRLs from CH?

I have a bunch of companies whose financials I need to monitor. I need some way of getting iXBRL files from Companies House and extracting the data from there to put into a spreadsheet / database. I'm aware of the CH API and there are one or two parsing projects in development (examples: https://github.com/ONSBigData/parsing_company_accounts and https://www.codeproject.com/Articles/1227765/Parsing-XBRL-with-Python). Most use the python BeautifulSoup library. But by the coders' honest admission, their XBRL parsers are a bit flaky.

And I don't want to get into OCR-ing PDFs.

Anyone has any other suggestions?

There needs to be a way of monitoring CH for changes to the companies I'm tracking. When there's a new iXBRL, I would need it downloaded, parsed and the data added to my spreadsheet / database so I can make graphs showing, for example, how the company's net asset figure has changed over the years. (I have multiple VPS accounts and can setup a cron job if that would make the monitoring bit easier).

The daily zip file from Companies House is not that useful in this context (http://download.companieshouse.gov.uk/en_accountsdata.html).

My ideal solution would require visiting individual company pages at CH and downloading the iXBRLs. I tried places like Freelancer to find someone to do this and nothing turns up in searches of XBRL so no easy way to find any talent.

https://www.bizdb.co.uk/ does collate and publish some data, but it doesn't seem to do that great a job. For several companies I checked in their database there were gaps in the tables where numbers needed to be. And, besides, bizbd don't offer a licence to get what I want, so I'd have to scrape their data, page by page. This is not something that's worth doing given the poor quality of data there.

Help!

The Byre · Dec 30, 2019

DueDil claim to cleanse their data, but everybody seems to have gaps -

https://www.duedil.com/plans

Gordon - Commercial Finance · Dec 30, 2019

Sorry Clinton all those letters mean very little to me, but in terms of summarising the data, CreditSafe does it well. Paid subscription required, but it’s well worth the cost.

They seem to have found a way to summarise the financials from the companies house pdfs, so if you can find a way to figure out what they’ve done, that would be ideal I guess, but if not then it might be a better source of data in itself.

Mark T Jones · Dec 30, 2019

I’d go with the above. Personally I use cocredo - though I’ve no idea whether they are better or worse than the others

in a nutshell, companies House is simply a repository for fillies, the others extrapolate and present data in varying forms. All will offer a free trial (and are negotiable on price). So you can see which best suits your needs

Clinton · Dec 30, 2019

I do have a £250 voucher for CreditSafe so should give that a try.

But I'm guessing that both CreditSafe and Duedil would have T&Cs blocking me from reproducing that data elsewhere.

I was thinking if I got the data directly from CH then I'd be free to publish the data itself and to publish analyses / graphs derived from that data.

Thanks for the cocredo suggestion. I wasn't aware of them. Not sure whether they provide just credit reports or whether I'd be able to get balance sheet items from them. I'll check them out.

Mark T Jones · Dec 30, 2019

Interesting question re reproducing data. Effectively the information they have is all in the public domain so other than the presentation style it’s hard to see how they could police it.

Clinton · Dec 30, 2019

I guess they are adding value to the data by cleaning it, organising it, manually correcting errors. That's an IP investment.

If I were in their place (and I have been in the past and will be again shortly), I would insert poison pills in the data to enable algorithmic detection of reproductions elsewhere that are trying to pass themselves off as based on publicly available data rather than on data scraped from my website.

fisicx · Dec 30, 2019

It seems that most providers have an API eg: https://www.duedil.com/api

Which means extracting the data and updating a spreadsheet shouldn’t be too difficult.

Financial-Modeller · Dec 30, 2019

A few random thoughts @Clinton

I don't have a working knowledge of analysing iXBRL data, but recall that there are some Excel Add-Ins available that may be of use - if you're ultimately planning to use Excel.

Article about Xinba et al:
https://www.journalofaccountancy.com/issues/2013/apr/20126677.html

This feels like a perfect application for Python though. Someone somewhere will be developing the tool that you will need if you have time to wait for it.

On a practical level, I assume that the universe of companies that will be under your coverage is sufficiently large that a daily alert that "companies X, Y and Z have updated their R&A at CH" and then manual update (yourself or outsourced) is not an option(?)

Finally, have you asked CH what they can offer you?

Clinton · Dec 30, 2019

It seems that most providers have an API eg: https://www.duedil.com/api

Which means extracting the data and updating a spreadsheet shouldn’t be too difficult.

What they say and what they do are very different things!

That's with most providers.

I did speak with Duedil earlier today, after reading their API documentation, and the useless woman couldn't tell me anything! She couldn't even answer the first question about the endpoints provided in their API.

And without knowing what data they are actually making available via the API, there's no point in signing up and paying money! That's not to mention other restrictions they might have in terms of the number of calls allowed per day / month. Anyway, they were going to get back to me by email. Let's see if they do.

On a practical level, I assume that the universe of companies that will be under your coverage is sufficiently large that a daily alert that "companies X, Y and Z have updated their R&A at CH" and then manual update (yourself or outsourced) is not an option(?)

500-1000 companies. So, manual is not practical

The speed of development in this area is glacial. The link you provided was to an article from 2013, a long time ago and we still don't have any good solution. And I suspect those tools that people have already been working on are designed for XBRL, as used by the SEC and others, and further modifications to any existing tool will be needed if the input is CH's iXBRL (which is close to XBRL but not exactly the same).

I haven't thought about asking CH. I'm not sure they'll be any use. But they have a forum for API users. Maybe I'll try that.

dotcomdude · Dec 30, 2019

I'm not familiar with iXBRL myself, but isn't it basically XML format? If so, I would say that a good Python developer would be able to do what you want.

Maybe try searching the freelancer websites again but looking for Python and XML skills?

I'm sure if you actually specced out and posted the job you would get some applicants, but it can still be hard work sorting the jokers from the professionals.

Maybe break it down into one much smaller job such as grabbing one file and parsing it - with payment on completion. Then if you get someone that knows what they're doing you get them to build the system out?

Clinton · Dec 30, 2019

Yeah, I suppose I could do that. However, after examining a few iXBRL files today I reckon it would need to be someone with a sound knowledge of accountancy as well as Python / XML.

KM-Tiger · Dec 30, 2019

Clinton said:
someone with a sound knowledge of accountancy as well as Python / XML

I can't remember his username but there is a Python developer on this forum.

Posted in the Brexit thread a few times.

ffox · Dec 30, 2019

@Clinton - Ha Ha - 1st get a decent SS - Excel will do this fine.

1. Basically you just need to drag the iXBRL file into cell A1 of an Excel workbook. It will automatically read and deploy the data.

2. Scan the results and look for the data you want to extract.

3. Map the fields (cells)

4. Create a macro (VBA) to load each file in turn and read the required data into a summary SS.

You could also do the same using c# across the raw iXBRL file.

Clinton · Dec 30, 2019

You can import XML into LibreOffice Calc as well

And there are free online tools to convert XML to CSV (example)

Basically you just need to drag the iXBRL file into cell A1 of an Excel workbook.

Give it a shot. From all my research, that doesn't work. It may work with an XBRL, sort of (there are various commercial tools available to assist), but then I'd have to run a batch conversion of a whole bunch of iXBRLs to XBRL's XML.

Anyway, this is taking too much time. I'm going to shelve it for the moment and come back to it later.

I am linking to another discussion in UKBF for my own convenience for when I come back to this thread. And there's also a PDF to XBRL conversion service (which is expensive as it requires some manual work to remove mistakes).

Thanks all.

ffox · Dec 30, 2019

Okay. When you revisit it remember that iXBRL. as with any other XBRL, is a fully indexed flat file with a rigid taxonomy. To create a bulk extraction program first define what data you want from the file then discover what tags or labels precede the required data in the iXBRL.
The labels or tags are the index elements and the data will always follow the label.

Scan the iXBRL file with whatever tool you want, spreadsheet macro, VBA, Visual basic, C# or any of the modern workflow tools, register the indexes and read the data.

JacekW · Dec 31, 2019

Hi Clinton,

it's Jack here from Bizdb. I'll send a direct message and maybe I can prepare a custom solution for you.

Mark T Jones · Dec 31, 2019

JacekW said:
Hi Clinton,

it's Jack here from Bizdb. I'll send a direct message and maybe I can prepare a custom solution for you.

just call him. He loves a sales call

gpietersz · Dec 31, 2019

There are a number of Python XBRL libraries, one of them would probably be a good basis for what you want to do.

Otherwise you can use lxml directly - I am not sure why people are using Beautiful Soup here, but not all of them are.

The documents are supposed to be validated, and the whole point of the format is that it is machine readable. I would expect it to be a LOT easier than most web scraping.

It looks to me like a relatively easy problem to solve with a mixture of crawling the CH urls for the companies you want to monitor and lxml parsing of the documents. The devil is in the details though - what info, how much info, whether you need adjusted numbers calculated etc.

From past experience (many years ago) there used to be pretty good (and expensive) global databases of listed company accounts (with adjustment number for comparison across countries), but I doubt that is what you want.

John Hemming · Dec 31, 2019

The point about iXBRL is that it uses xHTML which is both XML and HTML. Hence an HTML parser can work out how to display it and an XML parser can get out the key information in the relevant taxonomy.

The files don't tend to be that big so they don't need a streaming parser and in java a standard DOM parser would be able to handle it and extract the key information to put into a database.

The only complex question arises where there might be one to many relationships in the iXBRL tags, but it may be that this information can be ignored.

John Hemming · Dec 31, 2019

If someone wants to look at this. This link
https://beta.companieshouse.gov.uk/...MWFkaXF6a2N4/document?format=xhtml&download=1

Will download an iXBRL file. If you open it with Chrome and then press Control U you will see the tagging

Eg:

<p>Registered Number <ix:nonNumeric contextRef="dcur4" name="uk-bus:UKCompaniesHouseRegisteredNumber">00854667</ix:nonNumeric>
</p>

This gives a paragraph when printed which gives the company number and also provides a machine readable "uk-bus:UKCompaniesHouseRegisteredNumber" which is the company number. It is the same actual bit sequence of information.

Lovetts Solicitors · Jan 3, 2020

On Clinton's other point about being made aware when a company changes, Companies House have just launched a streaming API service. This needs a bit of technical nous but it's essentially a real-time notification feed which you subscribe to, and the moment Companies House make a change you'll get an electronic 'ping' on your feed which you can then use to trigger a process to re-download information for that company.

Essentially Companies House want to make it easier for people like Clinton to maintain their own database - instead of downloading the full zip file of data every day you would just download the full data once, then use the streaming update service to automatically keep your copy of the data up to date.

Full details at https://developer-specs.companieshouse.gov.uk/streaming-api/guides/overview

Andrew

gpietersz · Jan 4, 2020

As far as I can see the streaming API cannot be filtered (I only had a quick look) so you end up getting everything and with something like 4 million registered companies.

If you only want data for particular companies ( @Clinton says "a bunch of companies") you are better off with the REST API which does seem to do everything you need had has generous limits and is easier to implement.

Lovetts Solicitors · Jan 7, 2020

I think the idea with the streaming API is you listen out for a company number you are interested in (as the stream only pings a company number if there has been a change for that company), and then use that as a trigger to pull updated information for that company. Anything you aren't interested in from the stream you just discard.

But yes, if there's only a small number of companies and you don't need realtime then pulling the data once a day (or even once an hour) should work as well.

gpietersz · Jan 7, 2020

You still have to process it all and the streaming API is probably a bit more complex to implement (you need to make sure you do not miss any data if you get disconnected).

Tenthnote · Nov 5, 2024

Hello

IXBRL data can be complex to work with and while there are software libraries to convert individual IXBRL files, things get trickier when the data processing has to happen for the millions of files available in CH database. You also need an efficient way to query the extracted data.

I have built a free JSON API that you can hook into to pull company financial information i.e. balance sheet, profit & loss statement and cash flow and all other fields that are sent across to Companies House in the standard IXBRL filing. I have a documentation section on the website describing the requests and responses and there is a web search version available too to test it before you write any code to plug the data into Excel.

I can't post a link here but if you google the word fastukcompanysearch , and open the site , the CompaniesHouse IXBRL Converter link is in the menu at the top.

FastUKCompanySearch itself is a related but different Android and iPhone app with due date tracking and filing history download history for UK Companies, again from Companies House.

Thanks

AlanJ1 · Nov 5, 2024

Tenthnote said:
Hello

IXBRL data can be complex to work with and while there are software libraries to convert individual IXBRL files, things get trickier when the data processing has to happen for the millions of files available in CH database. You also need an efficient way to query the extracted data.

I have built a free JSON API that you can hook into to pull company financial information i.e. balance sheet, profit & loss statement and cash flow and all other fields that are sent across to Companies House in the standard IXBRL filing. I have a documentation section on the website describing the requests and responses and there is a web search version available too to test it before you write any code to plug the data into Excel.

I can't post a link here but if you google the word fastukcompanysearch , and open the site , the CompaniesHouse IXBRL Converter link is in the menu at the top.

FastUKCompanySearch itself is a related but different Android and iPhone app with due date tracking and filing history download history for UK Companies, again from Companies House.

Thanks

This thread is over 4 years old.

Scalloway · Nov 5, 2024

AlanJ1 said:
This thread is over 4 years old.

Even so this information may be of interest to the OP and others.

Tenthnote · Nov 13, 2024

Yes, bit of an old thread but google is likely to land people here when someone searches for ways to convert Companies House IXBRL to a more user friendly format as there as it is quite a niche topic

numbersrule · Nov 13, 2024

I don't get what iXBRL is for.

If folks have been having so much hassle reading the files for so long, what do Companies House actually do with all the iXBRL files?

Scalloway · Nov 14, 2024

iXBRL tags enable automated processing of business information by computer software, It cuts out the handling of paper.

fisicx · Nov 14, 2024

Much like XML, iXBRL files should be readable in a browser. It’s a simple tagging language.

numbersrule · Nov 15, 2024

fisicx said:
Much like XML, iXBRL files should be readable in a browser. It’s a simple tagging language.

Yes, thanks fisicx, that seems beneficial, but if it's simple why do so many of the posts in this thread make it sound very difficult to extract data from iXBRL? It seems like the creators have produced a file type that is made up of simple tags that only a select few can actually access.
Ah, maybe that is the intention?

Cookie Consent

Essential

Analytics

Marketing

Any suggestions on extracting information from Companies House XBRLs?

The Byre

ffox

ffox

Lovetts Solicitors

Lovetts Solicitors