Any suggestions on extracting information from Companies House XBRLs?

Clinton

Free Member
  • Business Listing
    Jan 17, 2010
    5,750
    1
    3,070
    ukbusinessbrokers.com
    Any suggestions on extracting information from iXBRLs from CH?

    I have a bunch of companies whose financials I need to monitor. I need some way of getting iXBRL files from Companies House and extracting the data from there to put into a spreadsheet / database. I'm aware of the CH API and there are one or two parsing projects in development (examples: https://github.com/ONSBigData/parsing_company_accounts and https://www.codeproject.com/Articles/1227765/Parsing-XBRL-with-Python). Most use the python BeautifulSoup library. But by the coders' honest admission, their XBRL parsers are a bit flaky.

    And I don't want to get into OCR-ing PDFs.

    Anyone has any other suggestions?

    There needs to be a way of monitoring CH for changes to the companies I'm tracking. When there's a new iXBRL, I would need it downloaded, parsed and the data added to my spreadsheet / database so I can make graphs showing, for example, how the company's net asset figure has changed over the years. (I have multiple VPS accounts and can setup a cron job if that would make the monitoring bit easier).

    The daily zip file from Companies House is not that useful in this context (http://download.companieshouse.gov.uk/en_accountsdata.html).

    My ideal solution would require visiting individual company pages at CH and downloading the iXBRLs. I tried places like Freelancer to find someone to do this and nothing turns up in searches of XBRL so no easy way to find any talent.

    https://www.bizdb.co.uk/ does collate and publish some data, but it doesn't seem to do that great a job. For several companies I checked in their database there were gaps in the tables where numbers needed to be. And, besides, bizbd don't offer a licence to get what I want, so I'd have to scrape their data, page by page. This is not something that's worth doing given the poor quality of data there.

    Help!
     
    Jun 26, 2017
    2,713
    1,012
    Sorry Clinton all those letters mean very little to me, but in terms of summarising the data, CreditSafe does it well. Paid subscription required, but it’s well worth the cost.

    They seem to have found a way to summarise the financials from the companies house pdfs, so if you can find a way to figure out what they’ve done, that would be ideal I guess, but if not then it might be a better source of data in itself.
     
    • Like
    Reactions: Clinton
    Upvote 0
    I’d go with the above. Personally I use cocredo - though I’ve no idea whether they are better or worse than the others

    in a nutshell, companies House is simply a repository for fillies, the others extrapolate and present data in varying forms. All will offer a free trial (and are negotiable on price). So you can see which best suits your needs
     
    • Like
    Reactions: Clinton
    Upvote 0

    Clinton

    Free Member
  • Business Listing
    Jan 17, 2010
    5,750
    1
    3,070
    ukbusinessbrokers.com
    I do have a £250 voucher for CreditSafe so should give that a try. :)

    But I'm guessing that both CreditSafe and Duedil would have T&Cs blocking me from reproducing that data elsewhere.

    I was thinking if I got the data directly from CH then I'd be free to publish the data itself and to publish analyses / graphs derived from that data.

    Thanks for the cocredo suggestion. I wasn't aware of them. Not sure whether they provide just credit reports or whether I'd be able to get balance sheet items from them. I'll check them out.
     
    Upvote 0

    Clinton

    Free Member
  • Business Listing
    Jan 17, 2010
    5,750
    1
    3,070
    ukbusinessbrokers.com
    I guess they are adding value to the data by cleaning it, organising it, manually correcting errors. That's an IP investment.

    If I were in their place (and I have been in the past and will be again shortly), I would insert poison pills in the data to enable algorithmic detection of reproductions elsewhere that are trying to pass themselves off as based on publicly available data rather than on data scraped from my website.
     
    • Like
    Reactions: Mark T Jones
    Upvote 0

    fisicx

    Moderator
    Sep 12, 2006
    46,668
    8
    15,360
    Aldershot
    www.aerin.co.uk
    It seems that most providers have an API eg: https://www.duedil.com/api

    Which means extracting the data and updating a spreadsheet shouldn’t be too difficult.
     
    Upvote 0

    Financial-Modeller

    Free Member
    Jul 3, 2012
    1,523
    626
    London
    A few random thoughts @Clinton

    I don't have a working knowledge of analysing iXBRL data, but recall that there are some Excel Add-Ins available that may be of use - if you're ultimately planning to use Excel.

    Article about Xinba et al:
    https://www.journalofaccountancy.com/issues/2013/apr/20126677.html

    This feels like a perfect application for Python though. Someone somewhere will be developing the tool that you will need if you have time to wait for it.

    On a practical level, I assume that the universe of companies that will be under your coverage is sufficiently large that a daily alert that "companies X, Y and Z have updated their R&A at CH" and then manual update (yourself or outsourced) is not an option(?)

    Finally, have you asked CH what they can offer you?
     
    • Like
    Reactions: Clinton
    Upvote 0

    Clinton

    Free Member
  • Business Listing
    Jan 17, 2010
    5,750
    1
    3,070
    ukbusinessbrokers.com
    It seems that most providers have an API eg: https://www.duedil.com/api

    Which means extracting the data and updating a spreadsheet shouldn’t be too difficult.
    What they say and what they do are very different things! ;)

    That's with most providers.

    I did speak with Duedil earlier today, after reading their API documentation, and the useless woman couldn't tell me anything! She couldn't even answer the first question about the endpoints provided in their API.

    And without knowing what data they are actually making available via the API, there's no point in signing up and paying money! That's not to mention other restrictions they might have in terms of the number of calls allowed per day / month. Anyway, they were going to get back to me by email. Let's see if they do.

    On a practical level, I assume that the universe of companies that will be under your coverage is sufficiently large that a daily alert that "companies X, Y and Z have updated their R&A at CH" and then manual update (yourself or outsourced) is not an option(?)
    500-1000 companies. So, manual is not practical :( The speed of development in this area is glacial. The link you provided was to an article from 2013, a long time ago and we still don't have any good solution. And I suspect those tools that people have already been working on are designed for XBRL, as used by the SEC and others, and further modifications to any existing tool will be needed if the input is CH's iXBRL (which is close to XBRL but not exactly the same).

    I haven't thought about asking CH. I'm not sure they'll be any use. But they have a forum for API users. Maybe I'll try that.
     
    Last edited:
    Upvote 0

    dotcomdude

    Free Member
    Business Listing
    Jul 27, 2018
    532
    110
    I'm not familiar with iXBRL myself, but isn't it basically XML format? If so, I would say that a good Python developer would be able to do what you want.

    Maybe try searching the freelancer websites again but looking for Python and XML skills?

    I'm sure if you actually specced out and posted the job you would get some applicants, but it can still be hard work sorting the jokers from the professionals.

    Maybe break it down into one much smaller job such as grabbing one file and parsing it - with payment on completion. Then if you get someone that knows what they're doing you get them to build the system out?
     
    • Like
    Reactions: Clinton
    Upvote 0
    @Clinton - Ha Ha - 1st get a decent SS - Excel will do this fine.

    1. Basically you just need to drag the iXBRL file into cell A1 of an Excel workbook. It will automatically read and deploy the data.

    2. Scan the results and look for the data you want to extract.

    3. Map the fields (cells)

    4. Create a macro (VBA) to load each file in turn and read the required data into a summary SS.

    You could also do the same using c# across the raw iXBRL file.
     
    • Like
    Reactions: Clinton
    Upvote 0

    Clinton

    Free Member
  • Business Listing
    Jan 17, 2010
    5,750
    1
    3,070
    ukbusinessbrokers.com
    You can import XML into LibreOffice Calc as well ;) And there are free online tools to convert XML to CSV (example)

    Basically you just need to drag the iXBRL file into cell A1 of an Excel workbook.
    Give it a shot. From all my research, that doesn't work. It may work with an XBRL, sort of (there are various commercial tools available to assist), but then I'd have to run a batch conversion of a whole bunch of iXBRLs to XBRL's XML.

    Anyway, this is taking too much time. I'm going to shelve it for the moment and come back to it later.

    I am linking to another discussion in UKBF for my own convenience for when I come back to this thread. And there's also a PDF to XBRL conversion service (which is expensive as it requires some manual work to remove mistakes).

    Thanks all.
     
    Last edited:
    Upvote 0
    Okay. When you revisit it remember that iXBRL. as with any other XBRL, is a fully indexed flat file with a rigid taxonomy. To create a bulk extraction program first define what data you want from the file then discover what tags or labels precede the required data in the iXBRL.
    The labels or tags are the index elements and the data will always follow the label.

    Scan the iXBRL file with whatever tool you want, spreadsheet macro, VBA, Visual basic, C# or any of the modern workflow tools, register the indexes and read the data.
     
    • Like
    Reactions: Clinton
    Upvote 0

    gpietersz

    Free Member
  • Business Listing
    Sep 10, 2019
    2,712
    2
    705
    Northwhich, Cheshire
    pietersz.net
    There are a number of Python XBRL libraries, one of them would probably be a good basis for what you want to do.

    Otherwise you can use lxml directly - I am not sure why people are using Beautiful Soup here, but not all of them are.

    The documents are supposed to be validated, and the whole point of the format is that it is machine readable. I would expect it to be a LOT easier than most web scraping.

    It looks to me like a relatively easy problem to solve with a mixture of crawling the CH urls for the companies you want to monitor and lxml parsing of the documents. The devil is in the details though - what info, how much info, whether you need adjusted numbers calculated etc.

    From past experience (many years ago) there used to be pretty good (and expensive) global databases of listed company accounts (with adjustment number for comparison across countries), but I doubt that is what you want.
     
    Upvote 0
    The point about iXBRL is that it uses xHTML which is both XML and HTML. Hence an HTML parser can work out how to display it and an XML parser can get out the key information in the relevant taxonomy.

    The files don't tend to be that big so they don't need a streaming parser and in java a standard DOM parser would be able to handle it and extract the key information to put into a database.

    The only complex question arises where there might be one to many relationships in the iXBRL tags, but it may be that this information can be ignored.
     
    • Like
    Reactions: Clinton
    Upvote 0
    If someone wants to look at this. This link
    https://beta.companieshouse.gov.uk/...MWFkaXF6a2N4/document?format=xhtml&download=1

    Will download an iXBRL file. If you open it with Chrome and then press Control U you will see the tagging

    Eg:

    <p>Registered Number <ix:nonNumeric contextRef="dcur4" name="uk-bus:UKCompaniesHouseRegisteredNumber">00854667</ix:nonNumeric>
    </p>

    This gives a paragraph when printed which gives the company number and also provides a machine readable "uk-bus:UKCompaniesHouseRegisteredNumber" which is the company number. It is the same actual bit sequence of information.
     
    Upvote 0
    L

    Lovetts Solicitors

    On Clinton's other point about being made aware when a company changes, Companies House have just launched a streaming API service. This needs a bit of technical nous but it's essentially a real-time notification feed which you subscribe to, and the moment Companies House make a change you'll get an electronic 'ping' on your feed which you can then use to trigger a process to re-download information for that company.

    Essentially Companies House want to make it easier for people like Clinton to maintain their own database - instead of downloading the full zip file of data every day you would just download the full data once, then use the streaming update service to automatically keep your copy of the data up to date.

    Full details at https://developer-specs.companieshouse.gov.uk/streaming-api/guides/overview

    Andrew
     
    Upvote 0

    gpietersz

    Free Member
  • Business Listing
    Sep 10, 2019
    2,712
    2
    705
    Northwhich, Cheshire
    pietersz.net
    As far as I can see the streaming API cannot be filtered (I only had a quick look) so you end up getting everything and with something like 4 million registered companies.

    If you only want data for particular companies ( @Clinton says "a bunch of companies") you are better off with the REST API which does seem to do everything you need had has generous limits and is easier to implement.
     
    Last edited:
    Upvote 0
    L

    Lovetts Solicitors

    I think the idea with the streaming API is you listen out for a company number you are interested in (as the stream only pings a company number if there has been a change for that company), and then use that as a trigger to pull updated information for that company. Anything you aren't interested in from the stream you just discard.

    But yes, if there's only a small number of companies and you don't need realtime then pulling the data once a day (or even once an hour) should work as well.
     
    Upvote 0

    Tenthnote

    Business Member
    Business Listing
    Hello

    IXBRL data can be complex to work with and while there are software libraries to convert individual IXBRL files, things get trickier when the data processing has to happen for the millions of files available in CH database. You also need an efficient way to query the extracted data.

    I have built a free JSON API that you can hook into to pull company financial information i.e. balance sheet, profit & loss statement and cash flow and all other fields that are sent across to Companies House in the standard IXBRL filing. I have a documentation section on the website describing the requests and responses and there is a web search version available too to test it before you write any code to plug the data into Excel.

    I can't post a link here but if you google the word fastukcompanysearch , and open the site , the CompaniesHouse IXBRL Converter link is in the menu at the top.

    FastUKCompanySearch itself is a related but different Android and iPhone app with due date tracking and filing history download history for UK Companies, again from Companies House.

    Thanks
     
    Upvote 0

    AlanJ1

    Free Member
    Jul 25, 2018
    970
    283
    Hello

    IXBRL data can be complex to work with and while there are software libraries to convert individual IXBRL files, things get trickier when the data processing has to happen for the millions of files available in CH database. You also need an efficient way to query the extracted data.

    I have built a free JSON API that you can hook into to pull company financial information i.e. balance sheet, profit & loss statement and cash flow and all other fields that are sent across to Companies House in the standard IXBRL filing. I have a documentation section on the website describing the requests and responses and there is a web search version available too to test it before you write any code to plug the data into Excel.

    I can't post a link here but if you google the word fastukcompanysearch , and open the site , the CompaniesHouse IXBRL Converter link is in the menu at the top.

    FastUKCompanySearch itself is a related but different Android and iPhone app with due date tracking and filing history download history for UK Companies, again from Companies House.

    Thanks
    This thread is over 4 years old.
     
    Upvote 0

    Tenthnote

    Business Member
    Business Listing
    Yes, bit of an old thread but google is likely to land people here when someone searches for ways to convert Companies House IXBRL to a more user friendly format as there as it is quite a niche topic
     
    Upvote 0

    fisicx

    Moderator
    Sep 12, 2006
    46,668
    8
    15,360
    Aldershot
    www.aerin.co.uk
    Much like XML, iXBRL files should be readable in a browser. It’s a simple tagging language.
     
    • Like
    Reactions: numbersrule
    Upvote 0
    Much like XML, iXBRL files should be readable in a browser. It’s a simple tagging language.
    Yes, thanks fisicx, that seems beneficial, but if it's simple why do so many of the posts in this thread make it sound very difficult to extract data from iXBRL? It seems like the creators have produced a file type that is made up of simple tags that only a select few can actually access.
    Ah, maybe that is the intention?
     
    Upvote 0

    Latest Articles

    Join UK Business Forums for free business advice