In mathematics, graph theory studies the relationships between objects in a given set. A graph represents the objects as vertices and the relationships between them as edges, and these connections can be used to investigate large databases in a visual and interactive way. This makes it possible to establish links between companies, politicians, payments and public works, among other data, and to explore and visualize the resulting web.
Graph theory is already used in computing in many situations, for example to find out which direct relationships an object has, which paths are possible between two or more objects, or which object is the most connected (the one with the most relationships).
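Two of these questions can be illustrated with a minimal sketch. It assumes a toy undirected graph held in a plain Python dictionary; the names and the helper functions `build_graph`, `direct_relationships` and `most_connected` are invented for illustration and are not taken from CruzaGrafos.

```python
from collections import defaultdict

# Hypothetical toy data: each pair is an undirected relationship (edge)
# between two objects (vertices). All names are invented.
edges = [
    ("Person A", "Company X"),
    ("Person B", "Company X"),
    ("Person B", "Company Y"),
    ("Person C", "Company Y"),
    ("Person B", "Company Z"),
]

def build_graph(pairs):
    """Build an undirected adjacency map: vertex -> set of neighbors."""
    graph = defaultdict(set)
    for a, b in pairs:
        graph[a].add(b)
        graph[b].add(a)
    return graph

def direct_relationships(graph, vertex):
    """Return the vertices directly connected to `vertex`, sorted by name."""
    return sorted(graph.get(vertex, set()))

def most_connected(graph):
    """Return the vertex with the most relationships (ties broken by name)."""
    return max(graph, key=lambda v: (len(graph[v]), v))

graph = build_graph(edges)
print(direct_relationships(graph, "Company X"))  # ['Person A', 'Person B']
print(most_connected(graph))                     # Person B
```

A dedicated graph database answers the same questions at a much larger scale, but the underlying idea is the same: store who is linked to whom, then count or follow the links.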
At the moment, for example, CruzaGrafos already has 29.4 million records, covering about 20 million people and 9 million companies. In the graphs we can see the proximity and business partnership relationships across all this information.
Graphs gained a lot of visibility with the Panama Papers, an award-winning 2016 investigation led by the International Consortium of Investigative Journalists. Among the technologies used were the graph database management system Neo4j and the Linkurious platform for data search and visualization.
Cleaning and analysis work
The CruzaGrafos project team carried out exploratory data analysis on databases of public interest in Brazil. The work was based on research in Brazilian open data portals, conversations with experts and a study of the information left by journalist Claudio Weber Abramo.
We classified this data by its characteristics, rows and columns, and studied the feasibility of using it with the Metabase software. We also analysed public data portals and APIs that could make it easier to update the project's information.
Presentation and examples of using CruzaGrafos at the 2nd Abraji Data Sunday, in September 2020.
These studies were important for determining, among other factors, which databases would need cleaning, which fields hold names of people or companies, which hold IDs, which can be used as keys to cross different databases, and which databases are of real public interest or can be explored in later phases of the project.
Currently, CruzaGrafos has electoral candidacy data collected from the Superior Electoral Court (TSE, in Portuguese), with general information such as election year, position, full name, ballot name, sequential number in the election, political party, electoral unit, federative unit and the candidate's full CPF (the ID used to identify people in Brazil).
It also has data from the Federal Revenue of Brazil (Receita Federal, in Portuguese) on companies and their QSA (corporate structure, with the names of partners and administrators), with information such as trade name, corporate name, full names of partners, CNPJ (the ID used to identify companies in Brazil) and the "masked" CPF of the partners. The Federal Revenue and other public institutions do not publish all 11 digits of the CPF; they replace some of them with asterisks, as in this example: ***.270.068-**.
The platform's connections and graphs are then produced by crossing the main identification keys: in this case, CPF, CNPJ and full name.
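As an illustration of crossing these keys, here is a minimal sketch of one way to match a candidate (full CPF) with a company partner (masked CPF). It assumes the mask always exposes the six middle digits, as in the example above, and that names are compared after removing accents and extra spaces. The functions and records are hypothetical and do not describe the project's actual matching rules.

```python
import re
import unicodedata

def normalize_name(name):
    """Uppercase, strip accents and collapse whitespace so names compare equal."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(without_accents.upper().split())

def middle_digits(cpf):
    """Return the 6 characters a masked CPF exposes (positions 4 to 9 of 11)."""
    cleaned = re.sub(r"[^0-9*]", "", cpf)  # drop dots, dashes and spaces
    if len(cleaned) != 11:
        raise ValueError(f"unexpected CPF length: {cpf!r}")
    return cleaned[3:9]

def same_person(candidate, partner):
    """Match only when both the visible CPF digits and the full name agree."""
    return (
        middle_digits(candidate["cpf"]) == middle_digits(partner["cpf"])
        and normalize_name(candidate["name"]) == normalize_name(partner["name"])
    )

# Invented records; the CPF below is not a real document number.
candidate = {"name": "José da Silva", "cpf": "123.270.068-45"}
partner = {"name": "JOSE DA  SILVA", "cpf": "***.270.068-**"}
print(same_person(candidate, partner))  # True
```

Requiring both keys at once reduces false matches, but six digits plus a common name can still collide, which is why every match should be verified by other means.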
These databases are periodically updated by public agencies and will also be updated in CruzaGrafos. Other databases of public and journalistic interest will be added to the project in the coming months.
In mid-2020, the project also invited over 40 journalists from Brazil and Latin America, with or without experience in data analysis, to test the platform. Adjustments and improvements were made based on their tests and criticism.
In October, 80 journalists who took the course “Journalism, Covid-19 and Corruption”, taught by Abraji and Transparency International Brazil with support from the Konrad Adenauer Foundation, also accessed CruzaGrafos. They were able to learn practical techniques and share their opinions.
“There are no perfect databases, and it is rare to find a complete source of information about a person or company of public interest. So what we have built with CruzaGrafos offers a great opportunity to find relevant leads and information, which must be complemented by further checks and verification”, says Reinaldo Chaves, Abraji's projects coordinator.
Check out what you can do
- Search for all companies in which a politician or candidate for public office is a partner or administrator
- See who the other partners in these companies are
- Check the proximity network of these partners as well, that is, which other companies they are partners in and who the respective partners are, at different degrees of proximity
- Find out whether a path connects one person/company to another person/company and, if so, how short it is
- With a list of relatives or advisers of politicians or people of public interest in hand, find out whether they own companies (a tactic that could be used to conceal assets, for example)
- Check whether a politician or election candidate has companies in economic sectors that may conflict with their public office
- Find out whether a politician/candidate or person of public interest has several companies in their name in the same industry and/or with similar names (a tactic that could be used for money laundering, for example)
With this type of information, you can continue to investigate, for example:
- Collect additional information from the Federal Revenue using the CNPJ number, such as the company's address and the value of its share capital. This information can reveal discrepancies, for example when a company with low share capital wins a large bid. In Google Street View you can also look for a recent image of the company to see its facade and neighborhood
- Check in the courts or in the Publique-se project, also developed by Abraji, whether there are legal proceedings that cite the people or companies
- Check if the companies of interest have bids, agreements and contracts listed on the federal, state or municipal transparency portals
- Search property registries to find out whether the people found own properties, and where
- Find out whether the companies of interest have active debts or appear in registers of environmental or labor rights penalties
Image showing the connection between a recently convicted former federal deputy, Anibal Ferreira Gomes (CE), and his former parliamentary advisor through various companies.
Abraji has also made the content of the course “Journalism, Covid-19 and Corruption” available online free of charge. It shows all these techniques and others for using CruzaGrafos (in this link). The description of each video also links to a folder with further reading and step-by-step guides that study recent cases under investigation in Brazil involving public figures and companies. Here and here are two direct links to case studies.
“There is no merit assessment of the content of the databases by CruzaGrafos, Abraji, Brasil.IO or the professionals involved in this project. Evidence of illicit conduct should be checked with sources and further information, and the fact that someone is being investigated does not mean that she or he is guilty. All data must be checked, including with the politicians and companies cited. One should always be careful with people and companies that share the same name”, states Chaves.
Innovations to work
Many problems arose during the development of the project. Cataloging, cleaning and publishing large public databases, which in Brazil are often scattered and published in formats that are difficult to analyze or with huge amounts of information, required a lot of effort and new solutions.
One of them was an entity centralizer, which allows searching for names, companies, municipalities, hospitals, contracts, etc., and returns a universally unique identifier (UUID) for each entity. Entities can be companies, people, candidacies, etc. Without a UUID, problems arise such as the need to filter on several fields at the same time (and these change from dataset to dataset), difficulty searching across more than one dataset, and difficulty generating IDs offline for external queries, among others.
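One common way to obtain identifiers that can be generated offline is a name-based UUID (version 5), which is derived deterministically from a namespace and a key. The sketch below assumes this approach. The namespace URL, the `entity_uuid` helper and the CNPJ are invented for illustration, and the source does not state how the centralizer actually builds its UUIDs.

```python
import uuid

# Hypothetical namespace for this sketch; any fixed UUID would work.
ENTITY_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.org/entities")

def entity_uuid(entity_type, key):
    """Derive a stable UUID from an entity type and its natural key."""
    normalized = f"{entity_type.strip().lower()}:{key.strip().upper()}"
    return uuid.uuid5(ENTITY_NAMESPACE, normalized)

# The same company always maps to the same identifier, in any dataset
# and without querying a central server. The CNPJ below is invented.
a = entity_uuid("company", "12345678000195")
b = entity_uuid("Company ", " 12345678000195")
print(a == b)                                        # True
print(a == entity_uuid("person", "12345678000195"))  # False
```

Because the identifier depends only on the type and the key, two datasets that mention the same company can be joined on a single field instead of several dataset-specific ones.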
The graph backend is the "heart" of the system: it connects to the entity centralizer to run searches and manages queries to the graph database, the API, etc. The tool itself is the "glue" that holds everything together and is the most project-specific part. It comprises the integration with the authentication of the Abraji membership system, the scripts that feed the two systems above and the interface that the user accesses.
Regarding data processing, it was also necessary to create solutions, some already in place and others still being finalized, for datasets such as the partners of Brazilian companies, Brazilian CNPJs (the unique company identification code in Brazil), corporate activities by CNPJ, political candidacies, political donations and health contracts, among other main databases selected for the launch and for the coming months.
The Expanded neighboring nodes and Expanded neighboring nodes up to 2 degrees features have been implemented. They let you quickly expand the visualization of connection graphs between people and companies, revealing the nearby degrees of connection.
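Expanding neighbors up to a given degree can be sketched as a breadth-first search that stops at a maximum depth. The example below is a minimal illustration on an in-memory dictionary with invented nodes; the `expand_neighbors` function is hypothetical and is not the platform's implementation, which queries a graph database.

```python
from collections import deque

def expand_neighbors(graph, start, max_degree=2):
    """Return {node: degree} for every node within `max_degree` hops of `start`."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distances[node] == max_degree:
            continue  # do not expand past the requested degree
        for neighbor in graph.get(node, ()):
            if neighbor not in distances:
                distances[neighbor] = distances[node] + 1
                queue.append(neighbor)
    del distances[start]  # the starting node is not its own neighbor
    return distances

# Invented example: a politician, companies and their partners.
graph = {
    "Politician": ["Company X"],
    "Company X": ["Politician", "Partner B"],
    "Partner B": ["Company X", "Company Y"],
    "Company Y": ["Partner B", "Partner C"],
    "Partner C": ["Company Y"],
}
print(expand_neighbors(graph, "Politician", max_degree=1))  # {'Company X': 1}
print(expand_neighbors(graph, "Politician", max_degree=2))  # {'Company X': 1, 'Partner B': 2}
```

Limiting the depth keeps the visualization readable, since the number of nodes tends to grow quickly with each additional degree.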
The Save Graph functionality was also built. It will be very useful during tests and lets users save an investigation and return to it later. Those who access the tool can also click on Export CSV to convert the graph shown on the screen into spreadsheet format.
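A graph can be flattened into a spreadsheet by writing one row per connection. The sketch below assumes a three-column layout (source, relationship, target); the column names, the `graph_to_csv` function and the data are assumptions for illustration, and the actual layout of the platform's export is not described here.

```python
import csv
import io

def graph_to_csv(edges):
    """Serialize edges (source, relationship, target) as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["source", "relationship", "target"])  # header row
    writer.writerows(edges)
    return buffer.getvalue()

# Invented connections between people and a company.
edges = [
    ("Person A", "partner", "Company X"),
    ("Person B", "administrator", "Company X"),
]
print(graph_to_csv(edges))
```

An edge list like this opens directly in any spreadsheet program and can be filtered or sorted by person, company or type of relationship.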
We have also built a solution for the “path between objects”, which calculates the shortest path between two people/companies and shows it in the graph.
After internal tests, we also added a feature that was not initially planned but will help a lot with usability: browsing the history of the objects (people and companies) searched.
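In an unweighted graph, the shortest path between two nodes can be found with a breadth-first search. The sketch below shows that general technique on an invented in-memory graph; the `shortest_path` function is hypothetical, and the platform may rely on its graph database for this calculation instead.

```python
from collections import deque

def shortest_path(graph, source, target):
    """Return the shortest list of nodes from source to target, or None."""
    if source == target:
        return [source]
    previous = {source: None}  # node -> node it was reached from
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor in previous:
                continue  # already visited
            previous[neighbor] = node
            if neighbor == target:
                # Walk back from the target to the source, then reverse.
                path = [target]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(neighbor)
    return None  # no path connects the two nodes

# Invented example, including one company with no connections.
graph = {
    "Politician": ["Company X"],
    "Company X": ["Politician", "Advisor"],
    "Advisor": ["Company X", "Company Y"],
    "Company Y": ["Advisor"],
    "Company Z": [],
}
print(shortest_path(graph, "Politician", "Company Y"))
# ['Politician', 'Company X', 'Advisor', 'Company Y']
print(shortest_path(graph, "Politician", "Company Z"))  # None
```

A result of `None` answers the question of whether a path exists at all, while the length of the returned list shows how close the two people or companies are.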
Next steps and opening for non-members
Updates to the databases of the Superior Electoral Court (TSE) and the Federal Revenue of Brazil (Receita Federal) will be published periodically on the platform, including data from the 2020 elections. New databases will also be added, and users will be informed on the platform itself and through the project's communications.
All the steps described in obtaining the data, checks and the source code of the platform will also be made available soon on GitHub. If you have identified an error or have any suggestions, we kindly ask you to contact us at [email protected]
CruzaGrafos can also be accessed by non-members of Abraji from November 12th, 2020. To do so, they must register on the project's website from that day on. Access will only be granted to people or institutions without professional ties to political parties, to any body of the Executive, Legislative or Judiciary branches, or to entities that promote business lobbying. They will receive free access for 30 days and will then have to pay a subscription fee of R$ 30 per month (US$ 5.2). Companies wishing to subscribe to the service for groups of employees should contact Abraji and request a quote.
Written by Reinaldo Chaves