5 Tips for Working with Raw Data: Data Pre-Analysis

There is nothing more discouraging than starting an analysis and coming across a set of data that is not organized. It could be files with many spreadsheets and columns without a well-defined pattern, databases with a large number of tables with indecipherable naming conventions, or records themselves without a standardized structure.

No matter the form of disorganization, looking at raw data without good formatting always presents a unique challenge to decipher, organize, and extract what is most useful to bring forth answers to the case.

Or, it may be that the data is not even unstructured, but the sheer volume of files, spreadsheets, and tables is so large that identifying what will truly be useful becomes a challenge of equal proportion.

Unfortunately, we still don’t have a magic tool, even in the age of artificial intelligence, that can decipher the structure of data and provide a map of the most useful pieces for the cross-references we need to make.

The encouraging part is that, unless the data is truly unreadable, there is always a way to identify the location of the most important information for the investigation and a way to format and transform it into analysis diagrams that bring answers to the investigative questions.

The truth is, the world is not usually complete chaos, and data is not always entirely unstructured. We may be talking about data that only needs small adjustments or needs to be viewed in the right way to be perfectly cross-referenced.

Therefore, in this article, I bring some tips that can help structure the data to be analyzed, whether it is very disorganized or only slightly so.

  1. Keep the main questions in mind

Knowing clearly in advance what you are looking for is a key factor in helping organize the data. When you have a clear question in mind, it becomes easier to look at raw data and find the location of the information that will be used in the analysis.

For example, if I am looking for relationships between people and companies in the database of a large tax system, focusing on the tables related to taxpayer registration, partners, accountants, and so on helps filter out the tables that are unrelated to the main theme of the analysis.

Therefore, always take a few minutes before starting to explore the files to study the problem and develop the main questions that need to be answered.
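As a rough illustration of how the main questions can narrow down a large database, the sketch below shortlists tables whose names or column names mention keywords taken from the question. The table names, column names, and the `shortlist_tables` helper are all hypothetical, and keyword matching only helps when the naming is at least partly readable.

```python
def shortlist_tables(schema, keywords):
    """Return the tables whose name or columns mention any keyword.

    schema maps a table name to its list of column names.
    """
    keywords = [k.lower() for k in keywords]
    hits = {}
    for table, columns in schema.items():
        names = [table.lower()] + [c.lower() for c in columns]
        matched = sorted({k for k in keywords if any(k in n for n in names)})
        if matched:
            hits[table] = matched
    return hits


# Hypothetical schema, invented for this example only.
schema = {
    "TB_TAXPAYER": ["taxpayer_id", "name"],
    "TB_PARTNER": ["taxpayer_id", "partner_name"],
    "TB_AUDIT_LOG": ["event_id", "ts"],
}

# Keywords come straight from the question being asked.
candidates = shortlist_tables(schema, ["taxpayer", "partner", "accountant"])
for table, matched in candidates.items():
    print(table, matched)
```

Tables that match nothing are not necessarily useless, but they can wait until a later question calls for them.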

  2. Mentally outline the relationship structure of the analysis

I always say in my training sessions that we need to develop the ability to look at a spreadsheet and see a link analysis diagram in it.

What does this mean in practice?

Imagine you are facing a complex spreadsheet with data from a phone bill containing dozens of columns with information on phones, IMEIs, phone calls, cell towers (ERBs), addresses, and account holders, among others.

This spreadsheet needs to be transformed into an analyzable diagram. For that, we need to focus on what objects can be created from each record in the spreadsheet, how they relate to each other, and which information goes where.

Which phone is the source phone? Which phone is the destination phone? What are the details of the phone call? Which antennas were used by which phone? What is the address of each antenna? Who are the account holders for the phone lines? Which columns are attributes of which entities and links?

Being able to look at a record and see entities, links, and attributes helps a lot in the process of organizing data sources.
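To make the idea of seeing entities, links, and attributes in a record more concrete, here is a minimal sketch that splits one call record into those three parts. The column names and the `record_to_graph` helper are assumptions made for illustration; a real phone bill will use its own names and far more columns.

```python
def record_to_graph(record):
    """Split one spreadsheet row into entities and links.

    Entities are keyed by (kind, identifier) and carry their attributes.
    Links are (source, target, kind, attributes) tuples.
    """
    src = ("phone", record["source_phone"])
    dst = ("phone", record["destination_phone"])
    holder = ("person", record["account_holder"])
    antenna = ("antenna", record["antenna_id"])

    entities = {
        src: {},
        dst: {},
        holder: {},
        # The address describes the antenna, so it is an entity attribute.
        antenna: {"address": record["antenna_address"]},
    }
    links = [
        # Start time and duration describe the call, so they sit on the link.
        (src, dst, "call", {"start": record["call_start"],
                            "duration_s": record["duration_s"]}),
        (holder, src, "holds_line", {}),
        (src, antenna, "used_antenna", {}),
    ]
    return entities, links


# Hypothetical record, invented for this example only.
row = {
    "source_phone": "41999887766",
    "destination_phone": "41988776655",
    "account_holder": "Holder A",
    "antenna_id": "ANT-001",
    "antenna_address": "Example Street, 100",
    "call_start": "2024-03-05 14:30:00",
    "duration_s": 42,
}
entities, links = record_to_graph(row)
```

Deciding whether a column belongs to an entity or to a link is the same judgment described above, only written down explicitly.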

  3. Don’t get distracted by data that is not part of the context

After identifying the objects, which ones are really important for the questions I want to answer?

At this point, the important thing is to be able to ignore (at least initially) any data that is not related to the main question or to the objects that will be analyzed.

Resist the impulse to analyze everything at once!

In the previous example, imagine the question is “discover who the target phone communicates with most frequently.” To answer this, it is only necessary to analyze the phone calls and the account holders of the phone lines. Additional information such as IMEI, cell tower (ERB), and address will not be needed at this stage.

Therefore, simply ignore them and continue with your analysis until you reach a question where they might be useful.

The problem here is that an excess of information can create more visual confusion than help in reaching the answers.
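The “most frequent contact” question can be sketched in a few lines that read only the two phone columns and skip everything else in the record. The column names, the sample calls, and the `top_contacts` helper are hypothetical.

```python
from collections import Counter


def top_contacts(calls, target, n=3):
    """Count how often each phone appears on the other side of the target's calls."""
    counts = Counter()
    for call in calls:
        src = call["source_phone"]
        dst = call["destination_phone"]
        if src == target:
            counts[dst] += 1
        elif dst == target:
            counts[src] += 1
    return counts.most_common(n)


# Hypothetical calls; the IMEI column is present but never read.
calls = [
    {"source_phone": "41999887766", "destination_phone": "41988776655", "imei": "x"},
    {"source_phone": "41988776655", "destination_phone": "41999887766", "imei": "y"},
    {"source_phone": "41999887766", "destination_phone": "41977665544", "imei": "x"},
    {"source_phone": "41911112222", "destination_phone": "41933334444", "imei": "z"},
]

ranking = top_contacts(calls, "41999887766")
print(ranking)  # [('41988776655', 2), ('41977665544', 1)]
```

The account holders can be joined to this ranking afterwards, once it is clear which lines matter.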

  4. Focus on data patterns

Once the structure is clearer, it’s time to look at the data itself. One of the factors that often causes confusion in analysis is the lack of standardization in writing and formatting.

Going back to the phone data, it is very common for the same phone number to be written in several ways in the same phone bill, as in the following example:

+55 41 99988-7766

55 41 99988-7766

041 99988-7766

41 99988-7766

99988-7766

The same can happen with document numbers, people’s names, bank account numbers, dates, times, or any other information that repeats throughout the records.

Therefore, it is always important to visually inspect the data before transforming it into diagrams, because data written in different formats will end up becoming different objects in the diagrams, which is highly undesirable.
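One way to keep those variants from becoming separate objects is to reduce every value to a single canonical form before building the diagram. The sketch below does this for the phone numbers listed earlier. It assumes Brazilian numbers (country code 55), and the `normalize_br_phone` helper and its `default_area_code` parameter are my own illustration. Filling in a missing area code is a guess that should only be made when the source justifies it.

```python
import re


def normalize_br_phone(raw, default_area_code=None):
    """Return the number as +55<area><number>, or None if it cannot be resolved."""
    # Keep digits only, then drop a leading trunk "0" such as in "041".
    digits = re.sub(r"\D", "", raw).lstrip("0")
    if len(digits) in (12, 13) and digits.startswith("55"):
        return "+" + digits                      # country code already present
    if len(digits) in (10, 11):
        return "+55" + digits                    # area code + number
    if len(digits) in (8, 9) and default_area_code:
        return "+55" + default_area_code + digits  # assumed area code
    return None                                  # leave unresolved for manual review


variants = [
    "+55 41 99988-7766",
    "55 41 99988-7766",
    "041 99988-7766",
    "41 99988-7766",
    "99988-7766",
]

normalized = {normalize_br_phone(v, default_area_code="41") for v in variants}
print(normalized)  # {'+5541999887766'}
```

Without a default area code, the last variant returns `None` instead of being merged on a guess, which is usually the safer choice.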

The simple action of sorting a column alphabetically in Excel, for example, can help identify the pattern (or lack of one) in the data and indicate which columns need special attention when they are used.
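The same kind of inspection can be done in code by reducing each value to its “shape” and counting how many different shapes a column contains. This is a minimal sketch, and the `shape` and `profile` helpers are hypothetical names, not features of any particular tool.

```python
import re
from collections import Counter


def shape(value):
    """Replace every digit with 9 and every ASCII letter with A, keeping punctuation."""
    masked = re.sub(r"\d", "9", value.strip())
    return re.sub(r"[A-Za-z]", "A", masked)


def profile(values):
    """Count how many values share each shape."""
    return Counter(shape(v) for v in values)


column = [
    "+55 41 99988-7766",
    "55 41 99988-7766",
    "041 99988-7766",
    "41 99988-7766",
    "99988-7766",
]

for pattern, count in profile(column).most_common():
    print(count, pattern)
```

A column that produces a single shape is probably safe to use as it is, while a column with many shapes is a candidate for cleaning first.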

Of course, good link analysis tools, such as Caseboard, have the ability to identify similar identities, and this can save a lot of time in the analysis even after the data has been transformed into diagrams.

  5. Pay attention to multiple files

Another point of special attention is when the analysis needs to be done with files from different sources, such as data from phone or bank statements, because they may be formatted differently.

When I say formatted, I am referring not only to the names and order of the columns but also to the formatting of the values in the records, as pointed out in the previous item, and to differing date and time formats, which are very common.

Once, an investigator was analyzing data from two different phone companies, and he was sure there was a correlation between the targets, but his diagrams were creating two separate groups.

Upon examining the data more carefully, a small difference in the way the phone numbers were written was detected. The diagram was creating the “same phones” multiple times because the values used to identify them ended up being different.

That’s why it’s always important to pay attention to how each source uses a standard for its data.
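A simple way to handle this is to describe each source once (its column names and its date format) and convert every row to one common layout before building the diagram. The sketch below assumes two hypothetical carriers with invented column names and Brazilian phone numbers. The `SOURCES` table and the `phone_key` and `harmonize` helpers are illustrative only.

```python
import re
from datetime import datetime

# Hypothetical per-source description: original column -> common column.
SOURCES = {
    "carrier_a": {
        "columns": {"Origem": "source_phone", "Destino": "destination_phone",
                    "Data/Hora": "call_start"},
        "date_format": "%d/%m/%Y %H:%M:%S",
    },
    "carrier_b": {
        "columns": {"caller": "source_phone", "callee": "destination_phone",
                    "start_time": "call_start"},
        "date_format": "%Y-%m-%d %H:%M",
    },
}


def phone_key(raw):
    """Digits only, without a leading trunk 0 or the country code 55 (assumed Brazil)."""
    digits = re.sub(r"\D", "", raw).lstrip("0")
    if len(digits) > 11 and digits.startswith("55"):
        digits = digits[2:]
    return digits


def harmonize(row, source):
    """Rename columns, parse the timestamp, and normalize both phone numbers."""
    spec = SOURCES[source]
    out = {common: row[original] for original, common in spec["columns"].items()}
    out["call_start"] = datetime.strptime(out["call_start"], spec["date_format"])
    for col in ("source_phone", "destination_phone"):
        out[col] = phone_key(out[col])
    return out


row_a = {"Origem": "+55 41 99988-7766", "Destino": "041 98877-6655",
         "Data/Hora": "05/03/2024 14:30:00"}
row_b = {"caller": "41 99988-7766", "callee": "(41) 98877-6655",
         "start_time": "2024-03-05 14:30"}

a = harmonize(row_a, "carrier_a")
b = harmonize(row_b, "carrier_b")
print(a == b)  # True
```

Once both rows share the same keys and the same value formats, the two carriers' records land on the same objects instead of forming separate groups.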

Conclusion

Link analysis comes with a series of challenges, and the first of them may be precisely the raw data. Having a good strategy and adopting best practices for working with the initial data therefore contributes greatly not only to the quality of the diagrams that need to be produced but also to the duration of the entire process, since no time is wasted on information that does not need to be used.