Having spent several years working as a Data Scientist and implementing AI/Machine Learning solutions, I often get questions about how to get started learning the skills needed to work with data. My immediate answer, and one I have used several times, is to take advantage of learning opportunities offered on the internet: YouTube and university websites for the technical and foundational concepts, and learning platforms for hands-on applications of those concepts. Thinking about this a little more, though, while these learning opportunities provide a basic understanding, the key to grasping these concepts is to apply them in situations where it is necessary to really understand the data. I encourage anyone who is working on a data-related problem (whether Machine Learning or general data analysis) to periodically take a step back and think about whether the following suggestions might apply:

Don’t immediately reach for the latest, greatest technology.

Starting low-tech forces you to understand the problem and the data you have, while high-tech solutions are built to take control, giving the user very little insight into what is happening behind the scenes. Sure, you could give ChatGPT a dataset and ask it to find correlations for you, but it won’t know to ask for additional information that could be useful for solving the problem. Let’s consider a thought exercise related to this news story from 2012: “How Target Figured Out a Teen Girl Was Pregnant Before Her Father Did.”

A Target statistician noticed that an increase in purchases of unscented lotion and vitamins correlated with a high likelihood that the customer was pregnant. How did they get the data to come to this conclusion? They understood what data they had and used it effectively. Think about this: anything you purchase at a store using a specific credit card can be stored and analyzed for your personalized buying trends. If you sign up for a shopping rewards/company loyalty program or a store credit card, or even fill out the survey on your receipt, you provide extra identifying information (name, age, address, phone number, e-mail, etc.) to associate with your buying habits. If you sign up for an additional service such as a baby or wedding registry, that provides the company with even more information. This is on top of the vast amounts of information that can be purchased from data brokers. With an understanding of the goal and of what data is available, that information can be aggregated and correlated to create powerful insights without ever needing to use machine learning. (Take that, ChatGPT!!)
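To make the "aggregate and correlate" idea concrete, here is a minimal Python sketch that uses nothing beyond basic counting. Everything in it is an assumption made for illustration: the purchase log, the customer IDs, the registry list, and the helper names `basket_by_customer`, `purchase_rate`, and `lift` are all invented, and this is not a description of how Target actually did it.

```python
from collections import defaultdict

# Hypothetical purchase log: (customer_id, product). All values are invented.
purchases = [
    ("c1", "unscented lotion"), ("c1", "vitamins"), ("c1", "bread"),
    ("c2", "unscented lotion"), ("c2", "vitamins"),
    ("c3", "bread"), ("c3", "soda"),
    ("c4", "soda"), ("c4", "vitamins"),
    ("c5", "bread"),
    ("c6", "unscented lotion"), ("c6", "soda"),
]

# Hypothetical second data source: customers who opened a baby registry.
registry_customers = {"c1", "c2", "c6"}


def basket_by_customer(rows):
    """Aggregate the purchase log into one set of products per customer."""
    baskets = defaultdict(set)
    for customer_id, product in rows:
        baskets[customer_id].add(product)
    return dict(baskets)


def purchase_rate(baskets, customers, product):
    """Fraction of the given customers whose basket contains the product."""
    customers = list(customers)
    if not customers:
        return 0.0
    hits = sum(1 for c in customers if product in baskets.get(c, set()))
    return hits / len(customers)


def lift(baskets, group, product):
    """How much more often the group buys the product than everyone else."""
    others = set(baskets) - set(group)
    rate_in = purchase_rate(baskets, group, product)
    rate_out = purchase_rate(baskets, others, product)
    return float("inf") if rate_out == 0 else rate_in / rate_out


baskets = basket_by_customer(purchases)
for product in ("unscented lotion", "vitamins", "bread"):
    rate = purchase_rate(baskets, registry_customers, product)
    ratio = lift(baskets, registry_customers, product)
    print(f"{product}: registry rate {rate:.2f}, lift {ratio:.2f}")
```

A lift well above 1 flags a product that the registry group buys unusually often. With six invented customers this proves nothing, but the same counting applied to real, joined data sources is the kind of low-tech analysis described above.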

Using a subset of a dataset may be more powerful than using all of the data.

It’s natural to think that the more data you have to work with, the better. However, what really matters is the type of data we’re talking about. If you’re trying to create an image recognition model that can tell the difference between a picture of a kangaroo and a picture of a wallaby, then you don’t need pictures of flamingos in the mix – that will only confuse the algorithm! Narrowing the data to only kangaroo and wallaby pictures allows the algorithm to focus on and learn the essential differences between the two types of animals.
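In code, narrowing a dataset is often just a filter applied before training. The sketch below assumes a hypothetical list of `(file name, label)` records; the file names, the `TARGET_LABELS` set, and the `select_relevant` helper are invented for illustration.

```python
# Hypothetical labeled image records: (file name, label). Names are invented.
dataset = [
    ("img_001.jpg", "kangaroo"),
    ("img_002.jpg", "flamingo"),
    ("img_003.jpg", "wallaby"),
    ("img_004.jpg", "kangaroo"),
    ("img_005.jpg", "flamingo"),
    ("img_006.jpg", "wallaby"),
]

# The only classes the model is meant to tell apart.
TARGET_LABELS = {"kangaroo", "wallaby"}


def select_relevant(records, labels):
    """Keep only the records whose label belongs to the task at hand."""
    return [(name, label) for name, label in records if label in labels]


training_set = select_relevant(dataset, TARGET_LABELS)
print(len(dataset), "records before filtering,", len(training_set), "after")
```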

Focusing solely on your goal can make you lose sight of the bigger picture.

Besides being good overall life advice, remembering to look at the big picture can open you up to alternate possibilities. Be open-minded about your data. If you can’t use it to solve the problem you’re looking to solve, what other insights can it provide? Can the data be modified or combined with alternate data sources to provide a different perspective? In 2017, I entered a “Hacking the Home” competition that provided participants with free smart home devices in exchange for demonstrations of a useful home solution, security weakness, or vulnerability.

With only a little experience in Embedded Development/Reverse Engineering, I racked my brain to come up with an idea for my chosen device, the Google Home. Being drawn to the data side of things, I decided to research the data being processed by the device. This led me to a treasure trove of data (see Google Takeout), including latitude/longitude coordinates for my phone that were being collected every few seconds.

Intrigued, I built scripts to analyze and interpret the data, realizing that someone with access to the data could very easily infer where I live, where I work, and where I would be on a given day at a given time. (Note that this was a bit before people realized how much information the social media giants were collecting and using for their own internal purposes!) Being able to come up with an idea that fit the spirit of the competition (data – not device – hacking exposing valid privacy concerns) taught me that there are often alternate ways to approach a problem. The presentation I gave on this work resulted in an Honorable Mention in the competition, several invitations to present my work (including at a major hacker conference), and the confidence to say ‘yes’ to challenges outside my comfort zone!
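The original scripts are not shown here, but the kind of inference described can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the author's code: the samples, the `(timestamp, latitude, longitude)` layout, and the `most_common_place` helper are invented and do not reflect any real export format.

```python
from collections import Counter
from datetime import datetime

# Hypothetical location samples: (ISO timestamp, latitude, longitude).
# Coordinates are invented and do not reflect any real export format.
samples = [
    ("2017-03-06T02:00:00", 39.10012, -76.80021),
    ("2017-03-06T03:30:00", 39.10008, -76.80018),
    ("2017-03-06T10:00:00", 39.20021, -76.70044),
    ("2017-03-06T14:00:00", 39.20019, -76.70041),
    ("2017-03-07T01:15:00", 39.10011, -76.80019),
    ("2017-03-07T11:00:00", 39.20020, -76.70040),
    ("2017-03-07T19:00:00", 39.15000, -76.75000),
]


def most_common_place(records, hours, precision=3):
    """Return the most frequent rounded (lat, lon) seen during the given hours."""
    counts = Counter()
    for timestamp, lat, lon in records:
        if datetime.fromisoformat(timestamp).hour in hours:
            # Rounding groups nearby readings into one coarse location bucket.
            counts[(round(lat, precision), round(lon, precision))] += 1
    return counts.most_common(1)[0][0] if counts else None


# Assumption: overnight samples suggest home, daytime office hours suggest work.
likely_home = most_common_place(samples, hours=range(0, 6))
likely_work = most_common_place(samples, hours=range(9, 18))
print("likely home:", likely_home)
print("likely work:", likely_work)
```

Rounding to three decimal places groups readings within roughly a hundred meters of each other, which is coarse enough to absorb GPS jitter and still fine enough to single out a building.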

Balance is good.

Also good life advice, but extremely important when it comes to data. It’s important to remember that algorithms are lazy. Let’s say you are trying to build a machine learning model to identify spam e-mails, and the training data consists of 95% spam e-mails and 5% non-spam e-mails. The algorithm will easily figure out that if it always guesses spam, then it will be right 95% of the time (not what the developer had in mind, I’m sure).
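The arithmetic behind this is easy to check. The sketch below uses invented labels that match the 95/5 split, scores an "always guess spam" strategy, and then shows one common way to rebalance (randomly downsampling the majority class). The helper names and the `not_spam` label are assumptions for illustration, and downsampling is only one of several possible remedies.

```python
import random

# Hypothetical training labels matching the 95/5 split described above.
labels = ["spam"] * 95 + ["not_spam"] * 5


def accuracy(predictions, truth):
    """Share of predictions that match the true label."""
    return sum(p == t for p, t in zip(predictions, truth)) / len(truth)


def recall(predictions, truth, label):
    """Share of the true `label` items that were predicted as `label`."""
    relevant = [p for p, t in zip(predictions, truth) if t == label]
    return sum(p == label for p in relevant) / len(relevant)


def downsample(truth, seed=0):
    """Randomly drop majority-class items until both classes are equal in size.

    Real code would sample whole records; labels alone keep the sketch short.
    """
    rng = random.Random(seed)
    spam = [t for t in truth if t == "spam"]
    not_spam = [t for t in truth if t == "not_spam"]
    size = min(len(spam), len(not_spam))
    return rng.sample(spam, size) + rng.sample(not_spam, size)


# The "lazy" strategy: predict spam for every e-mail.
always_spam = ["spam"] * len(labels)
print("accuracy:", accuracy(always_spam, labels))
print("non-spam recall:", recall(always_spam, labels, "not_spam"))

# On a balanced sample the same strategy is no better than a coin flip.
balanced = downsample(labels)
print("accuracy on balanced data:", accuracy(["spam"] * len(balanced), balanced))
```

The lazy strategy scores 95% accuracy on the original labels while catching none of the legitimate e-mails, which is why accuracy alone can be misleading on skewed data.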

Alternatively, a balanced, 50/50 split in the training data forces the algorithm to dive into the characteristics of each e-mail to determine its level of spam-iness. This example shows the importance of understanding your data well enough to determine if you are unwittingly weakening the resulting model by providing it with biased data.

I hope these examples provide an understanding of the thinking involved in working with data – whether you’re a newbie Data Scientist, curious about how Machine Learning works, or just inspired to play with data!

To learn more about Praxis Engineering Technologies, please visit www.praxiseng.com.

SPONSORED CONTENT: This content is written by or on behalf of our Sponsor.
