Interview with FreySoft Data Scientist and Machine Learning Engineer
Roman Khabun has been working in the information technology industry for 11 years. He started out supporting hardware and software systems, and later worked as a desktop application developer. For the past three years Roman has focused on Data Science and Machine Learning. During his career he has carried out projects in data analytics (collection and cleaning of tabular data), time-series predictive modeling, computer vision, and NLP.
About Data Scientist/ ML engineer role
According to a report from Anaconda, engineers spend nearly half of their working time (45%) cleaning and preparing data – as much as they spend developing solutions. Based on your experience, how much time do you spend on each phase of getting data science outputs into production?
This is the truth 😊 The name of the industry – Data Science – speaks for itself. It involves many stages of working with data, in particular collecting, cleaning, and preparing it for use in Machine Learning models.
In my practice, I also spend up to 40% of my time collecting, cleaning, and preparing data. Typically the stages break down like this: 30% gathering and cleaning the data, 30% creating and training the model, 20% evaluating model metrics and validating, 10% fine-tuning the model for maximum performance, and 10% implementing and deploying it.
Can you name some non-obvious facts about the data scientist profession?
The big data industry is experiencing exponential growth. This means that a data scientist must be cross-functional: collecting and processing data, designing and training machine learning models, and bringing models to production. This extremely wide cross-functionality is, I would say, the non-obvious fact about the profession.
About ML projects/ algorithms/ models
Let’s talk about the projects and applied algorithms you’ve worked on. Currently, what FreySoft project are you in?
Now I am working with a company for which FreySoft is a technology partner. My task is to build a state-of-the-art NLP core that solves intent classification and entity recognition.
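To give a feel for what intent classification means, here is a toy sketch – not the actual FreySoft NLP core, just a minimal bag-of-words nearest-centroid classifier with made-up intents and utterances:

```python
import math
from collections import Counter

# Hypothetical training utterances per intent (illustrative only)
TRAINING = {
    "greeting": ["hello there", "hi how are you", "good morning"],
    "order_status": ["where is my order", "track my package", "order status please"],
    "cancel": ["cancel my order", "i want to cancel", "please cancel the purchase"],
}

def vectorize(text):
    """Bag-of-words vector: word -> count."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# One centroid per intent: summed word counts of its examples
CENTROIDS = {
    intent: sum((vectorize(t) for t in texts), Counter())
    for intent, texts in TRAINING.items()
}

def classify_intent(utterance):
    """Return the intent whose centroid is most similar to the utterance."""
    vec = vectorize(utterance)
    return max(CENTROIDS, key=lambda intent: cosine(vec, CENTROIDS[intent]))

print(classify_intent("can you cancel my order"))  # cancel
```

A production system would use learned embeddings and a Transformer-based model rather than raw word counts, but the task shape – map an utterance to one of a fixed set of intents – is the same.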
Can you describe a previous project when you had to develop a complex algorithm?
Once I worked on a project that employed several predictive models whose outputs were on different scales. The complexity of the algorithm lay in improving the overall performance of the models as an ensemble without losing each model's ability to generalize over its specific input data – that is, without losing the individual results.
Today deep learning is in high demand, though it existed long before. How does it contrast with other machine learning algorithms?
Deep learning works with neural networks and models based on them. The main difference from traditional algorithms is that the results are often difficult to explain. Deep learning algorithms also require special hardware (GPU/TPU) for proper performance.

Deep learning is at the forefront of AI, shaping the tools we use to achieve huge levels of accuracy. Advances in deep learning have driven it to the point where it outperforms humans in some tasks, such as classifying objects in images.

There is now a large body of research around this very question – clarifying how models based on neural networks perform and, especially, why they show the best results. In fact, a lot of work and analysis is being carried out.
Having worked with a number of algorithms, can you pick your favorite one?
Since the topic is very broad, it is difficult to single out one particular algorithm or technology. However, I would like to highlight the Transformer architecture. It was introduced in 2017 and made a breakthrough in many NLP tasks. In addition, as it turned out later, it became useful in computer vision and biological tasks (in particular, when working with the genome).
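The core operation of the Transformer mentioned above is scaled dot-product attention, softmax(QKᵀ/√dₖ)V, from the 2017 "Attention Is All You Need" paper. A minimal NumPy sketch:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V — the Transformer's core operation."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # similarity of each query to each key
    # Numerically stable row-wise softmax over the key dimension
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# Tiny example: 3 "tokens" with 4-dimensional representations
rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 4))
K = rng.standard_normal((3, 4))
V = rng.standard_normal((3, 4))
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (3, 4): each token's output is a weighted mix of values
```

A real Transformer stacks many of these attention layers with multiple heads, learned projections, and feed-forward blocks, but each layer reduces to this computation.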
About developments in the industry
Which data visualization libraries do you use? What are your thoughts on the best data visualization tools recently?
Mostly I use matplotlib, which is de facto one of the most powerful libraries, with a rich history. I would call it a kind of Swiss army knife of data visualization.
Besides that, I often use the seaborn library. I love its heatmap implementation, which is great for rendering a confusion matrix. The confusion matrix, by the way, is a powerful tool for inspecting the performance of machine learning models.
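The confusion-matrix heatmap mentioned above can be sketched in a few lines. This example uses made-up toy labels and builds the matrix with plain NumPy counting (a real project would typically use `sklearn.metrics.confusion_matrix`):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend: render to a file, no display needed
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Toy true/predicted labels for a 3-class problem (illustrative only)
y_true = np.array([0, 0, 1, 1, 2, 2, 2, 1])
y_pred = np.array([0, 1, 1, 1, 2, 0, 2, 1])

# Build the confusion matrix by counting (true, predicted) pairs
n_classes = 3
cm = np.zeros((n_classes, n_classes), dtype=int)
for t, p in zip(y_true, y_pred):
    cm[t, p] += 1

# Render it as a seaborn heatmap with per-cell counts
ax = sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
                 xticklabels=["A", "B", "C"], yticklabels=["A", "B", "C"])
ax.set_xlabel("Predicted")
ax.set_ylabel("True")
plt.savefig("confusion_matrix.png")
```

Correct predictions accumulate on the diagonal, so misclassification patterns (which class is confused with which) are visible at a glance.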
If we talk about off-the-shelf solutions in the field of data visualization, we should admit that they are experiencing explosive development now. Here I can highlight Tableau, Microsoft Power BI, and Google Data Studio. Each of these tools is good enough to serve as a strong foundation for data visualization.
If you’ve worked with external data sources, you likely have a few favorite APIs. Which ones do you enjoy exploring?
Since 99% of my daily work is related to NLP, I have a great need for quality text data. I usually use datasets from competitions on kaggle.com.
Also, in order to get large and structured sets of texts (text corpora), I create small tools for collecting text from websites (web scraping). Here I would mention one interesting source of text data – Twitter. A long time ago I got a Twitter developer account, which gives access to the API. In my opinion, it is still a very powerful source of natural-language text data.
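A small text-collection tool of the kind described above can be built with nothing but the standard library. This sketch extracts visible text from a static HTML string (a real scraper would fetch pages over HTTP and respect robots.txt and the site's terms of service):

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Collects visible text, skipping <script> and <style> blocks."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text outside script/style blocks
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())

# A static page stands in for a downloaded one, keeping the example offline
page = """<html><head><style>body {color: red}</style></head>
<body><h1>Corpus source</h1><p>First sentence.</p>
<script>var x = 1;</script><p>Second sentence.</p></body></html>"""

collector = TextCollector()
collector.feed(page)
corpus = " ".join(collector.chunks)
print(corpus)  # Corpus source First sentence. Second sentence.
```

Libraries like BeautifulSoup or lxml make this more robust against malformed markup, but the principle – strip tags, keep the text nodes – is the same.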
Are you acquainted with GPT-3, the language generation model developed by OpenAI? It was seen as exciting because, with very little change in architecture, it brought a ton more data and other benefits. Still, there are many perspectives on GPT-3 across the Internet. What are your thoughts on GPT-3 and OpenAI’s model?
It was very interesting for me to observe the results of GPT-2. Then, quite recently, in 2020, GPT-3 arrived with an even larger network and new capabilities. I thought I would like to participate in a project where such technologies need to be applied – I guess it would be very exciting.
In addition, I want to point out that thanks to technologies like GPT from OpenAI, the data science / ML industry will not get boring for a long time.
How do you keep informed of developments in machine learning? What are the last ML papers you’ve read?
The resources I mostly follow are towardsdatascience.com, medium.com, and ai.googleblog.com. From my recent reading, I would note an article about the promising SMITH technology for working with large documents.
Do you train models for fun, and what GPU/hardware do you use for that?
It was a couple of computer vision models that I trained on a home PC with a GTX 1050 GPU. I was sort of trying to build a robotic vacuum cleaner with vision – nothing serious 😊
What is one key trend or development that’s happening right now that you think is playing a key role in AI’s growth?
The key trend is the worldwide growth in the amount of data every year. And this growth will only accelerate. This means the “fuel” for the development of AI will not run out in the near future. You need to store data, look for errors in it, choose what is useful and, on that basis, draw conclusions that benefit your business.
In this context, can you give business owners and entrepreneurs three recommendations for making the most of their valuable data?
Data can be both a boon and a weapon. That is why these would be my tips:
- Come up with hypotheses and, based on data, test them. Look for patterns and insights.
- Make decisions based on data. Be data-driven. Then you will be first.
- Store your data in a safe place and trust it only to professionals.