Let’s be honest: being a data scientist is hard. A data scientist is expected to be a capable programmer and to have deep knowledge of statistics and machine learning, the latter being an active research field. This means that the tools data scientists use are constantly evolving. Keeping up to date and making the most of these tools requires time and expertise. Furthermore, one has to adapt existing tools to one's own needs, and often develop new ones from scratch. With our data science breakthroughs, we do this hard work for you, leaving you the most important task: being curious about your enterprise and how you can improve your business. In the following points I will review some of the innovative capabilities of our platform:
Companies typically store different information about their customers in different tables. For example, one table holds static information such as gender, address, and age, while another records the actions customers took in the past. Extracting information from several tables, where the amount of data per customer can vary widely, demands experience with data handling libraries (e.g., pandas) and is very time-consuming. With our platform, joining tables is as easy as clicking a button.
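To make the manual version of this task concrete, here is a minimal pandas sketch. The table names, column names, and values are made up for illustration, and this is not how our platform does it internally. The activity table is first collapsed to one row per customer and then left-joined onto the static table, so customers with no recorded activity are kept.

```python
import pandas as pd

# Hypothetical static table: exactly one row per customer.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "age": [34, 51, 27],
})

# Hypothetical activity table: zero or more rows per customer.
purchases = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "amount": [20.0, 35.0, 12.5],
})

# Collapse the activity table to one row per customer first, so that the
# join cannot duplicate customer rows.
activity = (
    purchases.groupby("customer_id")["amount"]
    .agg(n_purchases="count", total_spent="sum")
    .reset_index()
)

# A left join keeps customers with no activity (customer 3 here).
features = customers.merge(activity, on="customer_id", how="left")

# No activity rows means zero purchases, not an unknown value.
activity_cols = ["n_purchases", "total_spent"]
features[activity_cols] = features[activity_cols].fillna(0)

print(features)
```

Aggregating before joining is the design choice that matters here: joining the raw activity rows directly would repeat each customer once per purchase.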
Many datasets contain missing values. While some algorithms tolerate them, and while the fact that a value is missing can itself be informative, imputing missing values is often necessary and sometimes valuable in its own right. One way of dealing with missing data is to drop every sample that contains a missing value. This, of course, has the disadvantage of losing data. Another straightforward method is to fill in the mean (or the most frequent value, in the case of categorical features). But doing so ignores possible interactions between the features and is usually suboptimal. More sophisticated methods use neural networks, specifically autoencoders, to model the underlying patterns of the data and then use this model to infer the missing values. At Pecan we have developed our own autoencoder-based machinery for this purpose.
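The two simple strategies described above can be sketched in a few lines of plain Python. The rows and column names are hypothetical, and this covers only dropping and mean/mode filling, not the autoencoder-based approach.

```python
from statistics import mean, mode

# Hypothetical toy rows; None marks a missing value.
rows = [
    {"age": 30, "plan": "basic"},
    {"age": None, "plan": "premium"},
    {"age": 50, "plan": None},
    {"age": 40, "plan": "basic"},
]


def drop_incomplete(rows):
    """Keep only rows without any missing value (this loses data)."""
    return [r for r in rows if all(v is not None for v in r.values())]


def impute_simple(rows, numeric, categorical):
    """Fill numeric columns with the mean, categorical ones with the mode."""
    fill = {}
    for col in numeric:
        fill[col] = mean(r[col] for r in rows if r[col] is not None)
    for col in categorical:
        fill[col] = mode(r[col] for r in rows if r[col] is not None)
    return [
        {k: (fill[k] if v is None else v) for k, v in r.items()}
        for r in rows
    ]


dropped = drop_incomplete(rows)
imputed = impute_simple(rows, numeric=["age"], categorical=["plan"])

print(len(rows), len(dropped), len(imputed))
```

Note that each column is filled independently of the others, which is exactly the weakness mentioned above: any relationship between age and plan is ignored.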
When preparing the data for a learning task, one needs to be very careful not to give the algorithm access to information that it will not have in reality. A breach in this respect leads to what is called data leakage. This means, at the very least, that the results in deployment are worse than in the lab. At Pecan, we associate every observation with a time period and make sure that no data from the future leaks into the model, thus reducing the chances of leakage.
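As an illustration of the time-based idea, here is a minimal sketch that builds one training example around a cutoff date. The event log, the feature names, and the `build_example` helper are hypothetical, not our platform's implementation. Features may only look at events before the cutoff, and the label only at events on or after it.

```python
from datetime import date

# Hypothetical event log: (customer_id, event_date, amount).
events = [
    (1, date(2023, 1, 10), 20.0),
    (1, date(2023, 3, 5), 35.0),
    (2, date(2023, 2, 1), 12.5),
    (2, date(2023, 4, 20), 80.0),
]


def build_example(events, customer_id, cutoff):
    """Features use only events before the cutoff; the label looks after it."""
    past = [a for c, d, a in events if c == customer_id and d < cutoff]
    future = [a for c, d, a in events if c == customer_id and d >= cutoff]
    features = {"n_past_events": len(past), "past_spend": sum(past)}
    label = len(future) > 0  # did the customer act on or after the cutoff?
    return features, label


features, label = build_example(events, customer_id=2, cutoff=date(2023, 3, 1))
print(features, label)
```

Keeping the cutoff as an explicit argument makes the boundary between "known at prediction time" and "to be predicted" visible in the code, rather than implicit in how the tables happen to be filtered.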
Dealing with categorical features is not trivial in machine learning, since most algorithms cannot handle non-numerical data. Consequently, categories have to be replaced, somehow, by numbers. A common approach is one-hot encoding, but it does not work well when the number of categories is large. On the other hand, replacing categorical values arbitrarily with numbers makes the algorithm assume that there is some ordering between the categories. For example, if the entries USA and France are replaced with 1 and 2, respectively, then the algorithm is forced to treat France as '2 times USA'. How to handle this optimally is still an open question, but the trend is to embed the categorical values in a vector space in such a way that their relative locations are meaningful for the problem at hand. This usually involves training a neural network and is done automatically by our platform.
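The difference between the two basic encodings can be seen in a short plain-Python sketch. The function names are made up for illustration, and the learned embeddings mentioned above are not shown.

```python
def ordinal_encode(values):
    """Map each category to an arbitrary integer (implies a false ordering)."""
    categories = sorted(set(values))
    index = {c: i + 1 for i, c in enumerate(categories)}
    return [index[v] for v in values], index


def one_hot_encode(values):
    """Map each category to a 0/1 indicator vector, one slot per category."""
    categories = sorted(set(values))
    vectors = [[1 if v == c else 0 for c in categories] for v in values]
    return vectors, categories


countries = ["USA", "France", "USA", "Japan"]
codes, index = ordinal_encode(countries)
vectors, categories = one_hot_encode(countries)

print(index)
print(vectors)
```

The one-hot vectors carry no ordering, but each one is as long as the number of distinct categories, which is why this encoding becomes unwieldy for features with thousands of values.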
Most of the more powerful machine learning models come with a set of hyperparameters: settings that are fixed prior to the learning process and determine how the model trains on the data. In some cases the default hyperparameters perform well (e.g., LightGBM), and in others there is simply no default (e.g., neural networks). The search for better hyperparameters consists of assigning specific values and checking whether performance improves. This process is very time-consuming, since evaluating every assignment requires training a model, and some intuition is needed about which assignments are more promising. Finding the best hyperparameters with the fewest evaluations is the purpose of automated hyperparameter optimization. From the variety of algorithms that perform this task, we chose those that use a Bayesian approach. This means that, based on previous evaluations, a probabilistic model of the loss function is built and used to guide the next assignment.
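To show the basic "assign values, train, compare" loop, here is a deliberately simple sketch. It uses plain random search, not the Bayesian approach described above, and `validation_loss` is a made-up stand-in for training a real model. A Bayesian optimizer would replace the random draw with a proposal guided by the losses observed so far.

```python
import random


def validation_loss(learning_rate, depth):
    """Hypothetical stand-in for 'train a model, return its validation loss'."""
    return (learning_rate - 0.1) ** 2 + 0.01 * (depth - 6) ** 2


def random_search(n_trials, seed=0):
    """Try random hyperparameter assignments and keep the best one seen."""
    rng = random.Random(seed)
    best_params, best_loss = None, float("inf")
    for _ in range(n_trials):
        params = {
            "learning_rate": rng.uniform(0.01, 0.3),
            "depth": rng.randint(2, 10),
        }
        loss = validation_loss(**params)  # in practice: one full training run
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss


best_params, best_loss = random_search(n_trials=50)
print(best_params, best_loss)
```

Every pass through the loop costs one training run, which is why methods that need fewer evaluations to find a good assignment are attractive.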
When the amount of data gets really big, standard single-machine tools fail. Our platform is built on Apache Spark, which lets you use all of your data, so that the learning models are exposed to even the tiniest of nuances.
Every machine learning algorithm has its disadvantages, and often the best performance is achieved by combining several algorithms. Our platform trains several different algorithms and then combines their predictions to obtain superior performance.
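One simple way to combine models is to average their predictions. The sketch below uses made-up predictions from two hypothetical models and is only an illustration of the idea, not necessarily the combination scheme our platform uses.

```python
def average_ensemble(prediction_sets):
    """Average the predictions of several models, sample by sample."""
    n_models = len(prediction_sets)
    return [sum(preds) / n_models for preds in zip(*prediction_sets)]


def mean_squared_error(y_true, y_pred):
    """Mean of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [1.0, 0.0, 1.0, 0.0]
model_a = [0.9, 0.4, 0.6, 0.1]  # hypothetical predictions of model A
model_b = [0.6, 0.1, 0.9, 0.4]  # hypothetical predictions of model B

combined = average_ensemble([model_a, model_b])

for name, preds in [("A", model_a), ("B", model_b), ("A+B", combined)]:
    print(name, mean_squared_error(y_true, preds))
```

In this toy case the two models make their larger mistakes on different samples, so the average has a lower error than either one. When models make the same mistakes, averaging helps far less.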
Once a model has been trained, it is very informative for a business owner to understand which aspects of the data have the most effect on the behavior of their customers. This is called feature importance, and it gives a global view of the task. A deeper understanding is obtained by analyzing how the model responds to each individual sample, which makes it possible to take actions targeted at individuals. Our platform provides both.
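One common way to estimate global feature importance is permutation importance: shuffle one feature's column and measure how much the error grows. The sketch below applies it to a hypothetical `model` function that stands in for a trained model. It is an illustration of the general idea, not a description of how our platform computes importance.

```python
import random


def model(row):
    """Hypothetical fitted model: depends on feature 0, ignores feature 1."""
    return 3.0 * row[0]


def mse(rows, targets):
    """Mean squared error of the model on the given rows."""
    return sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)


def permutation_importance(rows, targets, n_features, n_repeats=5, seed=0):
    """Average increase in error when one feature's column is shuffled."""
    rng = random.Random(seed)
    baseline = mse(rows, targets)
    importances = []
    for j in range(n_features):
        increase = 0.0
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)
            # Rebuild the rows with only column j replaced by shuffled values.
            shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            increase += mse(shuffled, targets) - baseline
        importances.append(increase / n_repeats)
    return importances


rows = [[float(i), float(i % 3)] for i in range(10)]
targets = [3.0 * r[0] for r in rows]

importances = permutation_importance(rows, targets, n_features=2)
print(importances)
```

Shuffling a feature the model relies on breaks its link to the target and raises the error, while shuffling a feature the model ignores changes nothing.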
Clustering is a type of unsupervised learning aimed at finding meaningful intrinsic structure within the data by grouping similar points together. Dividing the data into segments of similar individuals enables decision-makers to target specific groups with campaigns better suited to them. This is another capability of our platform.
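To illustrate grouping similar points together, here is a minimal k-means sketch on a single numeric feature. K-means is chosen here only as a familiar example and is an assumption, as are the function name, the starting rule for the centroids, and the spend values.

```python
def kmeans_1d(values, k, n_iter=20):
    """Lloyd's algorithm (k-means) on a single numeric feature."""
    ordered = sorted(values)
    # Deterministic start: k values spread evenly across the sorted data.
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(n_iter):
        # Assignment step: each value joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda c: abs(v - centroids[c]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster,
        # keeping the old centroid if the cluster ended up empty.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    labels = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
    return labels, centroids


monthly_spend = [10, 12, 11, 95, 100, 105]  # hypothetical customer spend
labels, centroids = kmeans_1d(monthly_spend, k=2)
print(labels, centroids)
```

Here the low spenders and the high spenders end up in separate segments, each of which could then receive its own campaign.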
A fundamental assumption in machine learning is that the data encountered in deployment comes from the same distribution as the data used for learning. Deviations from this assumption usually degrade the performance of the model. Our platform automatically tracks the performance of the models and schedules a new training session whenever necessary.
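A minimal sketch of this kind of tracking could look like the following. The `PerformanceMonitor` class, the accuracy metric, the window size, and the tolerance are all assumptions made for illustration, not a description of our platform's scheduler.

```python
from collections import deque


class PerformanceMonitor:
    """Track recent prediction accuracy and flag when retraining looks due."""

    def __init__(self, baseline_accuracy, tolerance=0.05, window=100):
        self.baseline = baseline_accuracy
        self.tolerance = tolerance
        # Only the most recent `window` outcomes are kept: 1 = correct, 0 = wrong.
        self.outcomes = deque(maxlen=window)

    def record(self, predicted, actual):
        """Store whether one prediction turned out to be correct."""
        self.outcomes.append(1 if predicted == actual else 0)

    def needs_retraining(self):
        """True once recent accuracy falls clearly below the baseline."""
        if len(self.outcomes) < self.outcomes.maxlen:
            return False  # not enough evidence yet
        recent = sum(self.outcomes) / len(self.outcomes)
        return recent < self.baseline - self.tolerance


monitor = PerformanceMonitor(baseline_accuracy=0.9, tolerance=0.05, window=10)

for _ in range(10):
    monitor.record(predicted=1, actual=1)
healthy = monitor.needs_retraining()

for _ in range(5):
    monitor.record(predicted=1, actual=0)
degraded = monitor.needs_retraining()

print(healthy, degraded)
```

The tolerance keeps ordinary fluctuations from triggering a retraining session, while the fixed window ensures that old successes do not mask a recent drop.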
We believe that we have solved some of the most burdensome challenges facing data scientists, and that the road is now clear for building accurate models in a much shorter time.