Feature Engineering for Machine Learning: Creating Predictive Models

Yonatan Natan

In this three-part series, I will explain the process of feature engineering for predictive modeling, discuss why automated feature engineering can be very useful, and show an example of how we at Pecan use autoencoders to create new features automatically.

What is feature engineering?

Structured and unstructured features

Machine learning models usually create their predictions using a set of inputs called features. Features can be broadly classified into two types:

1.   Structured features - tables of data with different attributes: price, number of occurrences, length, etc.

Figure 1 - Structured housing data

2.   Unstructured features - image, video, sound signal, radar signal, etc.

Figure 2 - Unstructured image dataset (CIFAR-10). Each image is represented as a tensor of 32x32x3 pixel values.

Structured data features can have different types of values. For example, in a car-related dataset, we might find features such as the number of doors, engine size, and car type. Each of these features has its own set of values: an integer in the range 1-7 for doors, a float in the range 200-6000 for engine size, and a string label such as “truck,” “sport,” “sedan,” or “SUV” for car type.
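To make the mix of value types concrete, here is a minimal sketch of such a table using pandas with made-up values (the column names and numbers are illustrative, not from a real dataset):

```python
import pandas as pd

# Hypothetical car dataset illustrating mixed feature types in structured data
cars = pd.DataFrame({
    "doors": [2, 4, 5],                       # integer count
    "engine_size": [1600.0, 2500.0, 5700.0],  # float
    "car_type": ["sport", "sedan", "truck"],  # categorical string label
})

# Each column carries its own data type
print(cars.dtypes)
```

Notice that each column has a distinct dtype, which is exactly what makes structured data heterogeneous.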

Unstructured data, by contrast, is homogeneous: all the features share the same data type. Image data, for example, consists of many pixels, each of which is a vector of three RGB numbers. Text data is a sequence of characters, where each character is drawn from the same pool of characters.
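The CIFAR-10 representation mentioned in Figure 2 can be sketched with a NumPy array (a blank placeholder image, just to show the shape):

```python
import numpy as np

# A single CIFAR-10-style image: 32x32 pixels, each a vector of 3 RGB values.
# All 3072 values share one data type - the hallmark of unstructured data.
image = np.zeros((32, 32, 3), dtype=np.uint8)

print(image.shape)  # (32, 32, 3)
print(image.size)   # 3072 raw pixel features per image
```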

Raw features and derived features

The original features of the data are called raw features. We can always create additional features by manipulating the raw features in various ways. Raw features can be aggregated using averages, sums, or counts. We can create different representations of features using transformations such as principal component analysis and Fourier transforms. We can multiply or add features to get a more complex derived feature, as when combining the body mass and height of patients to create a body mass index (BMI).
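The BMI case is a one-line derivation. A minimal sketch with hypothetical patient data (BMI is mass in kilograms divided by height in meters squared):

```python
import pandas as pd

# Hypothetical patient data: two raw features
patients = pd.DataFrame({
    "mass_kg": [70.0, 85.0, 60.0],
    "height_m": [1.75, 1.80, 1.65],
})

# Derived feature: BMI = mass / height^2
patients["bmi"] = patients["mass_kg"] / patients["height_m"] ** 2
```

The derived `bmi` column encodes a medically meaningful relationship that a model would otherwise have to discover on its own.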

The process of deriving new features from raw features is called feature engineering.

Why should we do feature engineering at all? Is it adding any new data?

Actually, with a perfect machine learning algorithm, feature engineering would not be necessary: a perfect learning model would find its own way to manipulate the raw features and yield optimal predictions. But in reality, raw features can be combined in endless ways, and trying all possible combinations is difficult (there are some clever attempts, though, like Featuretools for Python). That’s why data scientists work hard and constantly consult with domain experts to engineer better features for their machine learning models. Feature engineering does not create new samples of data, but it directs the machine learning model towards derived features that are easier to relate to the modeling target.

Common types of feature engineering

1.     A simple combination of features. Raw features can be combined in all possible ways to create more meaningful representations.

For example:

a.     Two coordinate features can be combined to yield a distance, which in some cases is more informative.

b.     The logarithm of a feature can linearize an exponential feature-target relation. In some cases, it can also shift the distribution of values towards a more ‘normal’ shape. This is commonly used with financial data; the same transformation is sometimes applied to the target rather than to a feature.
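Both combinations above can be sketched in a few lines. Here is a minimal example with made-up delivery-trip data (column names and values are hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "pickup_x": [0.0, 3.0], "pickup_y": [0.0, 4.0],
    "dropoff_x": [3.0, 3.0], "dropoff_y": [4.0, 4.0],
    "price": [10.0, 1000.0],
})

# (a) Combine two coordinate pairs into a Euclidean distance feature
df["distance"] = np.hypot(df["dropoff_x"] - df["pickup_x"],
                          df["dropoff_y"] - df["pickup_y"])

# (b) Log-transform a heavy-tailed monetary feature
#     (log1p handles zero values safely)
df["log_price"] = np.log1p(df["price"])
```

After the transform, the two prices that differed by two orders of magnitude sit on a much more compact scale.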

2.     Aggregations of data to a single number. When aggregating, we take several values and combine them with an aggregating function into a single value. In addition to providing higher-level information, aggregation standardizes data with a varying number of values per sample into a fixed shape (where all samples have the same number of features), a necessity for many machine learning models.

For example:

a.     Use a running average of past sales to predict future sales.

b.     Count the number of previous loans for each loan applicant.
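The two aggregations above can be sketched with pandas on hypothetical data (the series and IDs are illustrative):

```python
import pandas as pd

# (a) Running average of past sales over a 3-period window
sales = pd.Series([100, 120, 90, 110, 130])
running_avg = sales.rolling(window=3).mean()

# (b) Number of loans per applicant: many rows per applicant
#     are aggregated into a single count
loans = pd.DataFrame({"applicant_id": [1, 1, 2, 1, 2]})
loan_counts = loans.groupby("applicant_id").size()
```

In both cases a variable-length history is reduced to a fixed number of values per sample, which is what most models require.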

In summary, feature engineering is very important when creating data models and is a major factor in boosting performance. Better features can produce simpler and more flexible models that often yield better results.

The next parts of this series will discuss methods of automated feature engineering and focus on using autoencoder deep neural networks to create high-level features.
