Have you ever heard of Orange Data Mining? It is a powerful data mining framework that includes a Python API and a visual GUI.
This blog post is for people who want to add new attributes dynamically to a data set. It also serves as a good example of scripting in Orange.
The Problem
I have a CSV dataset in which one attribute can hold several values at once, and I want to create a new attribute for every value that appears in that multi-valued attribute.
Here is a simple example that is easier to understand than my actual dataset.
Initial dataset
country | flag_colors |
---|---|
italy | green,white,red |
united kingdom | red,blue,white |
russia | white,blue,red |
canada | red,white |
brazil | green,blue,yellow |
germany | black,red,yellow |
Target dataset
country | ... | green | white | red | blue | yellow | black |
---|---|---|---|---|---|---|---|
italy | ... | 1 | 1 | 1 | 0 | 0 | 0 |
united kingdom | ... | 0 | 1 | 1 | 1 | 0 | 0 |
russia | ... | 0 | 1 | 1 | 1 | 0 | 0 |
canada | ... | 0 | 1 | 1 | 0 | 0 | 0 |
brazil | ... | 1 | 0 | 0 | 1 | 1 | 0 |
germany | ... | 0 | 0 | 1 | 0 | 1 | 1 |
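Before moving to Orange, here is a minimal plain-Python sketch of the transformation shown in the two tables. It does not use Orange at all; the function name `expand_multi_values` and the list-of-tuples layout are assumptions made for illustration only.

```python
# Toy data copied from the "Initial dataset" table above.
rows = [
    ("italy", "green,white,red"),
    ("united kingdom", "red,blue,white"),
    ("russia", "white,blue,red"),
    ("canada", "red,white"),
    ("brazil", "green,blue,yellow"),
    ("germany", "black,red,yellow"),
]


def expand_multi_values(rows, separator=","):
    """Hypothetical helper: return (columns, encoded) with one 0/1 flag per value."""
    # First pass: collect the distinct values in order of first appearance.
    columns = []
    for _, raw in rows:
        for value in raw.split(separator):
            if value not in columns:
                columns.append(value)
    # Second pass: flag which of the known values each row contains.
    encoded = {}
    for key, raw in rows:
        present = set(raw.split(separator))
        encoded[key] = [1 if column in present else 0 for column in columns]
    return columns, encoded


columns, encoded = expand_multi_values(rows)
print(columns)
for country, flags in encoded.items():
    print(country, flags)
```

The two passes mirror what any solution has to do: the full list of values must be known before the new columns can be created, and only then can each row be filled in.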
Custom Script Solution
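I have annotated my solution. It is meant to run inside Orange's Python Script widget, which provides the input table as in_data and reads the result from out_data.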
from Orange.data import Table, Domain, ContinuousVariable
import copy

attributes_to_expand = ['flag_colors']  # Multi-valued columns to expand
separator = ","  # Matches the example above; can be changed to | ; -

# Structures used to build the new model
attributes_to_keep = []
class_vars_to_keep = []
metas_to_keep = []
variables = []  # Old and new variables
all_values = dict()  # Values to keep for each multi-valued column

# Build the set of all known values for each column to expand
for data in in_data:
    for att_exp in attributes_to_expand:
        values = data[att_exp].value.split(separator)
        for v in values:
            if att_exp not in all_values:
                all_values[att_exp] = set()  # At most one new attribute per value
            all_values[att_exp].add(v)

# Keep existing metas and class variables (target)
for orig_meta in in_data.domain.metas:
    metas_to_keep.append(orig_meta)
for orig_class_var in in_data.domain.class_vars:
    class_vars_to_keep.append(orig_class_var)

# Keep the variables that are not multi-valued
for orig_var in in_data.domain.variables:
    if orig_var.name in attributes_to_expand or orig_var in class_vars_to_keep:
        continue
    variables.append(copy.copy(orig_var))
    attributes_to_keep.append(orig_var)

# Add one new variable per known value (sorted for a stable column order)
for att_exp in attributes_to_expand:
    for v in sorted(all_values[att_exp]):
        variables.append(ContinuousVariable(att_exp + "=" + v))

# Output table construction
# The Domain describes the variables of our dataset
domain = Domain(variables, class_vars=class_vars_to_keep, metas=metas_to_keep)
# The Table includes both the domain definition and the data
table = Table.from_domain(domain, len(in_data))

# Rebuild the data line by line
for index, data in enumerate(in_data):
    # Variables that we keep as-is
    for att in class_vars_to_keep:
        table[index][att] = data[att].value
    for att in attributes_to_keep:
        table[index][att] = data[att].value
    for meta in metas_to_keep:
        table[index][meta] = data[meta].value
    # New variables: 1 if the value is present on this line, 0 otherwise
    for att_exp in attributes_to_expand:
        values_for_current_line = data[att_exp].value.split(separator)
        for v in all_values[att_exp]:
            table[index][att_exp + "=" + v] = 1 if v in values_for_current_line else 0

# Make the new dataset available to linked widgets
out_data = table
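The script calls `.value.split(separator)` directly on every cell, so stray spaces around a separator would create near-duplicate columns, and a missing cell could break the split. Below is a small sketch of a more defensive splitter. `split_values` is a hypothetical helper, not part of Orange, and the way a missing cell shows up (None, NaN, an empty string or "?") is an assumption that may differ in your Orange version.

```python
import math


def split_values(raw, separator=","):
    """Hypothetical helper (not an Orange API): split a multi-value cell into clean tokens."""
    # Assumption: a missing cell may arrive as None or NaN.
    if raw is None or (isinstance(raw, float) and math.isnan(raw)):
        return []
    # Trim the whitespace around each token so "white " and "white" match.
    tokens = [token.strip() for token in str(raw).split(separator)]
    # Assumption: "" and "?" also stand for "no value" and are dropped.
    return [token for token in tokens if token and token != "?"]


print(split_values("green, white ,red"))
print(split_values("?"))
print(split_values(float("nan")))
```

If this fits your data, the two `data[att_exp].value.split(separator)` calls in the script could be swapped for `split_values(data[att_exp].value, separator)`.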
Tips for Building Scripts
- Don't reuse objects such as the Domain's variables from in_data directly in out_data; copy them instead. Altering in_data can cause side effects.
- Don't forget to carry over class variables and metas. The information could be helpful later in your pipeline.
- print() debugging is your friend. There is also an interactive console that is very helpful.
- Some shortcuts are available: [Ctrl]-R: Run script, [Ctrl]-S: Save script, [Ctrl]-/: Comment line ... There is also a Vim mode for the purist. 😉
- Look at existing transformations before building your own.
Conclusion
Transforming multi-valued columns into multiple attributes is very helpful when you want to apply machine learning algorithms to classify and make predictions based on sample data.
Maybe this small script will save you some time...
UPDATE: There is now an official plugin that does the same thing!
twitter.com/OrangeDataMiner/status...