Over the years, IBM SPSS Modeler® (formerly Clementine) earned the reputation of being the model open environment across data mining platforms.
Its creators were conscious of two things: that they should not create a software program in a vacuum for its own sake, and that future needs and trends could not be precisely defined. This need for openness was a key driver in the early development of the Clementine data mining platform resulting in notable features such as:
- The automatic generation of SQL code by the Clementine application (as opposed to SQL that was hard-coded by the analyst), allowing for fast communication with databases.
- The CEMI interface which opened up the possibility to use additional data mining algorithms, e.g. libraries written in C, developed by third parties.
- The Solution Publisher technology which provided a scoring mechanism that worked independently of the Clementine application environment.
Today, we have tried to reflect this same level of openness in PS CLEMENTINE PRO, a solution built on IBM SPSS Modeler, with a particular focus on integration with R and Python. In this post, we will focus on Python and describe the benefits derived from using it. Whilst Python co-exists with the earlier script language version (the so-called legacy script, or “LS” for short), it has a significantly broader set of capabilities for interaction with the IT environment.
During PS CLEMENTINE PRO program installation a dedicated modeler.api library is installed. This library allows us to control objects in streams and supernodes as well as create scripts managing multiple streams (session scripts). Whilst this can be achieved using LS, Python can do more. Let's look at the first example to see some differences.
Python as a script language
One of features of Python is the possibility to use numerous libraries. For example, the os library provides access to files and folders on the computer's hard disk. Thus, from the PS CLEMENTINE PRO level we can not only define access paths to files (also achieved using LS), but also create folders, collect file information, remove files or change their names. The datetime library is a useful tool for automatically saving reports, in our case reports from each day will be included in separate, automatically created folders. Below is a sample of the code used to create a subfolder with the name being today's date (lines 1-9).
Figure 1. Folder creating script with the use of the os module
The additional two lines (lines 11-12) create access paths to the two separate reports (.csv) generated (for the Masovian and Lesser Poland voivodships), along with today's date in the name. Assuming that the current stream generates such reports with separate export nodes (e.g. 'Flat file' export nodes), we can automatically create paths that will be used by these nodes to save files.
The os library also enables search for folder content in order to find the file/folder with a specified name. In combination with the re library (handling of regular expressions) it can be used to e.g. load all the files with the names containing a specified fragment such as "report", ".csv" or "2020-06".
Accelerating you work
Python also allows the creation of your own functions, which can greatly simplify your work. This is very useful for handling complicated operations, loops or conditional instructions, and will not only shorten the working time, but will also shorten the code itself, improving its readability and reducing errors.
Figure 2. Python sample customized function
The sample function above returns the name and note of the selected node which can for instance be used for the documentation of the operations in a stream.
A user-defined function can be saved in a python script file (.py) or a text file and used in other scripts in the future. Nothing prevents us from sharing the created function with other team members who also use PS CLEMENTINE PRO. Furthermore, if we create a set of functions, Python allows the creation of a file (let's call it .our-library), which will operate in the same way as the above libraries. After the appropriate configuration, it would be sufficient to enter at the beginning of any script: "import our-library" to gain access to all the functions contained therein.
Python in PS CLEMENTINE PRO is used not only in stream scripts, but also by some nodes. We can find them on the node palette in the Python tab.
Figure 3. Python tab on the node palette
Here you will find additional algorithms and analytical techniques such as advanced machine learning models (XGBoost tree, Random forest, SVM, KDE), algorithms for data preprocessing needed for building other models (SMOTE), as well as nodes allowing import and export to the JSON format. All these algorithms are written in Python. NOTE: Some of them require installation of certain external libraries (e.g. imbalanced-learn for SMOTE).
For more advanced users, PS CLEMENTINE PRO offers exceptional functionalities included in extension nodes. These are the nodes inside which it is possible to enter code in the R language or in PySpark framework, namely Python API, to Spark - a programming environment for distributed calculations.
Figure 4. Extension nodes
These nodes correspond to the types of PS CLEMENTINE PRO nodes, therefore, here we have data import, record and variable processing, data modeling and export nodes. With them, we can e.g.:
- download data from a url address (and other sources handled by Python),
- load any number of files at once and flexibly combine them inside the same node,
- create variables in a new way – utilizing PySpark functions
- edit previous variables with the use of Python functions,
- create and assess models unavailable in basic Modeler nodes,
- create our own versions of Python algorithm interfaces with any options we recognize as important,
- generate results in the text and visual form,
- format and export data e.g. to the JSON file format
NOTE: The use of these functions requires at least basic knowledge about PySpark.
Figure 5. Sample code in PySpark framework
The example above presents a Transformation Extension node. Its task is to add a new variable containing the Levenstein distance (a metric of the distance between two character strings, the value of which increases along with the growth of the changes necessary to create string B from string A). In this example, the distance is measured between variables "proper_words" and "misspelled_words". Further in the code, after calculating the edit distance, the records are selected where the Levenstein distance is lower than 3.
The result of this node's operation is as follows.
Figure 6. Result from the Table node, containing the calculated Levenstein distance
Extension nodes allow use of different Python libraries. Therefore, nothing prevents us from using in our stream NumPy, pandas, Matplotlib, Seaborn or Scikit-learn. This gives us a very broad set of possibilities both in the context of data modeling as well as processing.
Using the Transformation Extension node it is also possible to use complex regular expressions using the re module. These in turn may support text mining studies, a broad topic which warrants a separate discussion.
We hope that, in this fast review of the functions, we have managed to encourage you to take a closer look at Python in PS CLEMENTINE PRO and start your own adventure with it.