Oracle Data Visualization Desktop v3
The ODTUG Kscope17 conference last week in San Antonio was a great event with plenty of very interesting sessions and networking opportunities. Rittman Mead participated in the Thursday BI deep-dive session and delivered three sessions, including a special "fishing" one.
In the meantime Oracle released Data Visualization Desktop 12.2.3.0.0, which was presented in detail during Philippe Lions' session and includes a set of new features and enhancements to existing functionality. In this post I'll go into detail on each of them, starting from the new data sources and moving through the new visualization options.
Data Sources
The following new data sources have been introduced:
- Oracle Docs: Oracle's content sharing cloud product
- OData: Open Data Protocol, a standard protocol for RESTful APIs.
- JDBC
- ODBC
The latter two (still in beta) are very relevant since they enable querying any product that directly exposes JDBC or ODBC connectors (like Presto) without having to wait for official support in the DVD list of sources.
DVD v3 still has no support for JSON or XML files. In an older blog post I described how JSON (and XML) can be queried in DVD using Apache Drill; however, that solution requires a Drill installation and knowledge of the tool, which is not always feasible in the end-user environments where self-service BI happens. I believe future versions of DVD will address this by providing full support for both data sources.
Connection to OBIEE
One of the most requested new features is the new interface for connecting to OBIEE: up to DVD v2 only pre-built OBIEE analyses could be used as sources, whereas DVD v3 exposes OBIEE Subject Areas and makes them accessible directly. The set of columns and filters can't be retrieved on the fly during project creation but must be defined upfront when the data source is created. This removes the need to move back and forth between OBIEE and DVD to create an analysis as a data source and then use it in DVD.
Another enhancement in the data source definition is the option to change the column delimiter in text sources, useful when the source uses an unusual delimiter.
Data Preparation
On the data-preparation side we have two main enhancements: the convert-to-date and the time grain level.
The convert-to-date feature improves column-to-date conversion, including the use of custom parsing strings. The feature still has some limits, such as not being able to parse dates like 04-January-2017, where the month name is written out in full. For this format a two-step approach is still required: shorten the month name first, then convert.
The second enhancement on the data preparation side is the time grain level and format: these options simplify the extraction of attributes (e.g. month, week, year) from date fields, which can now be done visually instead of by writing logical SQL.
The Dataflow component in DVD v3 has an improved UI with new column merge and aggregation functionalities, which make flow creation easier. Its output can now be saved as an Oracle database or Hive table, eliminating the need to store all the data locally.
It's worth mentioning that Dataflow is oriented towards self-service data management: any parsing or transformation happens on the machine where DVD is installed, and its configuration options are limited. If more robust transformations are needed, a proper ETL tool should be used.
New Visualization Options
There are several enhancements on the visualization side, the first being trendline confidence levels, which can now be shown at fixed intervals (90%, 95% or 99%).
Top N and bottom N filtering has been added for measure columns, expanding on the traditional "range" filter.
Two new visualizations have also been included: waterfall and boxplot are now default visualizations. Boxplots were available as a plugin in previous versions, but the five-number summary had to be pre-calculated; in DVD v3 the summary is calculated automatically based on the definition of category (x-axis) and item (the value within the category).
Other new options in the data visualization area include: the usage of logarithmic scale for graphs, the type of interpolation line to use (straight, curved, stepped ...), and the possibility to duplicate and reorder canvases (useful when creating a BI story).
Console
The last set of enhancements concerns the console: a new menu that lets end users perform tasks, such as uploading a plugin, that previously had to be done manually on the file system.
The new Oracle Analytics Store lists add-ins divided into categories:
- PlugIn: New visualizations or enhancements to existing ones (e.g. auto-refresh, providing behaviour similar to OBIEE's slider)
- Samples: Sample projects showing detailed DVD capabilities
- Advanced Analytics: custom R scripts providing non-default functionalities
- Map Layers: JSON shape files that can be used to render custom map data.
The process to include a new plugin in DVD v3 is really simple: after downloading it from the store, I just need to open DVD's console and upload it. After restarting the tool, the new plugin is available.
The same applies to Map Layers, while custom R scripts still need to be stored in the advanced_analytics\script_repository subfolder under the main DVD installation folder.
As we have seen in this post, the new Data Visualization Desktop release includes several enhancements that bring more agility to data discovery, both in connectivity to new sources (JDBC and ODBC) and in standard reporting, with OBIEE Subject Areas now accessible. The new visualizations, the Analytics Store and the plugin management console make the end-user workflow extremely easy, even when non-default features need to be incorporated. If you are interested in Data Visualization Desktop and want to understand how it can be used effectively against any data source, don't hesitate to contact us!
Common Questions and Misconceptions in The Data Science Field
There are many types of scenarios in which data science could help your business. For example, customer retention, process automation, improving operational efficiency or user experience.
It is not always initially clear, however, which questions to concentrate on or how to achieve your aims.
This post presents information about the type of questions you could address using your data and common forms of bias that may be encountered.
Types of Question
Descriptive: Describe the main features of the data; no further meaning is inferred. This will almost always be the first kind of analysis performed on the data.
Exploratory: Exploring the data to find previously unknown relationships. Some of the found relationships may define future projects.
Inferential: Looking at trends in a small sample of a data set and extrapolating to the entire population. In this type of scenario you would end up with an estimation of the value and an associated error. Inference depends heavily on both the population and the sampling technique.
Predictive: Look at current and historical trends to make predictions about future events. Even if x predicts y, x does not cause y. Accurate predictions are hard to achieve and depend heavily on having the correct predictors in the data set. Arguably, more data often leads to better results; however, large data sets are not always required.
Causal: To get at the real relationship between variables you need to use randomised controlled trials and measure average effects, i.e. if you change x by this much, how much does y change? Although this can be attempted on observational data, it requires huge assumptions and introduces large errors into the results.
Biases in data collection or cleaning
It is very easy to introduce biases into your data or methods if you are not careful.
Here are some of the most frequent:
Selection/sampling bias: If the population selected does not represent the actual population, the results are skewed. This commonly occurs when data is selected subjectively rather than objectively or when non-random data has been selected.
Confirmation bias: Occurs when there is an intentional or unintentional desire to prove a hypothesis, assumption, or opinion.
Outliers: Extreme data values that fall significantly outside the normal range can completely bias the results of an analysis; if they are not removed in such cases, the results can be misleading. These outliers are often interesting cases and ought to be investigated separately.
Simpson's Paradox: A trend that appears in the data can reverse when the data is split into its constituent groups.
Overfitting: Involves an overly complex model which overestimates the effect/relevance of the examples in the training data and/or starts fitting to the noise in the training data.
Underfitting: Occurs when the underlying trend in the data is not found. This could happen if you try to fit a linear model to non-linear data or if there is not enough data available to train the model.
Confounding Variables: Two variables may be assumed related when in fact they are both related to an omitted confounding variable. This is why correlation does not imply causation.
Non-Normality: If a distribution is assumed to be normal when it is not, the results may be biased and misleading.
Data Dredging: This process involves testing huge numbers of hypotheses about a single data set until the desired outcome is found.
Insights Lab
To learn more about the Rittman Mead Insights Lab, please read my previous blog post about our methodology, or contact us at info@rittmanmead.com.
OAC: Essbase – Incremental Loads / Automation
I recently detailed the data load possibilities of the tools provided with Essbase under OAC here. Whilst they are all very usable, my thoughts turned to the systems I have worked on and how their loads currently run, which led me to consider how you might perform incremental and/or automated loads for OAC Essbase.
A few background points:
- The OAC front end and EssCS command line tools contain a ‘clear’ option for data, but both are full data clears – there does not seem to be a partial or specifiable ‘clear’ available.
- The OAC front end and EssCS command line tools contain a ‘file upload’ function for (amongst other things) data, rules, and MAXL (msh) script files. Whilst the front-end operation has the ability to overwrite existing files, the EssCS Upload facility (which would be used when trying to script a load) seemingly does not – if an attempt is made to upload a file that already exists, an error is shown.
- The OAC ‘Job’ facility enables a data load to be conducted with a rules file; the EssCS Dataload function (which would be used when trying to script a load) seemingly does not.
- MAXL still exists in OAC, so it is possible to operate at Essbase ‘command level’.
Whilst the tools that are in place all work well and are fine for migration or other manual / ad hoc activity, I am not sure what the intended practice might be around some ‘real world’ use cases; a couple of things that spring to mind are:
- Incremental loads
- Scheduled loads
- Large ASO loads (using buffers)
Incremental Loads
It is arguably possible to perform an incremental load in that
- A rules file can be crafted in on-prem and uploaded to OAC (along with a partial datafile)
- Loads appear to be conducted in overwrite mode, meaning changed and new records will be handled ok
It is possible that (eg) a ‘current month’ data file could be loaded and reloaded to form an incremental load of sorts. The problem here will come if data is deleted for a particular member combination in the source from one day to the next – with no partial clear (eg, of current month data) seemingly possible, there is no way of clearing redundant values (at least for an ASO cube…for a BSO load, the ‘Clear combinations’ functionality of the load rules file can be used…although that has not yet been tested on this version).
So in the case of an ASO cube, the only option using the available tools would be to ensure that ‘contra’ records are added to the incremental load file. This is not ideal, as it is another step in data preparation, and it would also add unnecessary zeros to the cube. For these reasons, I would generally look to effect a partial clear of the ‘slice’ being loaded before proceeding with the incremental load.
The only way I can see of achieving this under OAC would be to take advantage of the fact that MAXL is available and effect the clear using alter database clear data.
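For an ASO cube the region to clear is described with an MDX set expression; a minimal sketch of the kind of statement involved is shown below. The application, cube and member names are illustrative, and the exact syntax should be checked against the MAXL reference for your version.

```
/* clear just the 'current month' slice of an ASO cube before reloading it */
/* ASOSamp.Basic and the member names are placeholders in this sketch      */
alter database ASOSamp.Basic clear data in region
   'Crossjoin({[Curr Year]}, {[Jun]})' physical;
```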
This means that the steps required might be
- Upload prepared incremental data file (either manually via OAC or via EssCS UploadFiles after having first deleted the existing file)
- Upload on-prem prepared rules file (either manually via OAC or via EssCS UploadFiles after having first deleted the existing file)
- Access the OAC server (eg via Putty), start MAXL, and run a command to clear the required slice / merge slices (if necessary)
- In OAC, create / run a job for the specified data file / rules file
I may have missed something, but I see no obvious way of being able to automate this process with the on-board facilities.
Automating the load process
Along with the points listed above, some other facts to be aware of:
- It is possible to manually transfer files to OAC using FTP
- It is possible to amend the cron scheduler for the oracle user in OAC
Even bearing in mind the above, I should caveat this section by saying getting ‘under the hood’ in this way is possibly not supported or recommended, and should only be undertaken at your own risk.
Having said that…
By taking advantage of the availability of FTP and cron, it should be possible to script a solution that can run unattended, for full and incremental loads. Furthermore, data clears (full or partial) can be included in the same process, as could parallel buffer loading for ASO or any other MAXL-controllable process (within the confines of this version of Essbase).
The OAC environment
A quick look around discloses that the /u01/latency directory is roughly the equivalent of the ../user_projects/epmsystem1/EssbaseServer/essbaseserver1 (or equivalent) directory in an on-prem release in that it contains the /app ‘parent’ directory which in turn contains a subdirectory structure for all application and cube artefacts. Examining this directory for ASOSamp.Basic shows that the uploaded dataload.* files are here, along with all other files listed by the Files screen of OAC:
Note that remote connection is via the opc user, but this can be changed to oracle once connected (by using sudo su – oracle).
As oracle, these files can be manually deleted…doing so means they will no longer be found by the EssCS Listfiles command or the Files screen within OAC (once refreshed). If deleted manually, new versions of the files can be re-uploaded via either of the methods detailed above (whilst an overwrite option exists in the OAC Files facility, there seems to be no such option with the EssCS Upload feature…trying to upload a file that already exists results in an error).
All files are owned by the oracle user, with no access rights at all for the opc user that effects a remote connection via FTP.
Automation: Objectives
The objective of this exercise was to come up with a method that, unattended, would:
- Upload received files (data, rules) to OAC from a local source
- Put them in the correct OAC directory in a usable format
- Invoke a process that runs a pre-load step (eg a clear), the load itself, and (if necessary) a post-load step
- Clear up after itself
Automation: The Process
The first job is to handle the upload of files to OAC. This could be achieved via a psftp script that uploads the entire contents of a nominated local directory:
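A minimal sketch of such a script pair might look like this; the host address, private key file and local folder are assumptions, and psftp is the PuTTY command-line SFTP client:

```
rem EssCSUpload.bat -- sketch only; host address, key file and paths are assumptions
psftp opc@<oac-instance-address> -i c:\keys\oac_opc.ppk -b c:\oac_upload\putfiles.txt
```

with the psftp batch file it calls (putfiles.txt) containing something like:

```
lcd c:\oac_upload
cd /u01/latency/CUSTOM_receive
mput *
chmod 666 *
quit
```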
The EssCSUpload.bat script above (which can, of course, be added to a local scheduler so that it runs unattended at appointed times) passes a pre-scripted file to psftp to connect and transfer the files. Note that the opc user is used for the connection, and the files are posted to a custom-created directory, CUSTOM_receive (under the existing /u01/latency). The transferred files are also given a global ‘rw’ attribute to assist with later processing.
Now the files are in the OAC environment, control is taken up there.
A shell script (DealWithUploads) is added to the oracle home directory:
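A minimal sketch of what such a script might contain is shown below; the directory paths follow the structure described in this post but should be treated as assumptions for any given instance, as should the name and location of the MAXL script it calls:

```
#!/bin/bash
# DealWithUploads.sh -- sketch only; paths and script names are assumptions
RECV=/u01/latency/CUSTOM_receive
DEST=/u01/latency/app/ASOSamp/Basic
MSH=/u01/app/oracle/tools/home/oracle/LoadASOSamp.msh

# copy with -p so the global 'rw' attribute set at upload time is retained
cp -p "$RECV"/* "$DEST"/

# remove the received files so they are not processed again on the next run
rm -f "$RECV"/*

# invoke the pre-prepared MAXL (msh) script via startMAXL
startMAXL.sh "$MSH"
```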
This copies all the files in the nominated receiving directory to the actual required location – in this case, the main ASOSamp/Basic directory. Note the use of ‘-p’ with the copy command to ensure that attributes (ie, the global ‘rw’) are retained. Once copied, the files are deleted from the receiving directory so that they are not processed again.
Once the files are copied into place, startMAXL is used to invoke a pre-prepared msh script:
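The msh script itself might contain something along these lines; the credentials, file names and error log location are placeholders, and the full reset could be swapped for the partial ASO clear shown earlier:

```
/* LoadASOSamp.msh -- sketch; user, password and file names are placeholders */
login admin identified by 'password' on localhost;

/* full data clear -- a 'clear data in region' statement could be used
   instead for a partial (ASO) clear */
alter database ASOSamp.Basic reset data;

/* reload from the uploaded data file using the uploaded rules file */
import database ASOSamp.Basic data
   from server data_file 'dataload.txt'
   using server rules_file 'dataload'
   on error write to '/tmp/ASOSamp_load.err';

logout;
exit;
```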
As can be seen, this clears the cube and re-imports from the uploaded file using the uploaded rules file. The clear here is a full reset, but a partial clear (in the case of ASO) could be used instead if required.
As with the ‘local’ half of the method, the DealWithUploads.sh script file can be added to the scheduler on OAC: the existing cron entries are already held in the file /u01/app/oracle/tools/home/oracle/crontab.txt; it is a simple exercise to schedule a call to this new custom script.
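By way of illustration, the entry added might look something like this; the schedule and log file are arbitrary choices, and the script path assumes it lives in the oracle home directory mentioned above:

```
# run the upload-handling script at 06:30 every day, appending output to a log
30 6 * * * /u01/app/oracle/tools/home/oracle/DealWithUploads.sh >> /u01/app/oracle/tools/home/oracle/DealWithUploads.log 2>&1
```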
A routine such as this would need a good degree of refinement and hardening – the file lists for the transfers should be self-building, passwords need to be encrypted, the MAXL script should only be called if required, the posting locations for files should be content/context sensitive, etc – but in terms of feasibility testing the requirements listed above, it was successful.
This approach places additional directories and files in an environment / structure that could be maintained at any time: it is therefore imperative that some form of code control / release mechanism is employed so that it can be replaced in the event of any unexpected / uncontrollable maintenance taking place on the OAC environment that could invalidate or remove it.
Even once hardened, I think there is a considerable weak spot in this approach in that the rules file seemingly has to be crafted in an on-prem environment and uploaded: as I detailed here, even freshly-uploaded, working rules files error when an attempt is made to verify them. For now, I’ll keep looking for an alternative.
Summary
Whilst a lot of the high-level functionality is in place around data loads, often with multiple methods, I think there are a couple of areas of detailed functionality that may currently require workarounds. To my mind, the ability to select and run an msh-format ‘pre-load’ script when running a data load Job (eg for clears) would be useful, while a fully functional rules file editor strikes me as important. The fact that an FTP connection is available at all is a bonus, but because it is as a non-oracle user, it is not possible to put a file in the correct place directly; the EssCS Upload facility does this, of course, but the apparent absence of an overwrite option (or an additional Delete option for EssCS) somewhat limits its usefulness at this point. But can you implement an unattended, scheduled or incremental load routine? Sure you can.
OAC: Essbase – Loading Data
After my initial quick pass through Essbase under OAC here, this post looks at the data loading options available in more detail. I used the provided sample database ASOSamp.Basic, which first had to be created, as a working example.
Creating ASOSamp
Under the time-honoured on-prem install of Essbase, the sample applications were available as an install option – supplied data has to be loaded separately, but the applications / cubes themselves are installed as part of the process if the option is selected. This is not quite the same under OAC – some are provided in an easily installable format, but they are not immediately available out-of-the-box.
One of the main methods of cube creation in Essbase under OAC is via the Import of a specifically formatted Excel spreadsheet, and it is via the provision of downloadable pre-built ‘template’ spreadsheets that the sample applications are installed in this version.
After accessing the homepage of Essbase on OAC, download the provided cube creation template – this can be found under the ‘Templates’ button on the home page:
Note that in the case of the ASOSamp.Basic sample database, the data is not in the main template file – it is held in a separate file. This is different to other examples, such as Sample.Basic, where the data provided is held in a dedicated tab in the main spreadsheet. Download both Aggregate Storage Sample and Aggregate Storage Sample Data:
Return to the home page, and click Import. Choose the spreadsheet downloaded as Aggregate Storage Sample (ASO_Sample.xlsx) and click Deploy and Close.
This will effect all of the detail in the spreadsheet – create the application, create the cube, add dimensions / attribute dimensions and members to the outline, etc:
Loading ASOSamp.Basic
Because the data file is separate from the spreadsheet, the next step is to upload it to OAC so that it is available for loading: back on the home page, select the newly-created ASOSamp.Basic (note: not ASOSamp.Sample as with on-prem), and click Files:
In the right-hand window, select the downloaded data file ASOSampleData.txt and click the Upload button:
This will upload the file:
Once the file upload is complete, return to the home page. With the newly-created ASOSamp.Basic still selected, click Jobs:
Choose Data Load as the Job Type, and highlight the required Data File:
Click Execute.
A new line will be added to the Job Monitor:
The current status of the job is shown – in this case, ‘in progress’ – and the screen can be refreshed.
Once complete, the Status field will show the completion state of the job, whilst the Job Details icon on the right-hand side provides more detail – in this case, confirming that 311,795 records were successfully loaded, and 0 rejected:
The success of the load is confirmed by a quick look in Smartview:
Note that a rules file was not selected as part of the job – this makes sense when we look at the data file…
...which is familiar-looking: just what we would expect from an EAS export (MAXL: export database), which can of course be loaded in a similar no-rules-file way on-prem.
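For reference, an export of that kind is produced on-prem with a MAXL statement along these lines; the application, cube and file names are illustrative, and the exact options available vary by version:

```
/* produce a native-format, level-0 data export that can be reloaded without a rules file */
export database ASOSamp.Basic level0 data to data_file 'ASOSamp_data.txt';
```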
Incidentally, this is different to the on-prem approach to ASOSamp.Sample where a ‘flat’, tab-delimited data file is provided for the sample data, along with a rules file that is required for the load:
...although the end-results are the same:
This ‘standard’ load works in overwrite mode – any new values in the file will be added, but any that already exist will be overwritten: running the load again and refreshing the Smartview report produces the same numbers, confirming this.
This can be verified further by running with a changed data file: taking a particular node of data for the Units measure…
One of the constituent data values can be changed in a copy of the data file – in this example, one record (it doesn’t matter which for this purpose) can be increased – in this case, ‘1’ has been increased to ‘103’:
The amended file needs to be saved and uploaded to OAC as outlined above, and the load process repeated, this time using the amended file. After a successful load, the aggregated value on the test Smartview report has increased by the same 102:
Loading flat files
So, how might we load the same sort of flat, tab-delimited file as is supplied for the on-prem ASOSamp.Sample data?
As above, files can be uploaded to OAC, so putting the dataload.txt data file from the on-prem release into OAC is straightforward. However, as you’d expect, attempting to run this as a load job without a rules file results in an error.
However, it is possible to run an OAC load with a rules file created in an on-prem version: firstly, upload the rules file (in this case, dataload.rul) in the same way as the data file. When setting up the load job, select the data file as normal, but under Scripts select the rules file required:
The job runs successfully, with the ‘Details’ overlay confirming the successful record count.
As with rules files generated by the Import facility, uploaded rules files can also be edited in text mode:
It would seem logical that changing the dataLoadOptions value at line 215 to a value other than OVERWRITE (eg ADD) might be a quick behavioural change for the load that would be easy to effect. However, making this change resulted in verification errors. Noting that the errors related to invalid dimension names, an attempt was made to verify the actual, unchanged rules file as uploaded…which also resulted in the same verification errors. So somewhat curiously, the uploaded on-prem rules file can be successfully used to load a corresponding data file, but (effectively) can’t be edited or amended.
Loading from Spreadsheet Template
The template spreadsheets used to build applications can also contain one or more data tabs. Unlike the OAC Jobs method or EssCS Dataload, the spreadsheet method gives you the option of a rules file AND the ability to Add (rather than overwrite) data:
Within OAC, this is actioned via the ‘Import’ function on the home page:
Note that we are retaining all data, and have the Load Data box checked. Checks confirm the values in the file are added to those already in the cube.
The data can also be uploaded via the Cube Designer in Excel under Cube Designer / Load Data:
Note that unlike running this method under OAC, the rules file (which was created by the initial import as the Data tab existed in the spreadsheet at that point) has to be selected manually.
Once complete, an offer is made to view the Job Status Viewer (which can also be accessed from Cube Designer / View Jobs):
With further detail for each job also being available:
Loading via the Command Line Interface (EssCS)
Given the ability to upload and run both data and rules files, the next logical step would be to script this for automated running. OAC provides a downloadable utility, the Command Line Tool (aka CLI or EssCS), a set of command-line tools that can be run locally against an OAC instance of Essbase:
Login / Logout
Calc
Dataload
Dimbuild
Clear
Version
Listfiles
Download
Upload
LcmExport
LcmImport
Running locally, a successful EssCS login effectively starts a session that then remains open for other EssCS commands until the session is closed with a logout command.
The login syntax suggests the inclusion of the port number in the URL, but I had no success with this…although it worked without the port reference:
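A session might therefore look something like the following; note that the parameter names shown are assumptions rather than a definitive reference, so check the utility's built-in help on your version:

```
# log in to the OAC Essbase instance (URL shown without the port reference)
esscs login -url https://<oac-essbase-instance>/essbase -u <user> -p <password>

# confirm the session with another command, eg version
esscs version

# end the session
esscs logout
```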
As above, the connection is made and is verified by the successful running of another command (eg version), but the logout command produced an error. Despite this, the logout appeared successful – no other EssCS commands worked until a login was re-issued.
With EssCS installed and working, the Listfiles and Upload facilities become available. The function of these tools is pretty obvious from the name. Listfiles should be issued with at least arguments for the application and cube name:
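For example (the parameter names are again assumptions to be checked against the utility's help):

```
# list all files held for the ASOSamp.Basic cube
esscs listfiles -application ASOSamp -db Basic
```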
The file type (csc, rul, txt, msh, xls, xlsx, xlsm, xml, zip, csv) can be included as an additional argument…
…although the file types are a fixed list – for example, you don't seem to be able to use a wildcard to pick up all spreadsheet files.
Whilst there is an Upload (and Download) facility, there does not seem to be the means to delete a remote file…which is a bit of an inconvenience, because using Upload to upload a file that already exists results in an error, and there is no overwrite option. The dataload.txt and dataload.rul files previously uploaded via the OAC front end were therefore manually deleted via OAC, and verified using Listfiles.
The files were then uploaded back to OAC using the Upload option of EssCS:
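Something along these lines (parameter names assumed, as before):

```
# re-upload the data and rules files to the ASOSamp.Basic directory
esscs upload -application ASOSamp -db Basic -file dataload.txt
esscs upload -application ASOSamp -db Basic -file dataload.rul
```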
As you would expect, the files will then appear both in a Listfiles command and via OAC:
Note that the file list in OAC does not refresh with a browser page refresh or any ‘sort’ operation: use Refresh under Actions as above.
With the files now re-uploaded, the data can be loaded. EssCS also contains a DataLoad command, but unfortunately there appears to be no means to specify a rules file – meaning it would seem to be confined to overwrite, ‘export data’ style imports only:
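So an export-format file can be loaded with something like the following (parameter names assumed):

```
# load a previously uploaded, export-format data file -- no rules file can be specified
esscs dataload -application ASOSamp -db Basic -file ASOSampleData.txt
```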
A good point here is that a DataLoad EssCS command makes an entry in the Jobs table, so success / record counts can be confirmed:
Summary
This post details three methods of loading data to Essbase under OAC:
- Via the formatted template spreadsheet (on import or from Cube Designer)
- Via the Command Line Interface
- Via the Jobs facility of OAC
There are some minor differences between them, which may affect which one you choose for a particular scenario.
Arguably, given the availability of MAXL, there is a further custom method available as the actual data load can be effected that way too. This will be explored further in the next post that will start to consider how these tools might be used for real scenarios.
Exploring the Rittman Mead Insights Lab
What is our Insights Lab?
The Insights Lab offers on-demand access to an experienced data science team, using a mature methodology to deliver one-off analyses and production-ready predictive models.
Our Data Science team includes physicists, mathematicians, industry veterans and data engineers ready to help you take analytics to the next level while providing expert guidance in the process.
Why use it?
Data is cheaper to collect and easier to store than ever before. But collecting the data is not synonymous with getting value from it. Businesses need to do more with the same budget and are starting to look into machine learning to achieve this.
These processes can take off some of the workload, freeing up people's time to work on more demanding tasks. However, many businesses don't know how to get started down this route, or even if they have the data necessary for a predictive model.
R
Our data science team primarily works using the R programming language. R is an open-source language supported by a large community.
The functionality of R is extended by many community written packages which implement a wide variety of statistical and graphical techniques, including linear and nonlinear modeling, statistical tests, time-series analysis, classification, clustering as well as packages for data access, cleaning, tidying, analysing and building reports.
All of these packages can be found on the Comprehensive R Archive Network (CRAN), making it easy to get access to new techniques or functionalities without needing to develop them yourself (all the community written packages work together).
R is not only free and extensible, it also works well with other technologies, making it an ideal choice for businesses who want to start looking into advanced analytics. Python is an obvious alternative, and several of our data scientists prefer it. We're happy to use whatever our clients' teams are most familiar with.
Experienced programmers will find R syntax easy enough to pick up and will soon be able to implement some form of machine learning. However, for a detailed introduction to R and a closer look at implementing some of the concepts mentioned below we do offer a training course in R.
Our Methodology
Define
Define a Question
Analytics, for all intents and purposes, is a scientific discipline and as such requires a hypothesis to test. That means having a specific question to answer using the data.
Starting this process without a question can lead to biases in the produced result. This is called data dredging - testing huge numbers of hypotheses about a single data set until the desired outcome is found. Many other forms of bias can be introduced accidentally; the most commonly occurring will be outlined in a future blog post.
Once a question is defined, it is also important to understand which aspects of the question you are most interested in. Associated with this is the level of uncertainty or error that can be tolerated if the result is to be applied in a business context.
Questions can be grouped into a number of types. Some examples will be outlined in a future blog post.
Define a dataset
The data you expect to be relevant to your question needs to be collated. Supplementary data may also be needed, perhaps drawn from other databases or from web scraping.
This data set then needs to be cleaned and tidied. This involves merging and reshaping the data as well as possibly summarising some variables. For example, removing spaces and non-printing characters from text and converting data types.
The data may be in a raw format; there may be errors in the data collection, or corrupt or missing values that need to be managed. These records can either be removed completely or replaced with reasonable default values, determined by which makes the most sense in this specific situation. If records are removed you need to ensure that no selection biases are being introduced.
All the data should be relevant to the question at hand, anything that isn't can be removed. There may also be external drivers for altering the data, such as privacy issues that require data to be anonymised.
Natural language processing could be implemented for text fields. This takes bodies of text in human readable format such as emails, documents and web page content and processes it into a form that is easier to analyse.
Any changes to the dataset need to be recorded and justified.
Model
Exploratory Analysis
Exploratory data analysis involves summarising the data, investigating the structure, detecting outliers / anomalies as well as identifying patterns and trends. It can be considered as an early part of the model production process or as a preparatory step immediately prior. Exploratory analysis is driven by the data scientist, enabling them to fully understand the data set and make educated decisions; for example the best statistical methods to employ when developing a model.
The relationships between different variables can be understood and correlations found. As the data is explored, different hypotheses could be found that may define future projects.
Visualisations are a fundamental aspect of exploring the relationships in large datasets, allowing the identification of structure in the underlying dataset.
This is also a good time to look at the distribution of your dataset with respect to what you want to predict. This often provides an indication of the types of models or sampling techniques that will work well and lead to accurate predictions.
Variables with very few instances (or those with small variance) may not be beneficial, and in some cases could even be detrimental, increasing computation time and noise. Worse still, if these instances represent an outlier, significant (and unwarranted) value may be placed on these leading to bias and skewed results.
Statistical Modelling/Prediction
The data set is split into two sub-groups, "Training" and "Test". The training set is used only in developing or "training" a model, ensuring that the data it is tested on (the test set) is unseen. This means the model is tested in a more realistic context and helps to determine whether the model has overfitted to the training set, i.e. is fitting random noise in addition to any meaningful features.
Taking what was learned from the exploratory analysis phase, an initial model can be developed based on an appropriate application of statistical methods and modeling tools. There are many different types of model that can be applied to the data; the best tends to depend on the complexity of your data and any relationships that were found in the exploratory analysis phase. During training, the models are evaluated in accordance with an appropriate metric, the improvement of which is the "goal" of the development process. The predictions produced from the trained models when run on the test set will determine the accuracy of the model (i.e. how closely its predictions align with the unseen real data).
A particular type of modelling method, "machine learning", can streamline and improve upon this somewhat laborious process by defining models in such a way that they are able to self-optimise, "learning" from past iterations to develop a superior version. Broadly, there are two types: supervised and unsupervised. A supervised machine learning model is given some direction from the data scientist as to the types of methods it should use and what it is expecting. Unsupervised machine learning, on the other hand, as the name suggests, involves giving the model less information to start with and letting it decide for itself what to value and how to approach the problem. This can help to remove bias and reduce the number of assumptions made, but will be more computationally intensive as the model has a broader scope to investigate. Usually supervised machine learning is employed where the problem and data set are reasonably well understood, and unsupervised machine learning where this is not the case.
Complex predictive modelling algorithms perform feature importance and selection internally while constructing models. These models can also report on the variable importance determined during the model preparation process.
Peer Review
This is an important part of any scientific process, and effectively utilises our broad modelling expertise at Rittman Mead. It enables us to be sure no biases were introduced that could lead to a misleading prediction, and that the accuracy of the models is what could be expected if the model were run on new, unseen data. Additional expert views can also lead to alternative potential avenues of investigation being identified as part of an expanded or subsequent study.
Deploy
Report
For a scientific investigation to be credible the results must be reproducible. The reports we produce are written in R markdown and contain all the code required to reproduce the results presented. This also means it can be re-run with new data as long as it is of the same format. A clear and concise description of the investigation from start to finish will be provided to ensure that justification and context is given for all decisions and actions.
Delivery
If the result is of the required accuracy we will deploy a model API enabling customers to start utilising it immediately.
There is always a risk however that the data does not contain the required variables to create predictions with sufficient confidence for use. In these cases, and after the exploratory analysis phase there may be other questions that would be beneficial to investigate. This is also a useful result, enabling us to suggest additional data to collect that may allow a more accurate result should the process be repeated later.
Support
Following delivery we are able to provide a number of support services to ensure that maximum value is extracted from the model on an on-going basis. These include:
- Monitoring performance and accuracy against the observed, actual values over a period of time. Should discrepancies between these values arise, they can be used to identify the need for alterations to the model.
- Exploring specific exceptions to the model. There may be cases in which the model consistently performs poorly. Instances like these may not have existed in the training set and the model could be re-trained accordingly. If they were in the training set these could be weighted differently to ensure a better accuracy, or could be represented by a separate model.
- Updates to the model to reflect discrepancies identified through monitoring, changes of circumstance, or the availability of new data.
- Many problems are time dependent and so model performance is expected to degrade, requiring retraining on more up to date data.
Summary
In conclusion, our Insights Lab has a clearly defined and proven process for data science projects that can be adapted to fit a range of problems.
Contact us to learn how Insights Lab can help your organization get the most from its data, and schedule your consultation today.
Contact us at info@rittmanmead.com