From Machine Learning to Neural Networks

Friday, 28th August 2015

During August, Patrick Hall, Data Scientist at SAS in North Carolina, hosted a tour around Australia sharing his experiences in Machine Learning and Neural Networks.

Below is a synopsis of his presentation prepared by member Stephen Simmonds.

Data can be an arcane business. To the outside world, an interest in data is tantamount to nerdiness. But to those of us in the know, mucking around in data can be endlessly fascinating - and it's easy to get inspired by the enthusiasm of a data scientist.

Hall started with an illustration of the trends in data capacity: disk capacity has grown faster than transfer speeds, so the time taken to transfer a whole diskful of data keeps increasing - hence the interest in storage clusters and distributed processing.
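
To make the trend concrete, here is a rough back-of-the-envelope sketch in Python; the drive figures below are illustrative round numbers, not figures from the talk:

```python
# Illustrative round numbers: capacity has grown far faster than
# sequential transfer speed, so reading a whole disk takes ever longer.
drives = [
    ("1990s drive", 2e9, 5e6),     # ~2 GB at ~5 MB/s
    ("2000s drive", 2e11, 5e7),    # ~200 GB at ~50 MB/s
    ("2010s drive", 4e12, 1.5e8),  # ~4 TB at ~150 MB/s
]
for name, capacity_bytes, rate_bytes_per_s in drives:
    hours = capacity_bytes / rate_bytes_per_s / 3600
    print(f"{name}: {hours:.1f} hours to read the full disk")
```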

This brings new issues and demands new ways of thinking. Sorting, for example, becomes much more problematic across distributed storage, and is not a common operation on raw big data. And despite ever-increasing storage capacity per dollar, more infrastructure is needed overall.

Big data is heading into the trough of disillusionment on Gartner's Hype Cycle. Yet the Internet of Things - in which such volumes of data stream from high-availability sensors - sits unmoved this year at the peak of Gartner's hype curve. Attendant challenges of the IoT include capturing and aggregating the data in a timely way (sometimes from literally millions of sensors in high-tech factories), and dealing with sensors that are sometimes imperfect, leaving gaps in those streams.
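
As one hedged illustration of the gap problem, a common first step is to resample a sensor stream to a fixed interval and interpolate across the holes. The sketch below uses pandas, with made-up timestamps and readings:

```python
import pandas as pd

# Made-up temperature readings with a two-minute hole in the stream.
readings = pd.Series(
    [21.0, 21.4, 22.1, 21.8],
    index=pd.to_datetime([
        "2015-08-28 10:00", "2015-08-28 10:01",
        "2015-08-28 10:04", "2015-08-28 10:05",  # 10:02-10:03 missing
    ]),
)
regular = readings.resample("1min").mean()   # gaps appear as NaN
filled = regular.interpolate(method="time")  # linear fill across the gap
print(filled)
```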

Despite current adoption being lower than Gartner anticipated, Hall "certainly thinks Hadoop technology is the way of the future". But "it needs to be cheaper, it needs to be faster"; many have tried it but decided to wait.

Generally, the work of the data scientist involves refining algorithms to predict outcomes from future data using an initial set of training data. The very advent of big data has made this more feasible, and more meaningful: "we finally have tools, and plenty of data - ideally enough data that justifies doing something other than regression."

Machine learning emerged from computer science, and involves working with, and refining, algorithms that learn from data to make predictions. It can involve varying degrees of human supervision or involvement in the decision-making. In supervised learning, the data scientist directs the course of the modelling, knowing what the outcomes should be for the training data. This can involve simple regression models or classification - categorisation into yes/no, buy/don't-buy decisions. Unsupervised learning uses more complex models, such as clustering and PCA (Principal Component Analysis)/feature selection; in this case (such as when categorising into clusters), the outcomes aren't known. Semi-supervised learning involves a bit of both, with some of the outcomes in the training data known.
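
A minimal sketch of the three settings, using scikit-learn (the data and model choices below are illustrative, not from the talk):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.semi_supervised import SelfTrainingClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Supervised: the outcomes y are known, so a classifier is fit to them.
clf = LogisticRegression().fit(X, y)

# Unsupervised: no outcomes; clustering and PCA find structure in X alone.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
X_2d = PCA(n_components=2).fit_transform(X)

# Semi-supervised: only some outcomes are known (-1 marks unlabelled rows).
y_partial = y.copy()
y_partial[100:] = -1
semi = SelfTrainingClassifier(LogisticRegression()).fit(X, y_partial)
```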

Neural networks (Hall's particular area of interest) are a family of machine learning models that approximate functions of a large number of inputs, something like a brain's network of neurons. They are used for supervised machine learning, and typically have one or two hidden layers: the first layer holds the original data, the second a transformed set, the third transformed again - each transform yielding a smaller set of data than the one before. Deep learning is a subset of neural networks with more than two hidden layers - and thus more complex outcomes. Accuracy can be pushed just beyond human-level performance.

By looking for regularities (or patterns) in the data, each layer acts to summarize the data in the previous layer, again resulting in a smaller but more focused dataset.
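
The shrinking-layer idea can be sketched with scikit-learn's multi-layer perceptron; the layer sizes below are illustrative choices, not from the talk:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)  # 64 pixel features per image
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 64 inputs -> 32 hidden units -> 8 hidden units -> 10 classes: each
# hidden layer is a smaller, transformed summary of the layer before it.
net = MLPClassifier(hidden_layer_sizes=(32, 8), max_iter=1000, random_state=0)
net.fit(X_train, y_train)
print("test accuracy:", net.score(X_test, y_test))
```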

A typical example of deep learning is face recognition. From the data in the first layer, successive layers are built up: from the initial pixels through edge detection to feature recognition to faces, thence to identifying individuals. For comparison/identification purposes, the factors of individual faces can be condensed to a two-dimensional vector that can be graphed as a scatter plot, with proximity indicating likeness.
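
That last step - condensing face descriptors to two dimensions for a scatter plot - can be sketched as follows. The 512-dimensional descriptors here are random stand-ins for real face features, and the per-person clusters are fabricated for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Three hypothetical individuals, five stand-in descriptors each,
# scattered around a per-person centre in 512-dimensional space.
centres = rng.normal(size=(3, 512))
descriptors = np.vstack([c + 0.1 * rng.normal(size=(5, 512)) for c in centres])
person = np.repeat([0, 1, 2], 5)

# Condense to two dimensions and plot: nearby points suggest likeness.
coords = PCA(n_components=2).fit_transform(descriptors)
plt.scatter(coords[:, 0], coords[:, 1], c=person)
plt.title("Face descriptors condensed to 2-D")
plt.show()
```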

Hall's discussion focused on what he thought particularly important in the emerging data science: 'Machine Learning for X' (what I would call Applied Machine Learning). In statistics there's a lot of assumption-checking, but the results are readily interpretable. Hall: "To me, machine learning is fewer assumptions, less interpretability, greater accuracy".

Examples given of existing applications of machine learning included:

  • Security: face recognition - as mentioned above;
  • Health Care: epidemiology; predicting hospital re-admissions (a big issue in the US, where government payments to hospitals may stop if re-admissions for the same complaint happen too soon); Electronic Medical Records (introduced with some passivity in Australia) are also a treasure trove, albeit with likely quality issues;
  • Asset Protection: where a given asset is very costly to replace - or costly when offline - such as airplane engines, MRI machines, and wind farms, very close scrutiny brings great rewards;
  • Manufacturing: to lower defect rates beyond that achievable by human inspection;
  • Government: with certain government agencies so underfunded, automation is sometimes a necessity - for example, policing regulations by interpreting the height of child safety fences from shadows cast in satellite photos;
  • Energy: for example, pattern recognition in the early detection of burst pipes - signs of imminent rupture can be detected well before human observation would help.

Ideally, human resources are focused on the most important parts of business processes, leaving to machine learning anything that can be done autonomously. Business processes can become so finely calibrated, or so complex, that they are not well governable by people.

There are a couple of evident downsides to this. Hall periodically mentioned the privacy issues associated with technology that aims to identify. One example given was a company that planned to scour Facebook photos to infer a birthday event, and thus a marketing opportunity; face recognition ipso facto presents potential for privacy abuses.

The other problematic issue is the very abstraction of humans from decision-making processes. Hall's response: don't be afraid - yet.

As a post-script, in comparing the data science environment in Australia favourably with that of the US, Hall noted an unexpected phenomenon in India. Although large, high-revenue companies have the best opportunities for deep analysis, those same companies showed less desire to investigate open source options - perhaps because open source solution providers are less likely than the larger proprietary vendors to have offices in India.

If you're of a mind for more technical definitions, you can read the Wikipedia entries on neural networks and machine learning; however, you may get more value out of Patrick Hall's own answer to a question posed on deep learning vs neural networks, at Quora.com, particularly the hourglass illustration of the layers of transformations in deep learning.

For more reading please review:

The Use of Open Source is Growing

Machine Learning With SAS® Enterprise Miner
