What is machine learning (ML)?
Machine learning is the semi-automated extraction of knowledge from data.
- Knowledge from data starts with a question that might be answerable using data.
- Automated extraction via data mining tools provides useful data sets to drive machine learning.
- ML is semi-automated, meaning smart decisions by a human are needed at various points in the data pipeline.
Find the original article here: Intro to machine learning with scikit-learn
Supervised learning is the first category of ML
Making predictions using data
Example: When given a set of emails, can we predict if an email is “SPAM” or “Content”?
There is a specific outcome we are trying to predict
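To make the spam example concrete, here is a minimal sketch in plain Python. The four training emails and the word-overlap scoring are invented for illustration; a real spam filter would use far more data and a proper library model such as those in scikit-learn.

```python
from collections import Counter

# Hypothetical labeled emails (text, label), invented for illustration.
training_emails = [
    ("win a free prize now", "SPAM"),
    ("free money claim your prize", "SPAM"),
    ("meeting agenda for monday", "Content"),
    ("please review the project report", "Content"),
]

def train(examples):
    """Count how often each word appears under each label."""
    word_counts = {}
    for text, label in examples:
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    return word_counts

def predict(word_counts, text):
    """Pick the label whose training words overlap most with the email."""
    words = text.lower().split()
    scores = {
        label: sum(counts[w] for w in words)
        for label, counts in word_counts.items()
    }
    return max(scores, key=scores.get)

model = train(training_emails)
prediction = predict(model, "claim your free prize")
print(prediction)
```

The point is the shape of the problem: every training email carries a known outcome, and the model is asked for that same kind of outcome on an email it has not seen.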
Unsupervised learning is the second category of ML
This is about extracting structure out of data
Example: Segment grocery store shoppers into clusters that exhibit similar behaviors. There is no right answer.
At JobGetter.com we use both techniques to segment job postings into clusters based on different roles such as Barista, Waiter or Chef.
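As a rough sketch of what "extracting structure" can look like, the snippet below groups shoppers with a hand-written k-means loop. The choice of k-means, the two features (visits per month, average basket size) and the data points are all assumptions made for illustration, not a description of how JobGetter.com clusters its postings.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then re-average."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: math.dist(p, centroids[i]),
            )
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its cluster;
        # keep the old centroid if a cluster ends up empty.
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical shoppers: (visits per month, average basket in dollars)
shoppers = [(2, 20), (3, 25), (2, 30), (12, 90), (14, 100), (13, 95)]

# Starting centroids are picked by hand here to keep the run deterministic.
centroids, clusters = kmeans(shoppers, centroids=[shoppers[0], shoppers[3]])

for centroid, members in zip(centroids, clusters):
    print(centroid, members)
```

Notice that no labels are supplied anywhere. The algorithm only groups points that sit close together, and it is up to a human to decide whether the resulting segments are meaningful.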
How does machine learning “work”?
- First, train a machine learning model using data labeled with the outcome
- The “machine learning model” learns the relationship between the attributes of the data and its outcome
- At JobGetter.com we have a human-categorized list of more than 100,000 roles and skills that allows us to easily train our machine learning models
- Second, make predictions on new data for which the label is unknown
- The primary goal of supervised learning is to build a model that “generalizes”: it accurately predicts the future rather than the past.
- We utilize our past manual categorization of roles, skills (hard & soft) and qualifications to predict values for new job postings as they hit our system.
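The two steps above, plus a quick check on generalization, can be sketched as follows. Everything here is hypothetical: the postings, the role labels, and the choice of a 1-nearest-neighbour model with Jaccard word overlap are stand-ins for illustration, not JobGetter.com's actual pipeline.

```python
# Hypothetical human-categorized postings: (posting text, role label).
labeled_postings = [
    ("espresso coffee barista", "Barista"),
    ("coffee shop barista latte", "Barista"),
    ("barista espresso latte art", "Barista"),
    ("restaurant waiter table service", "Waiter"),
    ("waiter serving tables dinner", "Waiter"),
    ("table service waiter restaurant evenings", "Waiter"),
]

def jaccard(a, b):
    """Share of words two texts have in common (0.0 to 1.0)."""
    a, b = set(a.split()), set(b.split())
    return len(a & b) / len(a | b)

def predict_role(train, text):
    """1-nearest-neighbour: copy the label of the most similar training posting."""
    _, best_label = max(train, key=lambda example: jaccard(example[0], text))
    return best_label

# Step 1: "train" on labeled data, holding some postings back.
train_set = labeled_postings[:2] + labeled_postings[3:5]
test_set = [labeled_postings[2], labeled_postings[5]]

# Check generalization on postings the model never saw during training.
correct = sum(predict_role(train_set, text) == label for text, label in test_set)
accuracy = correct / len(test_set)
print("held-out accuracy:", accuracy)

# Step 2: predict the label for a brand-new posting with no label.
new_role = predict_role(train_set, "evening waiter for busy restaurant")
print(new_role)
```

Scoring the model on held-out postings rather than on the ones it was trained on is what tells you whether it predicts "the future rather than the past". With only two test postings the number is not meaningful in itself; it only shows the mechanics.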
Watch the full video below or read the Intro to Machine Learning article to hear more about:
- How do I choose which attributes of my data to include in the model?
- How do I choose which model to use?
- How do I optimize this model for best performance?
- How do I ensure that I’m building a model that will generalize to unseen data?
- Can I estimate how well my model is likely to perform on unseen data?