With COVID-19, many people were affected by the economic crisis and lost their jobs. In Portugal alone, unemployment increased by 30% between February and September! AI can be a powerful tool for allocating scarce resources more efficiently. Inspired by the DSSG Fellowship’s project in partnership with IEFP (Instituto do Emprego e Formação Profissional), we started thinking about how we could help reduce unemployment using AI.
Similar to DSSG’s project, the goals that will be discussed here are:
Better identify individuals at high risk of long-term unemployment;
Support more efficient allocation of the employment/training institute’s resources to respond to the needs of unemployed individuals.
This is a summary of an internal, non-exhaustive discussion, merely for learning purposes – there are multiple solutions, all of which depend on the development time and the data you can access.
Data and Relevant Patterns
Let’s assume that we can gather data when the unemployed person registers on the website platform, and that all job-related interactions are stored until the candidate finds a job. For simplification purposes, we can build our dataset as a list of monthly snapshots of all platform users’ characteristics, where each user has a unique identifier.
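As a rough illustration, a few rows of such a snapshot table could look like the sketch below (all column names and values are hypothetical, just to make the structure concrete):

```python
import pandas as pd

# Hypothetical monthly snapshot table: one row per (candidate, month).
# Columns and values are illustrative, not a real schema.
snapshots = pd.DataFrame([
    {"user_id": 1, "snapshot_month": "2019-05", "age": 34, "education": "BSc",
     "months_unemployed": 2, "interviews_last_month": 1, "months_until_job": 5},
    {"user_id": 1, "snapshot_month": "2019-06", "age": 34, "education": "BSc",
     "months_unemployed": 3, "interviews_last_month": 0, "months_until_job": 4},
    {"user_id": 2, "snapshot_month": "2019-07", "age": 51, "education": "HS",
     "months_unemployed": 8, "interviews_last_month": 2, "months_until_job": 12},
])
print(snapshots)
```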
What raw information would be interesting to have?
Candidate
Demographics (Age, Gender, Marital Status, Address)
List of previous courses and education level
Curriculum Vitae
Years of Experience
Maximum job distance to home
Schedule in which they can work
Number of dependents
Disabilities
Languages
Skills
How can we encode, in a comparable way, the information from a person with 10 previous courses vs. one with only 2?
Encoding Options for List of Previous Courses
Label Encoding: We can represent each course as a number. But this means that course k is at a higher distance from course k – 5 than from course k – 1, which might not be the best way of representing this.
One Hot Encoding: We can represent each course as a new column. With many distinct courses, the dimensionality becomes too high and the representation very sparse.
Content Based: This approach uses a domain-specific representation, where we encode the courses by several areas of expertise (e.g. Maths, Chemistry, Programming, etc.). The values can either be binary (whether or not the course covers that area) or the candidate’s average grade in those areas (see the sketch after this list).
Embedding Based: From a list of courses, train a vector representation (embedding) from scratch or based on content found in external sources so that similar courses have lower distances between them. We can then use a model that’s able to deal with varying input shapes (e.g. RNN, CNN).
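As a minimal sketch of the Content Based option (the course-to-area mapping and area names below are made up), each candidate’s course list collapses into a fixed-size vector, regardless of whether they took 2 or 10 courses:

```python
import numpy as np

# Hypothetical mapping from each course to the areas of expertise it covers.
AREAS = ["maths", "chemistry", "programming"]
COURSE_AREAS = {
    "statistics_101": {"maths"},
    "organic_chem":   {"chemistry"},
    "python_basics":  {"programming", "maths"},
}

def encode_courses(courses):
    """Content-based encoding: one fixed-size vector regardless of list length."""
    vec = np.zeros(len(AREAS))
    for course in courses:
        for area in COURSE_AREAS.get(course, set()):
            vec[AREAS.index(area)] = 1.0
    return vec

# A candidate with 2 courses and one with 10 get vectors of the same shape.
print(encode_courses(["statistics_101", "python_basics"]))
```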
Encoding Options for Curriculum Vitae (CV)
The candidate’s CV is also rich in data – we can use several NLP techniques to extract features from it, such as a Bag of Words or the average of the word embeddings (which has the advantage of providing a semantic representation of words).
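A minimal sketch of the Bag of Words route with scikit-learn, assuming the CV text has already been extracted to plain strings (the example CVs are invented):

```python
from sklearn.feature_extraction.text import CountVectorizer

cv_texts = [
    "data analyst with 5 years of experience in sql and reporting",
    "warehouse operator, forklift certified, available for shift work",
]

# Bag of Words: each CV becomes a sparse vector of word counts.
vectorizer = CountVectorizer(max_features=5000, stop_words="english")
cv_features = vectorizer.fit_transform(cv_texts)
print(cv_features.shape)  # (n_candidates, vocabulary_size)
```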
Note on fairness: We could have added a feature related to the candidate’s monthly expenses, as a way to estimate how much money they would need per month. But this could create a negative feedback loop – if a candidate has low expenses because they cannot afford a better standard of living, the model could end up allocating them to offers with low salaries.
Context
Unemployment History (e.g. total number of months since last job, statistics of unemployment time in the past – max, mean, etc)
Search Efforts (such as the total number of interviews, number of interviews above the minimum required threshold, percentage of recommended interviews attended)
Search Consistency (e.g. standard deviation of days between interviews and of days between applications). Note that these features might be influenced by job offer scarcity, so they can end up introducing unwanted biases into our model!
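A sketch of how some of these context features could be derived, assuming a hypothetical interview log with one row per attended interview:

```python
import pandas as pd

# Hypothetical interview log: one row per interview attended by a candidate.
interviews = pd.DataFrame({
    "user_id": [1, 1, 1, 2],
    "interview_date": pd.to_datetime(
        ["2020-03-02", "2020-03-20", "2020-05-11", "2020-04-07"]),
})

features = (
    interviews.sort_values("interview_date")
    .groupby("user_id")["interview_date"]
    .agg(
        n_interviews="count",
        # Search consistency: spread of the gaps (in days) between interviews.
        days_between_std=lambda s: s.diff().dt.days.std(),
    )
)
print(features)
```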
Offers (Job/Training)
Offer Name
Offer Areas (e.g. Maths, Science, Programming, …)
Offer Remuneration
Requirements
Offered shifts
Location
Ideally, we would represent the offer list in a domain comparable to the candidate’s expertise, so our features could capture the similarity between the offer and the candidate’s skills. As such, offer areas would also be represented with the Content Based approach.
If we were to use a model that cannot learn this relationship directly (e.g. tree-based models), we could calculate pairwise features, such as the difference between the offer remuneration and the candidate’s monthly expenses, or the intersection between the candidate’s and the offer’s areas.
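A sketch of such pairwise features for a single (candidate, offer) pair, assuming both sides are already encoded with the same area vocabulary (all names below are illustrative):

```python
def pairwise_features(candidate, offer):
    """Hand-crafted candidate-offer comparison features for tree-based models."""
    candidate_areas = set(candidate["areas"])
    offer_areas = set(offer["areas"])
    shared = candidate_areas & offer_areas
    return {
        # Overlap between the candidate's expertise and the offer's areas.
        "n_shared_areas": len(shared),
        "share_of_offer_areas_covered": len(shared) / max(len(offer_areas), 1),
        # How much closer the offer is than the candidate's maximum acceptable distance.
        "distance_slack_km": candidate["max_distance_km"] - offer["distance_km"],
    }

candidate = {"areas": ["maths", "programming"], "max_distance_km": 30}
offer = {"areas": ["programming", "chemistry"], "distance_km": 12}
print(pairwise_features(candidate, offer))
```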
Modeling
After having all these features, we can create a model that, given a candidate snapshot in a month and a job offer, returns an employability score. There are several ways this employability score can be modeled, all of which can be tested; the choice should depend on how the model is intended to be used in production:
Months until a job is found: allows us to determine for how many more months the candidate will need to receive a monthly allowance from the government. Therefore, we can prioritize cases according to this prediction.
Number of interviews until the candidate is accepted: Doesn’t tell us anything about the timeframe, since the candidate can do several interviews in a month or no interviews for several months.
Multiclass risk classification, from Low to High, depending on a threshold (this is what was done in the DSSG Fellowship): is not as actionable as the previous ones. However, a model might learn these patterns more easily, as the problem is reduced to an ordinal classification.
Let’s assume, for now, the model gives us the months until the candidate finds a job.
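With that target, a minimal modeling sketch could look like the following – here with synthetic stand-in data and a generic scikit-learn regressor, not the model actually used in the DSSG project:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic stand-in data: one row per (candidate snapshot, offer) pair,
# built from the encoded candidate, offer, and pairwise features above.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))          # encoded feature vectors
y = rng.integers(0, 24, size=500)      # months until a job was found

model = GradientBoostingRegressor()
model.fit(X, y)
predicted_months = model.predict(X[:5])
print(predicted_months)
```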
Reliable estimation of the model performance
We can evaluate the model performance using regression metrics such as the Mean Absolute/Squared Error and the Spearman/Pearson correlation coefficients between the target and the predicted value.
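For instance, with scikit-learn and SciPy (the arrays below are just toy values):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([3, 7, 12, 1, 5])   # actual months until employment
y_pred = np.array([4, 6, 10, 2, 5])   # model predictions

print("MAE:", mean_absolute_error(y_true, y_pred))
print("MSE:", mean_squared_error(y_true, y_pred))
print("Spearman:", spearmanr(y_true, y_pred).correlation)
print("Pearson:", pearsonr(y_true, y_pred)[0])
```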
Time is an important component in our learning system and, as such, we must pay attention to the way we split our data for performance estimation. Random splits might “leak” information into training, leading to overestimates of the model performance.
As we’re building a dataset with several rows for the same candidate in different months, if the same candidate (say, candidate A) appeared in both train and test, but in different months, the training rows would already tell us whether and when they found a job.
An initial approach is to group the train/test split by candidate (e.g., candidate A goes to train and candidate B to test).
However, this approach still has the issue of time leakage. Imagine that you were training the model with data from October 2020. You know that, since a big shift in employment started in March 2020, the average value of “months until employment” has increased – so you could also get an overestimation of the model’s performance on predictions from earlier months.
A possible solution is to do both temporal and grouped stratification: for instance, train with a list of applicants from the previous year, and test with other applicants from the current year. In the above example, you could train with data from 2019, from a candidate list that does not include A and B, and test on 2020 data for candidates A and B.
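A minimal sketch of such a temporal plus grouped split, assuming a snapshot table like the hypothetical one sketched earlier:

```python
import pandas as pd

snapshots = pd.DataFrame({
    "user_id":        [1, 1, 2, 3],
    "snapshot_month": ["2019-05", "2019-06", "2019-07", "2020-02"],
})

def temporal_group_split(df, cutoff="2020-01"):
    """Train on snapshots before the cutoff; test on later snapshots.
    Candidates appearing in the test period are excluded from training."""
    test = df[df["snapshot_month"] >= cutoff]
    train = df[
        (df["snapshot_month"] < cutoff)
        & (~df["user_id"].isin(test["user_id"]))
    ]
    return train, test

train, test = temporal_group_split(snapshots)
print(train["user_id"].unique(), test["user_id"].unique())
```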
Decision-Making
We can optimize the employment institute’s resources by calculating the model’s prediction for each offer in a list of available job/training offers and using a cost function that tells us how good each offer is for a given candidate. We will not discuss this extensively in this blog post.
Imagine, for instance, we want to create an email marketing campaign where we send a list of K offers to each candidate. We can decide between (at least) two types of actions:
– Improving the candidate’s skills (e.g., courses).
– Improving the candidate’s exposure to jobs (e.g., interviews).
This is an assignment problem with a huge number of possible combinations and budget/time constraints.
Two possible cost functions for this problem are:
argmax [f(Candidate) – f(Candidate + Offer)]: gives us the reduction in the number of months until the candidate finds a job, for a given offer. We then want to find the offer that maximizes this reduction.
argmax [(f(Candidate) – f(Candidate + Offer)) × UnemploymentAllowance – Costs(Offer)]: also considers the processing costs related to a given offer and incorporates business metrics, such as the money that would be required to maintain that candidate’s allowance.
This can then be optimized efficiently, for instance, with the use of metaheuristics, if the list of possible options is very large.
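As an illustration of the simplest case – each candidate gets at most one offer and the predicted reductions are random stand-in values – the assignment can be solved exactly with the Hungarian algorithm rather than metaheuristics:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
n_candidates, n_offers = 5, 8

# benefit[i, j]: predicted reduction in "months until employment" if candidate i
# receives offer j, i.e. f(Candidate) - f(Candidate + Offer) from the model above.
benefit = rng.uniform(0, 6, size=(n_candidates, n_offers))

# Hungarian algorithm: find the one-offer-per-candidate assignment
# that maximizes the total predicted benefit.
cand_idx, offer_idx = linear_sum_assignment(benefit, maximize=True)
for c, o in zip(cand_idx, offer_idx):
    print(f"candidate {c} -> offer {o} (expected gain: {benefit[c, o]:.1f} months)")
```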
There are some ethical concerns with such cost functions:
Are we minimizing the average time for employment at the expense of offering jobs with lower salaries?
Will optimizing this leave some people in “starvation” of job offers just because it’s harder to find good offers for them?
To account for this, we can include extra factors in the loss function, such as a multi-objective search: imposing time constraints for finding jobs for everyone while, at the same time, reducing the average time people spend without a job.
Conclusion
This blog post was written after an internal discussion on this topic. Obviously, it covers only a small fraction of what could be done. If you’re interested in having such discussions for a specific business problem you have, make sure to contact us!