Crafting the Next Generation ML Engineering Assessment

Alex Adam
11 min read · Mar 23, 2024

Hiring machine learning engineers is no easy task. A quick look at job postings on LinkedIn reveals that in today's job market, hundreds of applications are submitted for roles even at early-stage startups. This abundance of options from the employer's perspective seems desirable at first, but of course it's not feasible to interview everybody and make a fully informed decision. Sorting through these applicants to find the ones best suited for a particular role is where the challenge lies. There are many tools available to help narrow a set of hundreds of resumes down to a few dozen based on a variety of criteria, some of which even use NLP to find the best resumes based on the job description. Still, interviewing all of the remaining candidates may not be the best use of your organization's resources.

I was recently tasked with designing a solution to this problem for ML hiring at GPTZero. The status quo assessment we had in place was a timed, hour-long online assessment consisting of a Leetcode-style question, a basic ML implementation question, and a few multiple choice conceptual questions. After dozens of first-round interviews, it became clear that this assessment was not acting as a filter for what we were looking for in a strong ML engineer. In fact, most applicants who attempted the assessment passed and went on to an interview. Noticing that our assessment was not much different from other assessments or interview questions I had completed during my own job search, I realized that this may be an industry-wide problem.

The goal of this post is to give engineering orgs a new way of assessing ML candidates, leading to better decisions about who to interview, and ultimately who to hire. As more teams begin to adopt this kind of approach, those looking for jobs will no longer have to waste their time doing rote memorization of solutions to algorithmic problems that they are unlikely to encounter throughout their role. The first half of the post will discuss the limitations of the initial ML assessment we had implemented, and the second will discuss how I went about designing the new one, giving practical advice throughout.

Limitations of the Previous Assessment

Leetcode is an unreliable proxy for ability

Computer science fundamentals, including data structures, algorithms, and time/memory complexity analysis, are important concepts for any software engineer to understand. These core principles are typically tested using a platform like Leetcode, which has thousands of questions ranging from graph traversal to dynamic programming. Twenty years ago this may have been a reliable way of testing for these core principles, but nowadays it has become subject to Goodhart's law, which states

When a measure becomes a target, it ceases to be a good measure

Put simply, there are a considerable number of applicants out there who have completed hundreds of Leetcode questions. Coupled with the emergence of books like Cracking the Coding Interview, there is a significant bias in preparation for such questions, particularly among those with sufficient time and those who want to game the system. Two risks emerge from this:

  1. Average to below-average candidates in terms of on-the-job performance will appear stellar based on their Leetcode performance
  2. Top-tier candidates who choose not to participate in this gaming of CS fundamentals assessments will appear average

Even if this kind of gaming were not an issue, it’s not trivial to choose just one or two Leetcode-style questions which cover all relevant CS concepts.

The Leetcode question in our initial ML assessment was based on string manipulation, since that is the most common type of data we deal with. We found essentially no correlation between being a strong candidate overall and doing well on this question: a clear sign that it should be replaced.

Time constrained settings hide bad coding practices

A time-constrained setting limits the ability to assess subtle but important coding best practices related to organization, naming, and formatting. When in a rush, for instance, descriptive variable names and modularity are not a priority. This means that two candidates with vastly different levels of experience could end up producing similar, undifferentiated solutions to basic ML problems. For example, the initial assessment had candidates implement a forward and backward pass from scratch for a simple model. Candidates were permitted to search the internet for this part, so it was mainly a matter of translating equations into numpy code. This equation-copying behavior meant that, for the most part, all candidates had very similar solutions, save for the occasional typo.
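For context, the kind of translation involved looks roughly like the sketch below. This is not the actual assessment question; it uses logistic regression as a hypothetical stand-in, with the forward pass and gradients written in numpy directly from their equations.

```python
import numpy as np

def forward(X, w, b):
    """Forward pass of logistic regression: sigmoid(Xw + b)."""
    z = X @ w + b
    return 1.0 / (1.0 + np.exp(-z))

def backward(X, y, y_hat):
    """Gradients of the mean binary cross-entropy loss w.r.t. w and b."""
    n = X.shape[0]
    dz = (y_hat - y) / n   # dL/dz for sigmoid + BCE
    dw = X.T @ dz          # dL/dw
    db = dz.sum()          # dL/db
    return dw, db

# One gradient-descent step on toy data
X = np.random.randn(8, 3)
y = (np.random.rand(8) > 0.5).astype(float)
w, b = np.zeros(3), 0.0
y_hat = forward(X, w, b)
dw, db = backward(X, y, y_hat)
w, b = w - 0.1 * dw, b - 0.1 * db
```

With the equations available online, nearly every submission converged on something close to this, which is exactly why the question failed to differentiate candidates.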

Constraining the possible solution space to this extent is not representative of real-world ML engineering, where there are multiple ways of achieving the same goal. Thus, it's important to evaluate this kind of problem not just on a pass/fail basis, but also to structure it such that there is room for analyzing qualitative choices. I call these choices qualitative because they do not necessarily affect efficiency, but they do affect readability and maintainability, which drive code velocity later on. The absence of diverse solutions to this ML question was a sign that the new assessment should be untimed, and that the new ML question should not be so basic as to lead to undifferentiated solutions among candidates.

Multiple choice conceptual questions do not capture reasoning ability

While multiple choice questions are great for automatic evaluation of core ML concepts, they obfuscate the reasoning behind the candidate's answer. This leaves open the possibility of memorizing information without knowing the "why" or "how" behind it. Similar to the limitations of Leetcode-style questions, selecting a representative set of multiple choice conceptual questions is difficult, particularly in a time-constrained setting. There are scenarios where, even if a candidate does not know the exact answer to a question due to limited knowledge of the topic, their thinking process could shed light on how they operate under uncertainty. This is a crucial skill if the candidate is meant to be working on cutting-edge products where best practices have yet to be established and research is sparse.

Comparing the top scorers on the multiple choice section with in-person live coding performance, there was again little correlation between the two. Additionally, a strong degree of proctoring is required to ensure that candidates are not simply looking up the answers to these questions. Given that our interview pipeline consists of three stages after initial screening, five minutes of additional time invested up front to analyze responses in a more nuanced format is well worth the effort to avoid wasting time later on.

Crafting a new assessment

With all the lessons learned from the initial assessment, I came up with the following principles that the new assessment should adhere to:

  1. Takes less than 3 hours to complete on average, and does not have a strict time limit
  2. Assesses adaptability and reasoning more than rote memorization
  3. Surfaces lack of collaborative coding experience
  4. Differentiates between average and star candidates
  5. Is representative of what a day in the life of an ML engineer is like

I started with the last point, which benefits the candidate, since they get a sneak peek at what it might be like to work with our team, and it lets us gauge how well they might perform in the role. After a bit of brainstorming, I decided on the following idea: have the candidate add functionality to the codebase of an ML paper that implements some novel method. Ideally, this requires reading the paper to understand what the code is actually doing. It eliminates any concerns about memorized solutions, since it's not feasible to keep up with every ML research paper, as most of you probably know. If designed properly, this kind of assessment does not require proctoring, as it is beyond the capabilities of ChatGPT and outside the knowledge base searchable by Google. Here is how I went about it.

Finding an appropriate paper

Since we primarily deal with text data at GPTZero, papers involving text classification methods are a good subset to consider. For practical reasons, it should take less than 20 minutes to train the model described. Another requirement is that the method presented in the paper should not be so complex that it takes a considerable amount of time just to understand what's going on. The public codebase should also be high quality and implemented with the same libraries we use in our ML stack. The ML community has largely converged on PyTorch, so chances are that the authors' implementation uses it. Lastly, the codebase and method need to be extendable in a meaningful way that shows the candidate they are making progress on the problem; I'll get into this in the next section. I decided on a recent paper I had read on AI text detection, since I found it interesting and it covers many useful aspects of ML engineering in terms of data preparation, feature engineering, model training, and evaluation.

Modifying the codebase

The part of the assessment that involves adding a novel component to the codebase should take the longest amount of time. In the assessment I made, applicants had to implement a function with well-documented inputs and outputs that performs feature extraction using a small language model. While it sounds simple, there are enough subtleties here to reveal experienced candidates with a high level of attention to detail. Additionally, there is no easy way for a candidate to verify that they completed this part correctly, since many solutions that miss key details will not break the rest of the assessment. It may therefore be worthwhile to provide hints about gotchas that are not obvious in this part.
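To make this concrete, here is a minimal sketch of the general shape such a function might take. The model name, pooling strategy, and signature below are hypothetical and not the ones from our actual assessment; the point is that details like eval mode, disabling gradients, and masking out padding before pooling are exactly the kind of subtleties that separate careful candidates.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Hypothetical backbone; the real assessment uses a different model.
MODEL_NAME = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()  # easy to forget: disables dropout at inference time

def extract_features(texts: list[str]) -> torch.Tensor:
    """Return one fixed-size embedding per input text (mean-pooled hidden states)."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():  # another common miss: no gradients needed here
        hidden = model(**batch).last_hidden_state         # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1).float()  # (B, T, 1)
    # Exclude padding tokens from the mean; skipping this silently
    # degrades the features without breaking anything downstream.
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)
```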

To test whether applicants actually read the paper, one option is to have them implement a part of the paper that is relatively easy to translate from equations or pseudocode into code; the existing implementation can simply be removed. There is a risk that applicants will search for the codebase on Github and copy over the solution for this part. However, I noticed only a couple of instances of this, and they were obvious, so make sure to select code where a direct copy and paste stands out. Applicants should be able to determine whether their solution to this part is correct, or at least non-breaking, by running the code.

Debugging is one of the most important skills a software engineer can have. Perfect code is seldom written, especially as the scale and complexity of the code increases, so being able to identify what is causing unexpected behavior can save hours of development time. This is part reasoning ability, part attention to detail, and part intuition that comes from experience. Testing for this crucial skill can be as simple as introducing bugs into the codebase, but it should be done in a principled way, and it is important to differentiate between two types of bugs. Breaking bugs, which prevent the code from running, are the easiest to identify. I added such bugs to test library-specific knowledge that any ML engineer should have. A trivial example is trying to feed text directly to a model without tokenizing it.
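As a hypothetical illustration (again using Hugging Face style code rather than our actual codebase), a planted breaking bug and its fix look something like this:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Same hypothetical backbone as in the earlier sketch.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased")

texts = ["Was this written by AI?", "This one is definitely human."]

# The planted bug: feeding raw strings to the model.
# model(texts)  # raises immediately; the forward pass expects token-id tensors

# What the candidate should arrive at: tokenize first, then forward the tensors.
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    features = model(**batch).last_hidden_state
```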

Non-breaking bugs, which do not prevent the code from running (on most inputs), are where the plot thickens. These bugs vary in subtlety, ranging from causing failures on edge cases to making the resulting models produce bad or unexpected outputs. A classic instance is data leakage, where some of the test data is included in the training set. Another example is using the wrong loss function for the given learning task. While other programming specializations have their own classes of debugging hardships (e.g. concurrency), machine learning has many subtle bugs because it is so data-driven, and it is a common trap to treat the entire process like a black box. Being able to navigate these challenges is another indicator of whether or not a candidate will excel in the role.
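Below is a toy illustration of the data-leakage variety, not taken from the assessment itself. The buggy version runs without complaint and even reports better metrics, which is precisely what makes this class of bug hard to catch.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
# Noisy target so the model cannot be perfect without memorizing.
y = (X[:, 0] + rng.normal(scale=1.5, size=1000) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# The non-breaking bug: part of the test set leaks into the training set.
# Nothing crashes; the reported test accuracy is just quietly inflated
# because the model has memorized the leaked rows.
X_leaky = np.vstack([X_train, X_test[:100]])
y_leaky = np.concatenate([y_train, y_test[:100]])

clean = RandomForestClassifier(random_state=0).fit(X_train, y_train).score(X_test, y_test)
leaky = RandomForestClassifier(random_state=0).fit(X_leaky, y_leaky).score(X_test, y_test)
print(f"clean: {clean:.3f}  leaky: {leaky:.3f}")  # leaky is noticeably higher
```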

One key aspect of the coding portion of the assessment is that it should span multiple files in a codebase with some degree of hierarchical organization. The motivation is that the assessment should capture how candidates will navigate code in their day-to-day work, so try to keep this as realistic as possible. Candidates who mainly work with research-quality code, which tends to reside in a single monolithic file, may be slowed down by this structure, and that may show up in the self-reported amount of time it takes them to complete the assessment.

Simulating a collaborative environment

If your organization's projects involve multiple people modifying the same part of a codebase, then the ability to handle conflicts between different code versions is vital. This is a trivial skill for many engineers, but some who have mainly worked on siloed projects will struggle to navigate a codebase that is in a merge conflict and to resolve said conflicts. You can introduce such a challenge by having two branches which modify the same function in different ways. For example, one branch may fix a bug while the other adds functionality. These branches can also have diverged from the main branch at different points in time for added complexity; for instance, a file, along with references to its functions and classes, may have been renamed or moved. I introduced such a conflict in the codebase for the new assessment and provided some high-level instructions on how the conflicts should be handled. This acts as a roadblock for less experienced candidates, while experienced ones will breeze by.

Tying in a conceptual question

Put yourself in the shoes of an experienced researcher. Reading a research paper is more than just a matter of absorbing information. It's a critical process in which many questions emerge: about the data (are sufficient datasets from a variety of domains used?), about the method's limitations in terms of scalability, about whether the work is truly novel, and about where there is room for improvement. There should be plenty of content in the paper from which applicants can be asked "what if..." or "are there any limitations of..." types of questions. Pick one such question in the interest of keeping the assessment a reasonable length. There may be multiple acceptable answers to such an open-ended question, so this part of the assessment should function as a differentiator rather than a disqualifier. A top-tier candidate may give an answer you hadn't thought of here, which is a strong green flag.

Results so far

We’ve been using the new assessment for over a month now, with great results. After collecting about 10 completed assessments, I tweaked some of the questions to calibrate their difficulty. This step is to be expected and is similar to academic settings where professors have teaching assistants attempt exams in order to gauge what students will be capable of. The percentage of candidates willing to do the assessment is surprisingly high, which may be indicative of the current job market. I doubt this would have been the case at the height of the tech hiring frenzy in 2021–2022, so we are fortunate in this regard.

The success rate has dropped significantly in comparison to our previous assessment, and of the candidates that do pass, clear differences between them emerge. Also, candidates find completing the assessment to be fulfilling, and see it as a valuable learning experience as per their feedback. This is encouraging to see as one of the implicit goals of the new assessment is for it to not feel like a waste of time on the part of candidates.

I would encourage other ML engineering teams to move towards a similar assessment, even though it requires a bit more investment up front than using an online source of standardized questions. Lastly, I would encourage candidates to keep the skills described throughout this post sharp and not overfit to any single dimension along which they may be evaluated, unless of course they know for sure that a particular organization uses exclusively Leetcode-style questions. This extra effort from both parties can lead to better hiring decisions for engineering orgs, better company choices for candidates, and a better equilibrium overall in terms of everyone finding what they are looking for.
