The Chief Psychology Officer
Exploring the topics of workplace psychology and conscious leadership. Amanda is an award-winning Chartered Psychologist with extensive experience in talent strategy, resilience, facilitation, development and executive coaching. A Fellow of the Association for Business Psychology and an Associate Fellow of the Division of Occupational Psychology within the British Psychological Society (BPS), Amanda is also a Chartered Scientist. Amanda is the founder and CEO of Zircon and an expert in leadership in crisis and resilience, and has led a number of research papers on these subjects, most recently Psychological Safety in 2022 and Resilience and Decision-making in 2020. With over 20 years' experience of aligning businesses' talent strategy with their organizational strategy and objectives, Amanda has had a significant impact on the talent and HR strategies of many global organizations, and on the lives of many significant and prominent leaders in industry. Dr Amanda Potter can be contacted on LinkedIn: linkedin.com/in/amandapotterzircon www.theCPO.co.uk
The Chief Psychology Officer
Ep67 Ensuring Efficacy & Ethics in Psychometrics
Unlock the secrets of psychological assessments with our esteemed guest, Ben Williams from Sten10, as he joins Dr Amanda Potter, CEO of Zircon, for an in-depth discussion on the tools that shape development and recruitment. Discover why validity and reliability are not just buzzwords but essential elements that ensure these assessments accurately measure what they're supposed to. Together, we unravel the intricacies of state versus trait-based assessments, such as resilience and emotional intelligence, and debate the delicate balancing act between inclusivity and fairness in assessment design.
Journey with us as we dissect the methodologies behind personality assessments, from the often-debated graphology to the impact of gamified assessments on candidate experience. Learn how to navigate the tension between questionnaire length and psychometric robustness without compromising the candidate's perception of fairness. We also explore the challenges neurodiverse candidates face with situational judgment tests and the importance of utilizing normative approaches to make assessments more inclusive.
Finally, brace yourself for a deep dive into the intersection of innovation, ethics, and technology in psychological assessments. Witness the ongoing debate over the use of AI in personality assessments and the ethical considerations surrounding social media data in recruitment. This episode is a testament to the transformative power of psychological assessments, underpinned by scientific rigor and cultural sensitivity, as they continue to evolve in our rapidly changing world.
Episodes are available here https://www.thecpo.co.uk/
To follow Zircon on LinkedIn and to be first to hear about podcasts, publications and news, please like and follow us: https://www.linkedin.com/company/zircon-consulting-ltd/
To access the research white papers mentioned in this and other podcasts, please go to: https://zircon-mc.co.uk/zircon-white-papers.php
For more information about the BeTalent suite of tools and platform please contact: TheCPO@zircon-mc.co.uk
In this episode of the Chief Psychology Officer, we will be talking all things psychological assessments for development and recruitment, personality profiling, and whether they live up to their promise of doing what they say they'll do. I'm Caitlin, Senior Consultant and Business Psychologist at Zircon, and today I'm here with Amanda Potter, CEO of Zircon, and we have the pleasure of hosting Ben Williams from Sten10. Hello Amanda, hello Ben, hi Caitlin.
Speaker 2:Hello.
Speaker 1:Well, why don't we start? Ben, would you mind taking a moment to introduce yourself to our lovely listeners?
Speaker 2:Be happy to. Yeah, real pleasure to be here. I'm an avid listener, amanda, and I think it's a really great podcast. So I'm a chartered occupational psychologist. I've been in the trade for 24 years and every one of those is etched onto my face. I currently run a company called Sten10. We design bespoke psychological assessments. I'm also a former chair of the Association for Business Psychology. I'm currently on the board as head of advocacy, so it's my job to convince people that business psychologists are a valuable asset. So hopefully I get to do some of this on this podcast. Pleasure to be here.
Speaker 3:And we are, aren't we? We are definitely a valuable asset. I love that. Why don't you just start by explaining where Sten10 came from?
Speaker 2:It's a very short name but with a very nerdy background. So as psychologists we like to plot psychological characteristics, like personality, on a scale. If you look at something like extroversion, you have a few people that are very extroverted, you have a few people that are very introverted, but most people come out somewhere in the middle, and when we plot that on a graph it looks like a bell curve. And if we want to do whizzy things like add together people's scores on different personality scales, what you need is a way of measuring where someone lies on that bell curve. So a sten of 10 is the top 2%, the most extreme end of the spectrum. What I didn't realize when I named the company is that there's also a cartoon character called Ben 10, and I have been introduced at conferences as Ben 10 from Sten10, but I'm definitely Ben Williams from Sten10.
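For listeners who like to see the arithmetic, here is a minimal sketch of that bell-curve mapping in Python. The norm-group numbers are invented for illustration, not taken from any real questionnaire: a raw score is standardised against the norm group and then placed on the 1-to-10 sten scale, which has a mean of 5.5 and a standard deviation of 2.

```python
import numpy as np

def to_sten(raw_score: float, norm_mean: float, norm_sd: float) -> int:
    """Convert a raw scale score to a sten (standard ten) score."""
    z = (raw_score - norm_mean) / norm_sd      # position on the bell curve
    sten = round(5.5 + 2 * z)                  # stens have mean 5.5 and SD 2
    return int(np.clip(sten, 1, 10))           # clamp to the 1-10 range

# Hypothetical example: a raw extroversion score of 41 against a norm group
# with mean 30 and SD 5 lands in sten 10, the top ~2% of the curve.
print(to_sten(41, norm_mean=30, norm_sd=5))    # -> 10
```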
Speaker 1:What a great coincidence. Thank you, Ben, for joining us. Apart from your clear credibility and experience in the space, Amanda, what prompted you to invite Ben on the podcast with us today?
Speaker 3:So Ben and I have known each other I would say at least a decade now, and I trust Ben, frankly, because we bring him in as a consultant to help us with the validation of our BeTalent questionnaires and also the preparation of them for the BPS verification process. He helped us get decision styles verified and is currently helping us in the final stages of the strengths application for verification, and also the psychological safety one, and the plan is to have all of the BeTalent tools verified over time.
Speaker 3:So I trust Ben. That's one reason. Actually, he and I disagree sometimes, which is good, and we see things from quite different perspectives. And whilst I don't want this podcast to be too nerdy about validation, even though we are all nerds on the call, what I wanted to think about is the application of assessment tools: what do we need to check that they do, and what shouldn't we do when we're using them? So it's really just a chat about psychology and science and validation, but hopefully a little bit from a layperson's perspective, and why it's important.
Speaker 1:Well then, it might be a nice place to start, maybe for people that are less familiar with the topic of validation and reliability: what do we mean when we talk about validity and reliability when it comes to psychological assessments, and how do they actually differ?
Speaker 3:It's a great question, Caitlin, because I used to teach reliability and stats and validity at master's level, and I would say to the students, which is the most important, is it reliability or is it validity? And so many people would say reliability. So reliability assesses consistency, both within the questionnaire and over time, whereas validity looks at whether the questionnaire or tool or assessment or interview is actually measuring what we want it to measure. Is it accurate? So if you say reliability, what you're saying is it's okay to consistently measure the wrong thing. So therefore, we would really want to check that a tool is firstly valid and secondly consistent. Would you agree, Ben?
Speaker 2:This might be our first point of difference, Amanda, I mean only really from a nerdy perspective. So if a tool gives you different answers every time you complete it, and I'm thinking specifically around test-retest reliability, it's very hard for it to have good validity, because if it's affected by the time of day that you complete it, or maybe how stressed you're feeling, or maybe which examples you call to mind when answering it, it would be quite unreliable; it would give you different results. And when we look at how well a test predicts future performance, it's a correlation between how you've answered the questionnaire and your job performance a year down the line. But if it's different every time you sit it, then it can't be valid. It's a boring prerequisite of validity, I would describe it as. And of course there are different types of validity as well, and some of them have nothing to do with reliability, so we can talk through those as well.
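As a rough illustration of the test-retest idea Ben describes, here is a sketch with invented numbers rather than data from any real study: correlate the same people's scores across two sittings, and the closer the coefficient is to 1, the more stable, and therefore more usable, the measure.

```python
from scipy.stats import pearsonr

# Hypothetical scale scores for eight people who completed the same
# questionnaire twice, a few weeks apart.
time_one = [32, 27, 41, 25, 36, 30, 22, 38]
time_two = [33, 29, 43, 24, 37, 31, 23, 36]

r, p_value = pearsonr(time_one, time_two)
print(f"test-retest reliability r = {r:.2f}")  # values near 1 mean stable scores
```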
Speaker 3:In other words, we need both. I always used to say you've got to have accuracy first, but you're right, you can't be inconsistent with that measure. And that made me think about the difference between state and trait, which is a conversation we've had on this podcast before. If we think about resilience questionnaires and emotional intelligence questionnaires, they're more of a state-based approach, and even our psychological safety one, which we're just submitting, as I said, to the BPS for verification, Professor Adrian Furnham has commented in our technical manual that it is a state-based tool, not a trait-based tool, because our experience of the environment and psychological safety varies in the morning or the afternoon, depending on what's actually happening around us.
Speaker 3:Therefore, the reliability is actually going to be lower on a tool like psych safety, or even on a tool like resilience, because our experience changes quite rapidly, whereas a trait tool should be more stable and enduring.
Speaker 2:I mean, I think this is something that's quite criminally under-investigated in our world. I would love, if I had unlimited time and resources, to map out the different state-based influences on our personality or how we're feeling at any one time. Because, you can imagine it, people will say, well, actually, yes, I'm quite extroverted when I'm around people that I know, but I'm quite introverted when I'm around strangers. So actually having a taxonomy of influences on people's personality I think would be really interesting. So maybe that's another one for a future lineup we can talk about, Amanda.
Speaker 3:Indeed. It's so interesting you keep using the word personality, because I'm not pro personality assessment as a rule, and that's why we've gone down the route of building a suite of tools that complement personality, but we haven't got any personality questionnaires in our arsenal. We're a strengths business: resilience, teams, psych safety and so on. Whereas you're quite big on personality, is that right, Ben?
Speaker 2:Yeah, I guess from my early days at a certain big psychological consultancy firm, we had it drummed into us that personality is someone's preferred way of thinking, feeling and behaving at work. So it's quite a broad definition: someone's preferred or typical way of feeling at work would encompass things like emotional intelligence, it would encompass things like resilience, so it's quite a broad brush. But yes, taking that as a foundation for predicting how good they're likely to be at these certain things, I think, is a valid way to look at things. But I don't know if that's where your difference perhaps lies, Amanda.
Speaker 3:Well, that's where I think I have the greatest problem, and I think is the most contentious part of the use of personality questionnaires. So I know you worked for one of the biggest test publishers to build one of the most widely accepted personality questionnaires in the world, and the issue I have with it is not the tool itself or the questionnaire itself, it's the application of it. Very often we hear from and speak to our clients about how they are using personality questionnaires in a profile-matching way, to either assess fit for recruitment or to assess potential. The publishers work with those organizations to look at the correlations between personality and performance and therefore say, as a result of this, we can predict who is going to be successful in the role on the basis of their personality; therefore, you should either recruit people with this type of personality or you should promote people with this type of personality, because it will give you predictability.
Speaker 3:The issue, of course, is cognitive diversity. Everybody's different. Caitlin and I, even though we're psychologists, are crazily different, as are you and I, ben, and I think it's the combination of the three of us that brings the innovation and the challenge. So it's not the actual personality questionnaires per se that I probably have a problem with. It's the application and it's the desire for the test publishers to sell as many as they possibly can, and the easiest way to sell them is to sell them in a simple way.
Speaker 2:Yeah, it's to kind of promise the earth, and I think we're probably aligned on that point, though, because I mean I've worked with a number of clients that have recognized that recruiting to a template actually gives them a competitive disadvantage.
Speaker 2:They say, well, I think back to the earlier days when we recruited people from a really broad range of career types and degree disciplines; I think we had much more diversity of thought. If I was using a tool like that aforementioned personality measure, I would have 'danger zones' in inverted commas, which probably isn't the most inclusive term, but it says, look, actually all three of us are quite different personality-wise, but if all three of us were sten one on one of the scales called behavioral, which is all to do with whether you have an interest in other people and what makes them tick, we probably wouldn't be great psychologists, maybe. So actually, the danger zone on the behavioral scale might be stens one, two and three, but anyone above that, you can have a pretty broad spectrum, and you do that for each of the other relevant scales.
Speaker 3:Do you know what? That's a really good point. Really it's about education, then, isn't it? It's really about education of the client and helping them understand how to use the tools, but it's also about the ethical use of our questionnaires. And that really reminds me of something Sarah said to me, because she showed me the BeTalent strategy, and it's all about the ethical use of the tools. I queried the word ethical, but it really fits now with the conversation we're having, with not overselling the tools.
Speaker 3:I was talking to Julie Lee this morning about us recording this podcast, and I was saying the worry I was having was that we get too specific and too narrow in our thinking around psychometrics. And we were talking about how actually we have to validate everything in society and in real life all the time, to make sure that we're making the right choices about any tools or things that we're buying. And she was talking about cars, about how cars now keep you in the lane or help you to park, and they sense braking in front of you, and it takes away the attention of the driver because we overly rely on the car. And if I link that to the use of questionnaires or tools for recruitment, for development, for the identification of potential, I think we're at risk of doing the same thing. There are certain tools that type you too much, to the point that they almost tell you what the answer is, and that's why there's an issue with using things like graphology and handwriting analysis, and that's my issue with it.
Speaker 2:Yeah, and I think, like, if I drive my wife's car, it has a blind spot monitor, so it warns me if I indicate on the motorway and there's a car just in my blind spot; and then I jump into my car, which doesn't have that, and you're expecting the guardrails to be there. And I think, with the use of personality questionnaires, I mean, I've noticed more and more access to personality questionnaires being granted without any training being required, and the argument is, well, the expert report that's generated at the click of a button does all the work for you, so you don't really need to understand how personality questionnaires are put together or the risks and pitfalls, because it's all written in a way that's ready for the layperson. And that is a slight worry. And I think, going back to your point, Amanda, about the motivations for selling these tools, yes, we are all commercial organizations, but we've just got to tread that line between being very commercial but also protecting people's psychological well-being, because personality questionnaires are very personal by their very definition, and you can do damage if they're not used appropriately.
Speaker 1:Totally true, I love the car analogy, and I love how you compared it; I think it makes it feel really realistic in that sense. But I suppose it's a good point, maybe, to therefore talk about, Ben: why do you believe it's important that we should be thinking about validating products, then?
Speaker 2:I guess this might lead us down another tangent instantly, Caitlin, which is, what do we mean by validity? Because I think there are different reasons for different types of validity being important. Quite often I look at validity as being a bit like a ladder, where you can array the different types of validity in order of their power. So, going back to what Amanda was saying about graphology, at the very lowest rung of the validity ladder we have things like faith validity.
Speaker 2:So it's a selection method that's used because people just believe it works. It doesn't look like it should work, but they say, I believe in it. So things like handwriting analysis, graphology, star signs, horoscopes, tarot, bumps on the head, phrenology, those china busts you can get that say this part of the brain's for this, this part of the brain's for that. And this has all been investigated, and unfortunately, Libras don't necessarily make wonderful teammates because they're so caring and balanced. Apologies to any astrology fans on the podcast. So faith validity is the lowest, and I would say that, yeah, on its own it's not going to be legally justifiable, and you could end up doing a fair amount of damage and exposing your company to risk.
Speaker 1:So when you're talking about graphology, do you have any examples or stories related to that?
Speaker 2:Yeah, well, like Amanda, in the earlier stages of my career I trained people on different assessment methods, and if I'm going to rubbish something like graphology, then I need to understand what graphologists tend to say. And so I was looking through this Graphologist 101 handbook, and they were saying things like, if your letters are very sharp and pointed and diagonal, it means you're a go-getter. If you do big, fat, round bottoms to your G's and your Y's, it means you're intrinsically lazy. And then, surprisingly, if you color in those loops on your letters, then it means you're a sexual deviant. Goodness. Interestingly, graphology is actually slightly more predictive of your personality in France than it is in the UK in some of the studies, and the reason for that is thought to be because in France they have a very standard way of teaching what each of the letters should look like: this is what an A looks like, this is what a Y looks like, whereas in the UK it's far more variable. So if you deviate from what they've taught you at school from the age of five, what a letter Y looks like, it says more about your personality there than maybe it does in the UK, where it's a bit of a hodgepodge.
Speaker 2:Interesting. The next step on the ladder is face validity. So, does it look appropriate for the job? And, on its own, using a test that looks like it's assessing personality isn't enough. But compare it to some of the other methods that are around at the moment: there are some gamified assessments that actually don't have great face validity, because you're essentially playing a game and it's inferring your personality from that. Now, that isn't a problem in terms of its predictive power, because some of these games do predict future performance, but in terms of the candidate experience, you really need to explain why they're being asked to sit this, because you could quite rightly say, how does this relate to my performance in the job?
Speaker 3:And you mentioned when we were talking earlier about the length of time it takes to complete a questionnaire. That's also relevant here, isn't it? Because we are encouraged by our clients as much as possible to shorten the length of time to complete the questionnaires. And I know, if you think about OPQ or some of the other really in-depth personality questionnaires, and NEO too, they can take some time to complete, 45 minutes or so, and also there are two different types of questions in OPQ, which makes it even harder. And if you're completing multiple tools as part of a recruitment or a development process, it could be quite arduous for the individual, completing hundreds of questions. Our questionnaires are between 15 and 25 minutes in length, and that's been the aim, to try and make them as robust as possible. We don't seem to get any challenge that they're too short; I don't think we've ever really been challenged that they're too short. We only get challenged if they're too long. But if they were really short, say maybe three minutes in length, that would impact face validity, wouldn't it, Ben?
Speaker 2:It would definitely impact candidate perceptions of whether they'd been treated fairly or not.
Speaker 2:I mean, there's a tool out there called the BFI, the Big Five Inventory, which is an academic instrument, and there's a 10-minute version and there's a two- or three-minute version, and it's got reasonable psychometric properties. However, another study from a couple of years ago now looked at candidates' perceptions of whether they'd been treated fairly with these very short measures, and they felt they really hadn't had a chance to shine or to showcase the full complexity of their personality with a very short tool. So it's a bit of an unintended consequence of this race to have the shortest questionnaire possible that actually candidates might feel less well treated. So yes, I think 15 to 20 minutes is probably the sweet spot. The first one I was trained on was a 366-item questionnaire, an adjective checklist, and it took about an hour to complete, and it's like, no, they're just collecting your data.
Speaker 2:Yeah.
Speaker 3:I'm going to go off track before we go to the next type of validity, to the difference between ipsative and normative questions. So normative questions, just for the listener, are those types of questions where there's a statement and a rating scale. We often have a seven-point rating scale on our questionnaires because we validated it; others seem to have four or five, and it's a difference between whether you want a midpoint, or you want to force people away from the midpoint by having an even number of points. Anyway, I'm getting way too technical.
Speaker 3:There was a really interesting aside to this research, because we have moved towards only having normative questionnaires, so not ipsative, which is forced rank, where you have to rank statements in order of preference. And then, when we looked at the neurodiversity research, there is some research to show that neurodiverse, so non-neurotypical, people actually don't enjoy completing ipsative questionnaires, the forced-rank questions; they are more comfortable with those normative questions. So when you're thinking about the questionnaire for your organization, if you want to be inclusive, then you are probably encouraged to use normative questionnaires rather than ipsative questionnaires, and I know some of the leading products on the market are both. So it's an interesting element that's come into the literature in the last few years.
Speaker 2:I'd be interested to have a look at that, because everyone hates ipsative questionnaires.
Speaker 2:It seems like they're never greeted with joy when it's 'which is most and least like you'; it's like, oh, do I have to do this? But yeah, I mean, we've had a similar thing on the neurodiversity front around hypothetical scenarios, so having them in a situational judgment test: imagine that you are this person, take their perspective, try to infer what this other person might be feeling and choose the most appropriate answer. And if it's phrased in that kind of way, that kind of imagine-this-artificial-persona-you-need-to-adopt, that's actually quite problematic as well from a neurodiversity perspective.
Speaker 3:I mean, some of the things we took for granted around the build of situational judgment tests and around the use of ipsative questions, I think we can't make the same assumptions that we used to make. Now we are becoming more aware of individual differences and needs.
Speaker 1:Ben, would you like to go back to talking about some of the other validities?
Speaker 2:Yes. So the next one is a relatively recent one, or at least it's relatively recently been shared at conferences that I've been to, and it's called experiential validity. It's typically used for personality tools that look at type rather than trait, so essentially personality tools that say, this collection of letters or this color defines broadly who you are as a person. Now, the research behind these tools is often a little mixed from a psychometric perspective, so sometimes the clusters don't always hold together psychometrically, sometimes they don't always predict performance in a year's time.
Speaker 2:However, from an experiential perspective, i.e. how the person completing it feels, it can be incredibly insightful. So actually having someone saying, look, your responses to this questionnaire show that you are likely to be this type of person, what impact does that have for your relationships at work? How does that impact how you conflict with some people and complement others? It can trigger a whole load of light bulb moments that maybe wouldn't have been there otherwise. So I think that's sometimes used as another measure of success for a personality questionnaire: how many light bulbs does it trigger?
Speaker 3:That's really cool. I've never heard of that, and that's exactly what we're trying to achieve, isn't it, Caitlin, with the new classification of the team questionnaire that we're working on? We've just factor-analysed the results of the team questionnaire to look at how each of the team preferences, like entrepreneurial, pioneer, scientist and so on, come together into a classification or a group, and then we're trying to interpret it. But really we're trying to create those opportunities for aha moments and those light bulb moments. So that's a brilliant type of validity I've not actually heard of before.
Speaker 1:It helps it stick, it goes back to that stickiness.
Speaker 2:Absolutely, and I think it is in team environments where I've seen the most light bulbs going off. So I was using one of the type tools with a team that had experienced a lot of conflict, and having just an understanding of individual differences, an appreciation that there's no right or wrong with personality, it's just different strokes for different folks, then it could really help to resolve it.
Speaker 1:What's the next one, Ben? The next form of validity?
Speaker 2:This is where things start getting a little bit more meaty, in that this is the lowest form of validity that's legally defensible. So if someone were to say, I don't think your assessment process is fair, you need to have content validity in order to be able to defend it from those accusations. And content validity is where you can prove that the choice of the tool, and the choice of the specific scales within that tool, are based upon job analysis, upon a really in-depth understanding of the job requirements, by speaking to job incumbents, observing them, surveys, speaking to managers, etc. So that's a baseline level for validity.
Speaker 3:So if you think about a tool like ours, then if you think about psychological safety or you think about strengths or resilience, it's about making sure that questionnaire is actually measuring psychological safety or is actually measuring resilience.
Speaker 2:That's more construct validity. So content validity is more like, is resilience or psychological safety important in this job? Because if it isn't, and the test is just a convenient way of getting the numbers down, then it's got poor content validity. So I think quite a few companies have been guilty of that, when they mindlessly use verbal, numerical or logical reasoning as a way to get the applicants from thousands down to tens for an assessment center, just thinking, well, everyone needs to have a bit of numerical reasoning, right? But actually it's not based upon the job description or anything.
Speaker 3:Do you know, I always get those two confused. That's brilliant, thank you. I'll have it on the podcast now; I'll know forever.
Speaker 1:Would you say content, then, is almost like the justification or the journey before you get to the construct?
Speaker 2:Yes. I mean, essentially you need to know what constructs are important, and job analysis gives you what's called content validity. But then it probably does segue quite nicely into construct validity, because construct validity is giving that a little bit of teeth and saying, okay, you've got the qualitative evidence to say that this is important; construct validity gives you quantitative proof, in a way. So there are a couple of ways of looking at construct validity. You can either say, well, we're measuring a trait that other tools have measured before, so let's see, if you've completed my whizzy new psychometric test, does it give the same results as the old traditional test that's been around for donkey's years?
Speaker 2:The other thing you can do is if you've got a new model. So with some of our clients, because we do bespoke psychological assessments, they've got their own pet theory as to what makes a great leader, or whatever it might be. We had an author last year who'd written a book on inclusive leadership and had identified, I think it was, six different factors that went towards making a really inclusive leader, and then what we could do is factor-analyze people's answers to the questionnaire and see if six clusters emerged. And if those six clusters emerged, it suggests that, yes, those constructs do really exist, they stand separately from one another and they're worthwhile commenting upon. So there are a couple of ways of doing construct validity, but it's really about whether the theoretical basis upon which the tool is built is sound.
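A hedged sketch of the kind of check Ben is describing, using placeholder data rather than any real questionnaire (the array sizes, the 1-7 rating scale and the six-factor model are assumptions for illustration, and because the placeholder responses are random the printed counts only demonstrate the mechanics): factor-analyse the item responses and see whether roughly the theorised number of factors emerges.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

# Placeholder pilot data: 300 people answering 30 items on a 1-7 scale.
# In a real study this would be the actual questionnaire responses.
rng = np.random.default_rng(0)
responses = rng.integers(1, 8, size=(300, 30)).astype(float)

# A common first pass: how many eigenvalues of the item correlation
# matrix exceed 1 (the Kaiser criterion)?
eigenvalues = np.linalg.eigvalsh(np.corrcoef(responses, rowvar=False))
print("factors suggested:", int(np.sum(eigenvalues > 1)))

# If the theory says six factors, fit a six-factor model and inspect
# which items load on which factor.
fa = FactorAnalysis(n_components=6, random_state=0).fit(responses)
print(fa.components_.shape)  # (6 factors, 30 items) of loadings
```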
Speaker 3:We did the same thing for strengths and for psych safety; we've done it also for each of our tools, and that's something the BPS really wants from us, to make sure that we have assessed that construct validity, that the real basis of the tool is sound. So we are getting really technical in this podcast now. So I suppose then, before we go into the next step, I would love you to just talk about what people who are buying psychometrics should look for. And, by the way, we don't call them tests, because there's no right and wrong answer to a personality questionnaire; I really encourage all of my team not to use the words personality and test in the same sentence, because test suggests there is a right and wrong, whereas questionnaire suggests that there isn't.
Speaker 3:But historically they've always been known as personality tests. I have a fundamental problem with that, frankly, but that's another thing. So I think let's go back to thinking about the listener. What should they be looking for with all the things we've talked about so far If they're to buy a tool, a questionnaire?
Speaker 2:I would ask, can I see a copy of the test manual, please? Now be prepared: you're probably going to get something, assuming they have one, that's going to run to dozens, if not hundreds, of sides, and it's going to talk all about the theory behind the test construction, all of the statistics that have been done on it, etc., etc. So it depends how detailed you want to get. I would say the fundamental things you'd want to look for are, if we take reliability first of all, so its consistency: it's measured on a scale from zero to one, as to how consistently it measures a certain trait, and you are looking for that number to be at least 0.6, if not a bit higher. If it was one, it would be perfectly consistent; it would always give you exactly the same answer, no matter the time of day, level of stress or anything else that's going on.
Speaker 2:And if it's not, then essentially you get wildly different scores every time you sit it. So that's one. Two, validity, and we're actually about to talk about the final two types of validity. That's also measured on a zero-to-one scale, so zero means there's absolutely no relationship between how you perform on this test and how you perform in the job via some kind of criterion measure. What we sometimes do is measure how people perform in their job right now against how they answer a questionnaire right now, and we look at the relationship between the two, and the higher the better. We're looking ideally at at least 0.3, but sometimes for personality you might need a few scales to get up to 0.3. The other way of looking at it is time one and time two: you sit the test or questionnaire today, you look at performance in a year's time, and how strong is that relationship? That's really the gold standard that not many tests have, but that predictive validity is the best you can get.
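The zero-to-one consistency figure quoted in test manuals is usually an internal-consistency coefficient such as Cronbach's alpha. Here is a minimal sketch with invented ratings, not data from any actual tool, of how that number is computed.

```python
import numpy as np

def cronbach_alpha(item_scores: np.ndarray) -> float:
    """Internal consistency of one scale; rows are people, columns are items."""
    item_scores = np.asarray(item_scores, dtype=float)
    k = item_scores.shape[1]                              # number of items
    item_variances = item_scores.var(axis=0, ddof=1).sum()
    total_variance = item_scores.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1 - item_variances / total_variance)

# Hypothetical answers from five people to a four-item scale.
items = np.array([[5, 6, 5, 6],
                  [2, 3, 2, 2],
                  [4, 4, 5, 4],
                  [6, 7, 6, 6],
                  [3, 3, 4, 3]])
print(f"alpha = {cronbach_alpha(items):.2f}")  # looking for at least ~0.6
```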
Speaker 3:That's a nice segue, then, ben, for you to just describe those last two types of validity.
Speaker 2:So the final two types of validity come under the banner of criterion validity. So it's how well does the test or questionnaire link to some kind of criterion measure that's important in the job? These criterion measures can be hard or soft. A hard criterion measure might be something like sales, because you can put a precise number on it: how much do the people who score better on this questionnaire or test actually sell?
Speaker 2:It could be things like tenure. So when we work with high-turnover environments like call centers, they want to say, well, actually, we want to reduce turnover, and that can be measured in quite a concrete way. Or it could be a softer measure like manager ratings, which could be a little bit influenced by the manager's own biases or opinions, or maybe customer satisfaction ratings, things like that, and they can be a little bit easier to get. So that's a criterion validity study, and we would assess that on a scale from zero to one, with 0.3 on that scale being about acceptable. One would be perfect prediction, as if we'd got inside someone's brain and measured their personality traits exactly, and it never happens.
Speaker 3:It never happens, and if it does happen, you need to question actually if that's true, because it's probably not.
Speaker 2:Yeah, that would be crazy.
Speaker 1:The fact that you've just gone and listed all the different types of validity, I think that goes to show that there's a lot to think about. But it might be nice now to kind of go into talking about what the practical steps to validating are. Is it quite a lengthy process, and what does that look like?
Speaker 2:Yeah, it's a straightforward process in theory. That's what I'd say. So what you need is you need to get a pilot sample, so a group of people that are prepared to both take your questionnaire or test and also to provide you with some kind of job performance measure. We would typically advise at least 50 people, but if you're going for recognition by a professional body like the BPS, then you're looking at 300 people plus, and it's quite simple really. You get them to complete the questionnaire, you get the measure of job performance, whether hard or soft, and then you look at the strength of the correlation between the two, and the higher the better.
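In code, the pilot study Ben outlines is essentially one correlation. A sketch with hypothetical figures, not from any real client study: each pilot participant supplies a questionnaire score and a performance measure, and the criterion validity estimate is the strength of the relationship between the two.

```python
from scipy.stats import pearsonr

# Hypothetical pilot data for ten people: overall questionnaire score
# and a 'hard' criterion measure such as sales in the last quarter.
questionnaire_scores = [62, 48, 71, 55, 40, 66, 58, 45, 69, 52]
sales_last_quarter   = [30, 22, 35, 26, 18, 31, 24, 23, 36, 21]

r, p_value = pearsonr(questionnaire_scores, sales_last_quarter)
print(f"criterion validity r = {r:.2f}")  # around 0.3 or higher is the usual benchmark
```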
Speaker 1:And I think, amanda, we've kind of signposted throughout. We've mentioned that we've done that with a few of our products, haven't we?
Speaker 3:We have, and it's a massive amount of investment. And one of the things I was wondering about, Ben, is that, you know, we've taken 18 years to do that with decision styles and get it verified, and it wasn't a straightforward process. You're absolutely right, in theory the steps are clear, but it doesn't go like that. Strengths has taken 10, you know, resilience a similar number, psych safety is six years old. These tools take years to get to the standard that we need them to be at, to sell them and to be confident that they're doing the job that they're doing. Yet, Ben, you're building products for other companies and you're building them pretty rapidly. How are you doing that? Why is it taking me 10 to 20 years?
Speaker 1:and you're doing them in such a short amount of time.
Speaker 3:What are you doing that I'm not doing?
Speaker 2:You're doing the whole shebang, which is a technical term. So you are preparing for the BPS a highly detailed technical manual that really explains and justifies all of the theory. You look at each individual question and how it's performing, as well as gathering data on adverse impact, et cetera, et cetera, which is all absolutely what you need to do if you're selling an off-the-shelf tool that's purchasable and will be usable en masse in a wide variety of contexts. With our clients, there's a bit of a sliding scale.
Speaker 2:So sometimes, due to time and cost considerations, what we're working on is, I would say, a combination of content and a little bit of construct validity with our bespoke assessments, because we don't have the time to wait 18 years or even six months sometimes, to say, actually, before we use this method of assessing people, let's wait to see that we can prove that it links to performance in two or three years time, because if we refuse to work with them until they've done that, they'll use something else, typically something that's less robust.
Speaker 2:So they might say, right, fine, we'll go back to the interview then, we'll go and take them down to Pret, or we'll do CV sifting. And we say, well, no, don't use CV sifting, let's do a mini pilot. Let's make sure there are no obvious flaws with the test from a psychometric perspective, and let's have a research plan to build upon that in future years as the sample size gets larger. So yeah, we take a graded approach, but rely upon content validity in the early stages.
Speaker 3:So I suppose that's made me think why would they come to us?
Speaker 3:Why would they come to BeTalent, or why would they go to Sten10? And I think that's a really good question, because I think they'd come to us if they wanted to buy something that's going to examine the specific constructs that we have researched and built tools around, so resilience, psych safety, strengths, team 360 and so on. So if you wanted to have a specific construct assessed, then you'd come to us. If, however, you can't find a tool that assesses the thing that you need assessed, or it's specific to your organization and to the characteristics within your organization, then you need to go to Sten10, and then, Ben, you and your team can build the bespoke tool for that organization, one that will be as valid and reliable as it can be, given the budget and the time that they give you to do that research and that design. But if you want something really robust that's standardized, then you'd come to a test publisher like us.
Speaker 2:That's right, and I'd say that increasingly Sten10's clients want us to serve a couple of purposes with our assessments, and often that's providing a realistic job preview as well. So we work with teachers and social workers to recruit for roles where actually some of the people going into them don't realize the highs and lows of the profession and some of the contexts in which they'll be operating. So we might be testing resilience, but we're also giving them a taste of some of the scenarios they're going to encounter, and that can allow self-deselection. So they say, I don't think I can handle that, or I'm going to look for a different type of teaching role in a slightly different type of school that might suit me better. So some of that's what we're doing as well, giving a two-way assessment of fit.
Speaker 3:I think that's really important. I remember them doing that with the 999 ambulance service as well, because that's quite a harrowing job and quite tough, as are many others. That realistic job preview element is really important.
Speaker 1:So Ben, when do you think test publishers have got it wrong?
Speaker 2:Well, I think there are different ways in which test publishers can get it wrong. One of them can be perhaps the over-eager application of new technologies. I think there's an adage from Captain Kirk that said, just because you can doesn't mean that you should. There was some research relatively recently about how you can determine people's personality traits from their facial movements, so from a sample of talking, maybe for 15 minutes, you can then predict their Big Five personality traits. This was enthusiastically integrated into some people's hiring processes. But, surprise surprise, candidates hated being told, please have your camera on, we'll be analyzing your facial twitches and using them to determine your personality traits and therefore whether you'll get to the next stage in the selection process.
Speaker 2:It was terror-inducing.
Speaker 3:That seems bonkers.
Speaker 2:Yeah, I think the psychology behind it was all right, but I think that use of AI has now been deemed inappropriate within the state of New York, and actually it's not part of that publisher's product line anymore.
Speaker 1:How long were they doing that for?
Speaker 2:It was like a year or so.
Speaker 3:So it's quite a long time, isn't it? Another one I can think of is social media likes, because I know there are one or two organizations in the UK that at one point were trawling through people's social media and assessing their likes and their comments, and using that to predict their profile, their personality and their employability. What's your view on that one, Ben?
Speaker 2:A lot of people might take a sneaky look at people's social media profile just to make sure that there are no skeletons in the closet before hiring people. Of course, I would not advocate for that, because it will trigger all sorts of biases. But using people's social media profiles as a proxy for their personality is something that's been investigated and, again, in some cases has been shown to be plausible. But of course, the big problem is that you're sampling people differently. So some people will have 50 different pages that they like on Facebook; some people might like two. So for the former you'll have a far richer source of information to make inferences about their personality than for the latter. Also, of course, I could see a trade growing up in fake Facebook profiles: earn your dream job at such-and-such a bank, just sign up here, and our Facebook profile is guaranteed to get you an interview. So it's an uncontrolled environment as well, which I think is a problem.
Speaker 3:What's interesting, though, is that those organizations we're thinking about created an industry doing that, didn't they, of creating a way of assessing and profiling individuals on the basis of social media, which is just amazing. So I think the point is we can go too far with our innovation, our creativity. I think we can become a bit eccentric, and we don't want to go too far beyond what the tool, the questionnaire, the test is actually trying to do.
Speaker 2:I heard a story from a forensic psychologist talking about salivation levels in extroverts and introverts, the theory being that extroverts, because internally they're perhaps less stimulated than introverts, which is why they go out and meet people, they actually produce less saliva.
Speaker 3:Ah, we have less dopamine, don't we, as extroverts, which we identified in a previous broadcast.
Speaker 2:Yeah, and there's a study where they were squirting lemon juice into extroverts' and introverts' mouths, and they found that, in response to lemon juice, extroverts produced less saliva than introverts. So they were saying, and this was a student telling me, they got them to lick a bit of paper, and the shorter the saliva trail, the more of an extrovert you were. Now again, you could use that as part of a job interview process, but you're going to get some funny looks.
Speaker 3:That's a no for me.
Speaker 3:We won't be including that in any of our tools. But it goes to show, you know, if we get science showing predictability, then we might make the wrong decision in saying, well, let's include it. I think Andrew Huberman, in one of his podcasts I was listening to recently, said if you look to find evidence for a relationship between two pieces of data, you will find it, but if you also look for evidence of the contrary, you'll find that too. So there's evidence for everything out there. It doesn't mean you should use it.
Speaker 2:And there are some old personality questionnaires that essentially just threw a whole lot of questions out there, things like, do you like Marmite or Bovril, which I think is a predictor of whether you're likely to suffer from depression in later life, and no one has a clue as to why there's a link there; it just is, because they've thrown enough data at it. On the plus side, those types of questionnaires are not fakeable unless you know what the answer is, but on the downside, of course, they look totally inappropriate. There are questionnaires like, would you rather appreciate the beauty of a sunset or an expert football strategy, and again there's no face validity, so they can give a bad participant experience.
Speaker 3:So they predict something, but it's probably just one of those statistical glitches in life that happens to predict, and it doesn't mean we should use it. So that's a great point.
Speaker 1:We love that you're listening to this. It means a lot to me and the whole team, who put a huge amount of hours into this podcast. Each week, we release the show for free to help people improve their working lives by sharing the science of psychology and neuroscience. In return, all we ask is that you help us with our mission. If you know someone who would benefit from listening to this episode, please send them a link, and if you haven't already hit the follow button, then please do. Thank you, and on with the show. So we talked a little bit earlier, and I wanted to come back to it because I'm fascinated by it, but you mentioned gamification, and obviously there are a lot of people talking right now about AI and its introduction, so it might be nice to debate that a little when it comes to psychological assessments. So, Ben, do you have any thoughts on AI, and specifically around the potential implications it can have on candidate behavior or candidate experience?
Speaker 2:Yes, so I think probably gamified assessments and AI are slightly different topics, because gamified assessments are essentially a new skin on a very old type of psychological assessment. I remember, for my undergraduate degree, on probably what was a BBC Micro, looking at some shapes and tapping left or right if they were the same or different, to measure my fundamental cognitive processing speed, or something like that. Now the newer game-based assessments, or behavior-based assessments as they're sometimes called, seek to do the same thing but just in a far whizzier package. And I guess the relationship between those and AI is that those types of tools are very hard to use AI to cheat on. Whereas a traditional text-based test or questionnaire, where you're looking at a passage of verbal information and being asked to draw a conclusion off the back of it, ChatGPT can do a reasonable job at that.
Speaker 3:So is it more situational judgment tests and things, or reasoning tests or verbal tests? Is that where the issue is, that they could put those into ChatGPT?
Speaker 2:I would say that a verbal reasoning test has a very definite right or wrong answer; with a situational judgment test, the right answer is going to depend upon the company culture that you're joining. So some companies will want you to answer in a way, for example, that is very autonomous and self-sufficient, and other companies might want you to be very collaborative and supportive and co-creating. So it's harder to fake an SJT than a verbal reasoning test. But again, it's not impossible, if companies put their values up on their website and people can look into that, which is why we're trying to build in multimedia elements now as well. So it becomes another step more difficult for people to cut and paste something into ChatGPT and say, tell me the right answer.
Speaker 1:What do you mean by multimedia? I don't know if that's a silly question.
Speaker 2:No, it's a good question. So for a certain well-known online clothes retailer, we've developed a series of questions: some of them are text-based, some of them show an email chain, others are an audio clip of a call with a client, others are little video clips. So it's trying to think of different ways of presenting the information.
Speaker 1:Okay, yeah, thank you.
Speaker 3:And actually we're considering the use of AI for our tools. The only place we're considering it at the moment is for one of our products, which is our 360, because one of the things about 360s is that I think the reports can be quite arduous, and so we have multiple types of reports that come out of the BeTalent system for each of our products, and one of the reports is a summary report.
Speaker 3:It's a two- or three-pager that you could potentially print and take into a workshop. If you've got qualitative feedback from the questions, you might have pages of comments from your raters if you've had many, many raters, and we are thinking about using AI to summarize the qualitative comments of those raters into a succinct paragraph, and that's as far as we're thinking about going. Sarah is very, very clear: as we're an ISO business, we have a very clear policy, developed by the COO, that we are not allowed to put any content of our tools or information from our clients into any AI product, because of confidentiality. So it's a really interesting debate we're having internally, because I think actually AI would help to generate some useful text in those summary reports and we could access it and use it, but should we, is the question.
Speaker 2:And I think that's where you need to use a closed system that doesn't contribute to the algorithm, and you need to place your faith in the provider that it really isn't going to support the algorithm in any way.
Speaker 2:I've heard some really interesting things.
Speaker 2:Recently I was talking to another test publisher who'd given that instruction to all of their associates and all of their employees: do not feed our tool into ChatGPT. But when he started interrogating it, it knew the tool inside out, so it got out there in the end. But then I've also heard of this phenomenon of data poisoning in AI, which is the deliberate seeding of misinformation into these large language models so that they actually give incorrect responses about things that perhaps are intellectual property or maybe commercially important. It's an interesting one, because we've had clients say to us, we were going to use you to design an interview guide, but now we're going to pop it into ChatGPT instead and it will come out with a reasonable response. Now, if someone were to poison the large language model to come out with terrible questions that asked you about what you had for dinner this morning or what you had for dinner last night, then you could safeguard a whole line of business around competency-based interview creation. Not that I would ever consider doing such a thing.
Speaker 3:But, wow, that's a really interesting point, isn't it? If you were to use AI to do it, of course, agreed, you'd want a secure, closed system, because it would be coded into the technology, yet there is the risk that someone might want to, in other words, poison the system. Anyway, this conversation has really gone to show that we don't take this job lightly, do we? I mean, I love psychology, I love research, I love the investigation of new areas and the building of new models, and then going out researching, gathering data and testing that model against performance criteria.
Speaker 3:But it's flipping hard work. It's really hard sometimes to get the tools to work. To be honest, we've got better at it over the few decades that we've been in business.
Speaker 3:With the last tools that we've designed, it seems that the validation process has been smoother than with some of the ones that we designed in the beginning, or I designed in the beginning. Yet it's still a little bit rocky in places. So I think it's just to say that if you're buying a questionnaire, or a test, using Ben's language, whether it's personality or strengths or any of those tools, and you're planning to use it for assessment, for recruitment, for the identification of potential, or for development or even coaching, it's really important to make sure it's working and doing the job it's supposed to be doing, and legal defensibility is really important. So make sure it's got some of those stats that Ben talked about, so it does at least the minimum, so that you can be confident the tool is doing the job it's supposed to be doing, and for us, it's to make sure that we recommend the tools that are the most valid and most suitable for that role.
Speaker 1:Ben, what about you? What's the main thing that you've learned when helping organisations in this space?
Speaker 2:I'd agree with Amanda that it's a significant undertaking. So when we design bespoke assessments that our clients want to add to their portfolio, often they say, can we also get BPS verification, almost as an afterthought, and actually you don't realize quite how much extra work that is. We will go on that journey with you if you wish, but it's a multi-year endeavor. I think Zircon needs to be applauded for the amount of effort you put into making sure your tools meet these BPS standards, especially in an environment where some of the newer entrants aren't putting that level of rigor into it. This does come back to my other hat with the ABP as well, so advocating for business psychology means explaining, in hopefully relatively layman's terms, why you should bother.
Speaker 2:So on this podcast, people that are working in the HR space are hopefully now a bit more familiar with the importance of validity and what kind of questions to ask in order to evaluate a test properly, because without people like us talking about it, people don't know the value of asking those questions.
Speaker 3:Very, very true, actually. And I remember when I was doing my PhD, it was sponsored by a major bank, so we are talking 20-plus years ago, goodness, and the job was originally to validate a leading personality questionnaire and three psychometric tests to see if they were predictive of future potential and future performance. And they weren't. In fact, there was a negative correlation on a few. So the truth is, tools work well in some environments and not so well in others. So there are some proper questions that we should be asking ourselves about the application of those questionnaires in certain environments.
Speaker 1:So, Ben, I suppose, for our listeners, what would you say is a key takeaway that they can get from having now listened to this podcast for the last, you know, over half an hour? What would you want to leave them thinking?
Speaker 2:I mean, hopefully, that they are able to see through what in some instances can be a case of marketing fluff, and know that underneath that lie some really important questions around the reliability and validity of a test that a good test publisher should be able to answer. What you need is not just one or two case studies from the publisher themselves to say it worked here; you need independent verification, ideally from a body like the British Psychological Society, that has put the test through its paces, and a test publisher that has invested the time and effort in gaining those credentials.
Speaker 1:I think it's so interesting you said the word fluff as well, because in my role now, you know, I'm a business psychologist but I don't do so much in the product space anymore as maybe five years ago, even when I was doing my master's, and obviously I'm having lots of conversations now with clients, and generally psychology can be seen as, you know, fluffy. A lot of the topics are fluffy, but actually I think this podcast, for me, just reinforces that actually there's so much scientific rigor that goes into these products, and it adds value to the conversations that come off the back of them, or the insights that you get.
Speaker 3:How amazing, and what a privilege we have, to go out, do the research, and build these tools that people can then use around the world in order to create insight and to have great conversations. I mean, it's amazing. One of our clients is helping us with the translation of the psychological safety questionnaire, so we've translated it into Japanese and French, they've just sent us the tool in Chinese, and then we're going to do Korean. The psych safety questionnaire, for example, is being used around the world to help teams operate in a more effective way and to have better conversations.
Speaker 3:And what a privilege, I mean, from our research, to have been able to do that, and now to have it translated into the different languages, and to have the investment from our client saying they want to use the tool and roll it out around the world, and therefore they will help us with the translations, which is just so cool, so fabulous. But it's the application that's the most important. I met Chris Small, funnily enough, from Talogy, for lunch the other day, and I was saying, for me, my legacy, it's really about how can I give BeTalent wings, how can I help it fly and make a difference for people and for teams. That's the thing I'm really excited about, now we've done so much to build and research and develop.
Speaker 2:This isn't meant to... if that's the high that we go out on, that's great for the podcast, but I was just going to say, no, no, you can chop this out, but I'll say it anyway. I've often thought that you've got like a dream job, Amanda, like you're able to investigate all of these really interesting traits, and having the machinery of Zircon and all the great psychologists there as well to help go down these investigative paths, I think, is... I'm very jealous, I'm very jealous.
Speaker 3:You are right. Actually, I have created the dream job and, in fact, Sarah and the team enable me to have it, because we have a team of research psychologists who work for us, who do the research for the podcast and the research for the products. Collectively, we enjoy the research and, I wasn't going to say the fruits of our labor, but you know what I mean.
Speaker 2:A nerdy side note, talking about global use of personality questionnaires: it's really important to validate them in different countries as well. I remember another test publisher that had a scale all to do with competitiveness, and in English one of the questions was, I like to beat the opposition. And when it was translated into Japanese, they found that people were answering never, never, I never like to beat the opposition, and it's because it had been translated as literally hitting the opposition physically. So it's really important not only to do the reliability measures, but also to ask, does this predict performance in this environment as well? Because it could be different.
Speaker 3:It's true. In fact, we've translated everything back again in those countries so that we can make sure that the language has resonated and has the same meaning. But anyway, thank you, Ben. Thank you for being a guest. Thank you very much.
Speaker 1:Thank you both, because, if we're talking about gratitude, I feel very lucky to be sat on this podcast, yeah, again being enlightened by all this science that goes into it all. So, yes, thank you Ben, thank you Amanda, thank you very much, it's been a pleasure, and thank you to our listeners. If you do like listening to us talk all things psychology and business, then please do hit the follow button.
Speaker 3:Thank you, Caitlin, and thank you, Ben, and thank you again, everyone, for listening. I hope you have a wonderful and successful day.