| Outcome measure | Data to be collected | Point of collection | Sample | Source |
|---|---|---|---|---|
| Primary: Progression to HE | Does the student enter HE in the academic year 2022-23 | After endpoint (June 2023) | Post-16 only | HESA |
| Secondary: Progression (access) to host university | Does the student go on to study at the provider that delivered the summer school. | After endpoint (June 2023) | Post-16 only | HESA |
| Control variable: Attainment | KS2 (Maths and English), KS4 attainment (Attainment 8 score) | After endpoint (June 2023) | Post-16 only (KS4 attainment) | NPD |
The National Pupil Database (NPD)
Introduction to NPD
The National Pupil Database (NPD) is a longitudinal administrative data source collected by the Department for Education (DfE) that hosts child-level and school-level data on all pupils in state-funded schools across England (DfE, 2024b).1 The NPD is the DfE’s primary source of information on pupils in school and is used primarily to monitor school performance and pupil characteristics, and to inform policy and research (DfE, 2022).
Overview of the data available in the NPD
| Population covered | Children in state-funded nursery, primary, secondary, and special schools, non-maintained special schools, pupil referral units, general hospital schools, and independent schools |
|---|---|
| Unit of observation | Individual-level and school level 2 (not class level). |
| Key variables | Child-level variables such as gender, date of birth, ethnicity, postcode, language, free school meal eligibility,3 special educational needs, disability,4 social care status (looked after children and children in need),5 absences, exclusion, dates of joining and leaving a school. Individual-level variables such as attainment at Key Stage 1, 2, 3, 4 and 5 (KS1, KS2, KS3, KS4 and KS5), and information about what pupils did after they finished school. School-level variables such as the local authority of the school and the school name. Area-based variables such as the local authority area in which the child lives. |
| Years available | School census (2001/02 to 2025/26); Key Stage 1 (1997/98 to 2022/23)6; Key Stage 2 (1995/96 to 2024/25)7; Key Stage 3 (1997/98 to 2012/13)8; Key Stage 4 (2001/02 to 2024/25)9; Key Stage 5 (2001/02 to 2024/25)10; Absences and exclusions (2005/06 to 2024/25)11; Children looked after (2005/06 to 2024/25); Children in Need (2008/09 to 2024/25) |
Limitations and exclusions
Early Years Foundation Stage Profile (EYFSP), KS1 and KS2 attainment data are missing for pupils who attended independent or other non-state schools, as these schools do not submit statutory assessment returns.
The NPD school census only includes data on the number of pupils enrolled in independent and general hospital schools, but no detailed characteristics data. It also excludes data on compulsory-age children outside formal education, such as those who are home-schooled, missing in education, or too unwell to attend school. The DfE gathers these figures from local authorities and publishes them separately in the Elective Home Education (EHE), Education Otherwise Than at School (EOTAS), and Children Missing Education (CME) databases (Independent Provider of Special Education Advice, 2018; DfE, 2024a).
Access to the NPD
Access to the NPD is managed by the Department for Education (DfE). Due to the sensitive nature of the information hosted by the DfE, strict procedures are in place that limit access to the NPD. Researchers and organisations seeking to access data from the NPD must apply directly to the DfE data-sharing service. Data can then be shared through the Office for National Statistics Secure Research Service (ONS SRS). Therefore, those wishing to access the NPD must have an ONS Approved Researcher accreditation to access data via the ONS SRS (ONS, n.d.). In exceptional cases where an organisation can demonstrate a clear need, the DfE may agree to a direct data supply without ONS accreditation. A publicly available guide details the full application process (DfE, 2025).
Application of the NPD in HE
The NPD can be used in a variety of ways in HE research and evaluation, typically as a source of background characteristics before HE entry, prior attainment measures, measures of deprivation, or covariates for statistical modelling. Some examples are given below.
Poor GCSE attainment has been shown to limit access to HE (OfS, 2022). Sibieta, Greaves & Sianesi (2024) tested whether incentives (event-based12 or financial) could improve the GCSE performance of Year 11 pupils. The NPD was the data source for the primary outcome measure (GCSE results in Maths, English, and Science), pupil characteristics, and school attendance of those in the treatment and control schools. The findings showed no statistically significant effect of financial incentives on GCSE performance. However, event-based incentives had a positive effect on Maths GCSE scores, particularly among pupils with low levels of prior attainment. In this example, the NPD enabled researchers to conduct a large-scale impact evaluation without relying on new data collection (Sibieta, Greaves & Sianesi, 2024).
The NPD can also be linked with other datasets, such as those collated by the Higher Education Statistics Agency (HESA), to explore how prior attainment and background characteristics (e.g., gender, school type, socio-economic status) influence the likelihood of enrolment in HE and the type of HE institution attended. For example, Rodeiro (2019) used the NPD as the source of data on A Level subjects and attainment, prior attainment (e.g., GCSEs), and background characteristics (gender, school type, deprivation). The linked NPD-HESA administrative data were then used in a multivariate logistic regression model, in which the NPD-derived variables were included as controls to isolate the effect of A Level subject choice on HE participation (Rodeiro, 2019).
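As a rough illustration of how NPD-derived variables can enter such a model as controls, the sketch below fits a small logistic regression in plain Python. It is not the specification used by Rodeiro (2019): the variable names (`took_subject`, `gcse_z`, `deprived`), the toy records, and the simple gradient-ascent fitting routine are all assumptions made for illustration. A real analysis would use a statistical package and the linked NPD-HESA data.

```python
import math

# Hypothetical toy records, invented for illustration only:
# (took_subject, gcse_z, deprived) with outcome 1 = entered HE, 0 = did not.
rows = [
    (1, 1.2, 0), (1, -0.3, 1), (1, -0.3, 1), (1, 0.8, 1),
    (0, 0.6, 0), (0, -0.5, 0), (0, -1.2, 1), (0, 0.2, 1),
    (1, -0.8, 0), (0, 0.6, 0), (1, -1.0, 1), (0, 1.5, 1),
]
outcomes = [1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1]


def predict(weights, x):
    """Predicted probability of HE entry for one record."""
    z = weights[0] + sum(w * v for w, v in zip(weights[1:], x))
    return 1.0 / (1.0 + math.exp(-z))


def log_likelihood(weights, rows, outcomes):
    """Bernoulli log-likelihood of the observed outcomes."""
    total = 0.0
    for x, y in zip(rows, outcomes):
        p = predict(weights, x)
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total


def fit_logistic(rows, outcomes, steps=2000, lr=0.1):
    """Fit by batch gradient ascent; returns [intercept, coefficient, ...]."""
    weights = [0.0] * (len(rows[0]) + 1)
    n = len(rows)
    for _ in range(steps):
        grad = [0.0] * len(weights)
        for x, y in zip(rows, outcomes):
            err = y - predict(weights, x)
            grad[0] += err
            for j, v in enumerate(x):
                grad[j + 1] += err * v
        weights = [w + lr * g / n for w, g in zip(weights, grad)]
    return weights


weights = fit_logistic(rows, outcomes)
for name, w in zip(["intercept", "took_subject", "gcse_z", "deprived"], weights):
    # The took_subject coefficient is the association of interest,
    # estimated while holding the two control variables fixed.
    print(f"{name}: {w:.2f}")
```

Because `gcse_z` and `deprived` sit in the same model as `took_subject`, the subject coefficient is interpreted conditional on prior attainment and deprivation, which is the role the NPD-derived controls play in the example above.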
Self-reported survey data can also be linked with NPD records, but the access procedure for survey linkage is a specialised process, distinct from that for the NPD standard extract.13 For example, CFE Research (2023a) linked survey data to the NPD to assess the impact of a Uni Connect outreach programme on intermediate learner outcomes. These outcomes included knowledge of higher education, aspirations, self-efficacy, and intentions to apply to HE, which are key indicators along the causal pathway to HE participation (Thomson et al., 2022). The self-reported survey data was linked with the NPD data (prior attainment, FSM eligibility, gender, ethnicity and disability status) and tracking data from Uni Connect delivery partners (HEAT, AimHigher West Midlands, EMWPREP). Combining administrative and self-reported data gave the evaluation greater explanatory power and helped inform improvements in how the outreach programme was targeted (CFE Research, 2023b).
Background
In 2021, TASO collaborated with eight HE providers and the Behavioural Insights Team (BIT) to address the lack of causal evidence on the effectiveness of summer schools in widening participation in HE. A two-arm, parallel-group randomised controlled trial (RCT) was conducted to investigate the impact of summer schools on access to HE.14
The NPD was accessed in this evaluation to provide covariates15 in the analysis, as these variables may impact HE enrolment rates (Table 2).
Only the NPD-related control variables have been included in this table. Refer to TASO resources for the full list of outcome measures (TASO, 2023).
Process for accessing the NPD
To access the covariates required for the evaluation, BIT completed the DfE Data Sharing Service application form.16 This involved:
Describing the project's purpose, research methods and public benefit.
Specifying all NPD datasets and variables needed, using the NPD Data Tables and the Find and Explore NPD tool, and submitting the completed official data tables document.
Submitting a matching request that identified the personal identifiers used for linkage, including the Unique Reference Number (URN) and Unique ID, to enable matching between HESA data, self-reported survey data, and NPD records.
Providing evidence of ethical approval.
Indicating the mechanism for accessing the ONS SRS.
Setting up a Data Sharing Agreement with the DfE (as TASO is a third-party data controller) and supplying the required inter-organisational agreements and Data Protection Officer details.
Process of linking NPD
For the linkage, BIT will supply the DfE with the personal identifiers required to match trial participants to their records in the NPD. These will include:
First name
Last name
Date of birth
Postcode
School identifier (URN)
These identifiers will be uploaded alongside the datasets collected for the evaluation, including:
Online survey data17 (attitudes, aspirations and intentions) collected by TASO
Progression to HE and progression to host university derived from HESA via the Higher Education Access Tracker (HEAT)18
Attendance data at the summer school
Randomisation group assignment
Application data indicating which summer school participants applied to.
At the same time, BIT will submit an NPD variable request for FSM eligibility, KS2 attainment and KS4 attainment. The DfE will then carry out the matching using a fuzzy-matching approach, which uses combinations of identifiers to locate the correct pupil record. Once the matching is complete, the DfE will remove all personal identifiers and generate a pseudonymised pupil matching reference (PMR) for every participant. This same PMR will be applied to the NPD extract and to the pseudonymised versions of the trial and survey datasets submitted by BIT. All of these pseudonymised datasets will then be made available to BIT within the ONS Secure Research Service. Since the PMR will be applied consistently across both the NPD extract and the submitted trial datasets, BIT will be able to securely merge all datasets using the PMR to create a fully de-identified matched dataset for the evaluation.
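The linkage flow described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the DfE's actual matching algorithm: the tiers in `MATCH_TIERS`, the field names, the toy records, and the use of a random UUID as the PMR are all hypothetical. In practice the matching and pseudonymisation steps are carried out by the DfE, and only the final merge is performed by the research team inside the ONS SRS.

```python
import uuid

IDENTIFIERS = {"first", "last", "dob", "postcode", "urn"}

# Hypothetical tiers of identifier combinations, strictest first.
MATCH_TIERS = [
    ("first", "last", "dob", "postcode"),
    ("first", "last", "dob", "urn"),
    ("last", "dob", "postcode"),
]

# Invented records standing in for the NPD and the submitted trial data.
npd_records = [
    {"first": "Asha", "last": "Khan", "dob": "2004-09-12", "postcode": "LS1 4AB",
     "urn": "100001", "ks4_att8": 52.0, "fsm": 1},
    {"first": "Tom", "last": "Reed", "dob": "2005-01-30", "postcode": "M4 5XY",
     "urn": "100002", "ks4_att8": 41.5, "fsm": 0},
    {"first": "Liam", "last": "Jones", "dob": "2005-03-21", "postcode": "B1 2CD",
     "urn": "100003", "ks4_att8": 47.0, "fsm": 0},
]
trial_records = [
    {"first": "asha", "last": "KHAN", "dob": "2004-09-12", "postcode": "ls14ab",
     "urn": "100001", "arm": "treatment"},
    {"first": "Tom", "last": "Reed", "dob": "2005-01-30", "postcode": "M1 1AA",
     "urn": "100002", "arm": "control"},  # postcode differs, matched via URN tier
    {"first": "Eve", "last": "Stone", "dob": "2004-11-03", "postcode": "B1 2CD",
     "urn": "100003", "arm": "treatment"},  # no corresponding pupil record
]


def match_key(record, fields):
    """Lower-case and strip whitespace so formatting differences do not block a match."""
    return tuple("".join(str(record[f]).lower().split()) for f in fields)


def link(trial, npd):
    """Map trial index -> NPD index using the first tier that yields a unique match."""
    links, used = {}, set()
    for fields in MATCH_TIERS:
        index = {}
        for j, rec in enumerate(npd):
            if j not in used:
                index.setdefault(match_key(rec, fields), []).append(j)
        for i, rec in enumerate(trial):
            if i in links:
                continue
            candidates = index.get(match_key(rec, fields), [])
            if len(candidates) == 1 and candidates[0] not in used:
                links[i] = candidates[0]
                used.add(candidates[0])
    return links


def pseudonymise(trial, npd, links):
    """Drop personal identifiers and attach the same PMR to both sides of each match."""
    trial_out, npd_out = [], []
    for i, j in links.items():
        pmr = uuid.uuid4().hex
        trial_out.append({"pmr": pmr, **{k: v for k, v in trial[i].items() if k not in IDENTIFIERS}})
        npd_out.append({"pmr": pmr, **{k: v for k, v in npd[j].items() if k not in IDENTIFIERS}})
    return trial_out, npd_out


def merge_on_pmr(trial_out, npd_out):
    """The researcher-side step: join the de-identified datasets on the PMR."""
    npd_by_pmr = {row["pmr"]: row for row in npd_out}
    return [{**row, **npd_by_pmr[row["pmr"]]} for row in trial_out if row["pmr"] in npd_by_pmr]


links = link(trial_records, npd_records)
trial_pseudo, npd_pseudo = pseudonymise(trial_records, npd_records, links)
merged = merge_on_pmr(trial_pseudo, npd_pseudo)
print(merged)
```

Trying the strictest identifier combination first and falling back to looser ones only for unmatched records is one common way to trade off false matches against missed matches. Participants who cannot be matched simply do not appear in the merged dataset.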
Ethical considerations
Informed consent: Students had to be informed about what data would be collected and how it would be used through a participant information sheet. Notifying the students that they would be tracked was particularly important.
Anonymity: Data provided by the DfE was pseudonymised, with a PMR provided and identifiable data removed. This was to protect students’ identities.
Data privacy: Students who did not consent had their data removed from the project and were not tracked.
References
Footnotes
Full variable listings are available in the NPD Data Tables (Excel), under the Data Tables section (GOV.UK, 2025)↩︎
NPD school-level datasets provide aggregated attainment and census indicators at the level of the school (one row per school). School-level datasets summarise outcomes such as KS2–KS5 performance, contextual characteristics, and accountability measures from the individual level.↩︎
The NPD contains numerous FSM indicators. The EverFSM variables provide a broader measure of disadvantage, indicating whether a pupil has been FSM-eligible at any point in the last 3 years, the last 6 years, or across all available years.↩︎
Disability is only present from 2010/11 to 2011/12. The primary way in which education systems record pupils who require additional support due to learning, physical, behavioural, sensory, or developmental needs is through the special educational needs (SEN) framework. SEN data cover 2001/02 to 2025/26.↩︎
The NPD records whether a pupil is currently looked after or was previously looked after using variables derived from the Children Looked After (CLA) dataset and the Post looked after arrangements (PLAA) variable. These indicators identify both ongoing and historic care experience, including adoption, special guardianship, and residence orders.↩︎
NPD KS1 assessment data are unavailable for 2019/20 and 2020/21 due to COVID-19 cancellations.↩︎
NPD KS2 attainment data were not collected in 2019/20 or 2020/21 due to COVID-19. Data for 2021/22 are available but subject to limitations. ↩︎
KS3 data appear in the NPD only from 1997/98 to 2012/13 because national KS3 tests were abolished in 2008, and the remaining teacher assessments ceased to be collected after 2012/13. ↩︎
NPD KS4 amended data are unavailable for 2019/20 and 2020/21 because exams were cancelled. Unamended and final datasets exist but rely on teacher-assessed grades, so they must be used with caution and are not comparable with other years. ↩︎
NPD KS5 amended data are unavailable for 2019/20 and 2020/21 because exams were cancelled. Unamended and final datasets exist but rely on teacher-assessed grades, so they must be used with caution and are not comparable with other years.↩︎
Exclusions data were captured at school level from 2001/02 to 2004/05. From 2005/06 they are captured at individual level.↩︎
Event-based incentives refer to a non-financial reward in the form of a trip or group event, chosen by pupils in the year group at the start of the school term.↩︎
Linkage of survey responses with NPD data involves obtaining consent from respondents, securely transferring key identifiers (forename, surname, date of birth, postcode, and sex/gender) to the DfE, and the DfE then carrying out the matching. A combined dataset of survey responses and the requested administrative data is created, anonymised, and securely returned to researchers (DfE, 2016). ↩︎
An interim impact and analysis report on the summer schools trial can be found on the TASO website (TASO, 2023b). The final impact report is expected in 2026. ↩︎
Covariates are variables that researchers include in an analysis to control for their potential influence on the relationship between the intervention and the outcome. ↩︎
See the application form and guidance document under the ‘Application Process’ section (GOV.UK, 2025).↩︎
Only the survey responses of trial participants who have provided consent for data sharing will be supplied to the DfE. Responses from individuals who do not provide consent will be excluded.↩︎
HESA data will not be made available until summer 2025.↩︎