Multiple web accessibility assessments

There’s been an awful lot written recently about the accessibility assessments in Socitm’s Better Connected reports. Some of it has been… well, let’s just say that some of it has been less than accurate! So here’s a detailed overview of the process we use for carrying out multiple assessments for projects like Better Connected, with some of the background about how we developed this process.

There are three main methods of assessing the accessibility of a website:

  1. Automated testing
  2. Expert human review
  3. End user testing

Each has its benefits and drawbacks:

  1. Automated testing - Automated tools can analyse many more web pages in a given length of time than is possible with either of the other two testing methods, and can be used to help identify and locate certain types of content to aid in expert and end user testing. However, only a limited number of issues can genuinely be tested using automated tools, and there needs to be an agreed set of defined standards to test against.
  2. Expert human review - Brings human judgement into the assessment, making it possible to take a balanced view of the relative impact on end users of any accessibility problems uncovered in the assessment. However, the inevitable cost and resource limitations mean that only a relatively small sample of pages in a site can be assessed, and care needs to be taken, when carrying out assessments of more than one website, that the judgements made are consistent and sensible.
  3. End user testing - Can reveal accessibility and usability problems which may not be spotted by other testing methods. However, it requires more time and resources than an expert review. It needs the participation of several “end user” testers, ideally with a range of different disabilities and technical skill levels. It also needs one or more experts who can evaluate the problems the testers encounter when using the site, and the comments they make, to draw out the real issues and avoid recommending changes based solely on the testers’ personal preferences.

When assessing the accessibility of a website, the ideal would be to use a combination of all three methods. However, when it’s necessary to carry out multiple assessments over a short period of time (usually to get an overview of the state of web accessibility in a particular sector), this ideal, in-depth methodology isn’t feasible. As a result, multiple assessment projects are often handled by limiting the assessment either to the use of an automated tool, or to a human expert review of just the home page of the site. Both of these methods provide some useful information, but each has clear limitations which tend to devalue the results of the assessments and any conclusions drawn from them.

Combining automated analysis with manual assessment

Several years ago, we started to look at how we might combine automated testing with human expert review, to expand the scope and validity of this kind of multiple assessment. The initial focus was to enable us to expand the number of assessments we could carry out for Socitm’s Better Connected reports, but we’ve used the same methodology on other projects which call for multiple assessments over a limited time period.

Since the most time-consuming element in a combination of this kind is the human review phase, we looked at how the data from the automated testing phase might be used as a filter, with some data elements being used to determine whether or not a site proceeds to the manual testing phase. We wanted to find a balance between minimising the time required to assess each site and maximising the benefit of having a human review phase.

The assessment sequence

When testing compliance with WCAG1.0 at both level Single-A and level Double-A, the process has several stages:

  1. Automated testing of a designated number of pages on all of the sites involved in the assessment.
  2. Analysis of the data produced by the automated assessments, to determine which sites should go forward for manual assessment at WCAG1.0 level Single-A.
  3. Manual testing to level Single-A of those sites which pass certain level Single-A criteria in the automated tests (these are detailed later in this article).
  4. Further analysis of the data produced by the automated assessments in combination with the results of the manual assessments, to determine which sites should go forward for manual assessment at WCAG1.0 level Double-A.
  5. Manual testing to level Double-A of those sites which pass certain level Single-A and level Double-A criteria in the automated tests (these are detailed later in this article) and which also pass the manual testing at level Single-A.

Quality and consistency

Throughout the various phases of this process, a senior consultant carries out spot checks of the automated and manual assessment data. The master spreadsheet used to compile all of the data has checks built into it which will flag up possible problems with the automated data, and is also used to review the collated assessment data for inconsistencies or anomalies. If any are found, the original data is checked, and if necessary, automated and/or manual checks are repeated.

Automated assessments

The tool: SiteSifter

For some years now, we have worked with a small consultancy based in Sweden - Greytower Technologies. Greytower specialise in accessibility issues, and Tina Holmboe at Greytower has developed a suite of tools which, collectively, she has called SiteSifter. SiteSifter works by first mirroring the required number of pages from the target site, and then analysing the page content locally. It can handle many different types of analysis, and can be configured to output a range of data in specified formats. For the purpose of the kind of bulk assessment we’re discussing here, we worked with Tina to define exactly what we wanted it to analyse, exactly what data we wanted it to output from that analysis, and the format we wanted to receive that data in.

The data set we currently use includes things like:

  • Number of IMG elements found.
  • Number of IMG elements found which lack an ALT attribute, with example URLs.
  • Number of BLOCKQUOTE elements found, with example URLs.
  • Number of HTML validation errors.
  • Number of deprecated HTML elements found, with example URLs.
  • etc.
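
As a rough illustration of how counts like these can be collected from a mirrored page, here is a minimal Python sketch using the standard library's `html.parser`. It is not how SiteSifter itself works; the class name, function name and dictionary keys are all made up for this example, and it covers only two of the element types listed above.

```python
from html.parser import HTMLParser


class ElementCounter(HTMLParser):
    """Counts IMG and BLOCKQUOTE elements in one HTML page (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.img_total = 0
        self.img_missing_alt = 0
        self.blockquote_total = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lower case.
        if tag == "img":
            self.img_total += 1
            # An empty ALT (alt="") still counts as having the attribute.
            if "alt" not in dict(attrs):
                self.img_missing_alt += 1
        elif tag == "blockquote":
            self.blockquote_total += 1


def count_elements(html: str) -> dict:
    """Return the counts for a single page as a plain dictionary."""
    parser = ElementCounter()
    parser.feed(html)
    parser.close()
    return {
        "img_total": parser.img_total,
        "img_missing_alt": parser.img_missing_alt,
        "blockquote_total": parser.blockquote_total,
    }
```

Summing these per-page dictionaries over every page in the sample would give site-wide totals of the kind listed above.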

How we use the automated data

Much of the data obtained from the automated analysis is used simply to direct and speed up the manual review, should the site reach that stage of the process. However, bearing in mind that WCAG1.0 is used as the basis for these assessments, there are some data elements which can be used to determine whether or not a site has failed to comply with a few specific checkpoints in WCAG1.0.


Level Single-A criteria:

  1. Images (IMG elements) with no ALT attribute.
  2. Image map hotspots (AREA elements) with no ALT attribute or an empty ALT attribute.
  3. Java applets (APPLET elements) with no ALT attribute.
  4. Frameset pages with no NOFRAMES element.
  5. Frames with no TITLE attribute.

Level Double-A criteria:

  1. Invalid HTML.
  2. No headings coded.
  3. No H1 headings coded.

Marginal allowances

A major drawback of automated analysis is that it isn’t currently possible to program an automated tool to assess the relative importance or impact of specific failures to comply with WCAG1.0 checkpoints.

For example, consider two sites, each with many images, and each with just one image which lacks an ALT attribute. An automated tool can’t assess the importance, in terms of the end user experience, of that missing ALT attribute. It will flag both sites as having failed to comply with checkpoint 1.1.

However, if a person reviews each site, they might discover that on one site, the missing ALT attribute is on a small decorative image on an archived page which is 8 years old, while on the other site, the missing ALT attribute is on the image link on the site entry page which leads into the site itself. It would clearly be ridiculous to fail the first site purely on the basis of that one missing ALT attribute if everything else is OK, whereas it’s a different matter in the second site, given the relative importance of that image link to some users’ ability to use the site. That’s the kind of decision that only a human can make.

So, when deciding which automated data could be used to determine that a site has failed to comply with WCAG1.0, we built margins into the logic used to analyse the data. These margins don’t arbitrarily assign a “pass” to a site. In our data, they are signalled as “marginal” passes, and all they do is try to ensure that a site is not unfairly failed on the basis of tiny numbers of failures showing up in the automated data. Instead, these sites go through to the manual assessment phase, where a human can then make a more balanced judgement.

For example, if more than 5% of the images found in the 200-page sample tested on a site have no ALT attribute, it’s an outright fail. But if the number of images without an ALT attribute is less than 5% of all the images found in the 200-page sample, or fewer than 10 images on sites which have very few images, then that is flagged as a “marginal” result, so that, if everything else is OK, that site will go through to the manual assessment phase where a human auditor can assess the situation. Similarly, if more than 50 HTML validation errors are found, it’s an outright fail, but if fewer than 50 HTML validation errors are found, it’s a “marginal” result, and the site has a chance of being reviewed by a human auditor.
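
The two thresholds quoted above can be sketched in Python as follows. This is an illustration rather than the logic in the master spreadsheet, and it rests on three assumptions the text does not settle: a count of zero is a plain “pass”, a result of exactly 5% or exactly 50 errors is treated as marginal, and the “fewer than 10 images” allowance is applied regardless of how many images the site has. The function names are invented for the example.

```python
def classify_missing_alt(total_images: int, missing_alt: int) -> str:
    """Classify a site's IMG/ALT result as 'pass', 'marginal pass' or 'fail'."""
    if missing_alt < 0 or missing_alt > total_images:
        raise ValueError("missing_alt must be between 0 and total_images")
    if missing_alt == 0:
        return "pass"
    # Integer arithmetic avoids floating point rounding at the 5% boundary.
    within_five_percent = missing_alt * 100 <= 5 * total_images  # assumption: exactly 5% is marginal
    only_a_few_images = missing_alt < 10  # assumption: applied to every site
    if within_five_percent or only_a_few_images:
        return "marginal pass"
    return "fail"


def classify_html_errors(error_count: int) -> str:
    """Classify a site's HTML validation result using the 50-error threshold."""
    if error_count < 0:
        raise ValueError("error_count cannot be negative")
    if error_count == 0:
        return "pass"
    if error_count <= 50:  # assumption: exactly 50 is marginal
        return "marginal pass"
    return "fail"
```

For instance, a site with 1,000 images of which 30 lack an ALT attribute (3%) comes out as a marginal pass, while one with 80 such images (8%) is an outright fail.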

The margins which are set for each checkpoint are based on the knowledge and experience we’ve gathered from 7 years of auditing websites, and reflect our awareness of the real life issues faced by web teams.

In the final reported results, we don’t differentiate between sites which pass automated tests without needing to use these marginal allowances and sites which pass because of these allowances. However, when carrying out the manual assessments, we do note where a checkpoint has been passed because of the marginal allowance, and check that particular element for importance within the site, and the impact of any individual failures in the automated tests. If necessary, we may change the marginal pass to a fail as a result of that manual inspection.

Even the best sites have imperfections and occasional small lapses. The use of these “marginal allowances” is an attempt to accommodate that fact, and to maximise the likelihood that such a site, which might otherwise fail the automated testing phase, will undergo a more balanced, human inspection.

Manual assessments

For the individual manual site assessments, we use a spreadsheet with a series of questions evolved from the checkpoints in WCAG1.0 and our knowledge and experience of assessing websites. This helps to ensure consistency when several auditors are assessing large numbers of sites.

The spreadsheet contains the data from the automated assessment of the site. This provides the auditor with background information on the assessment, and also with example URLs for specific types of content (such as applets, image maps, PDFs, etc.) if they were picked up in the automated analysis. This cuts down on the time the auditor needs to spend looking for specific elements as part of the assessment process.


Examples:

Image ALT text:

The spreadsheet shows the number of images found, plus the number and percentage (if any) of those images which lacked an ALT attribute, and the example URLs for those images. The questions are:

  1. Where ALT text is provided for images, is it meaningful and appropriate?
  2. Are the images which lack an ALT attribute all decorative or spacer images?

Use of BLOCKQUOTE and Q:

The spreadsheet shows the number of BLOCKQUOTE and Q elements found, and the example URLs for those elements. The questions are:

  1. Have the BLOCKQUOTE elements been used appropriately (i.e. for actual block quotations rather than purely to obtain a visual formatting effect)?
  2. Have the Q elements been used appropriately (i.e. for actual inline quotations rather than purely to obtain a visual formatting effect)?
  3. Did you see any other quotations on the website?
  4. Have these other quotations been coded properly as quotations using BLOCKQUOTE (for block quotations) or Q (for “inline” quotations)?

The first manual assessment phase covers the issues relevant to WCAG1.0 level Single-A. All sites which achieve a result of “pass” or “marginal pass” for the level Single-A criteria in the automated analysis undergo this phase of manual assessment.

The results from these manual assessments are combined with the automated data to determine which sites should go forward to the second phase of manual assessments. All sites which pass the manual assessment at level Single-A and which achieve a result of “pass” or “marginal pass” for the level Double-A criteria in the automated analysis go forward to the second phase of manual assessments, which covers the issues relevant to WCAG1.0 level Double-A.
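
The gating between the two manual phases can be summarised in a few lines of Python. This is only a sketch of the rules described in the two paragraphs above; the function names and the result strings are assumptions chosen for the example.

```python
# Automated results that allow a site to go forward (assumed string values).
AUTOMATED_OK = {"pass", "marginal pass"}


def goes_to_manual_single_a(auto_single_a: str) -> bool:
    """A site reaches the first manual phase on a pass or marginal pass."""
    return auto_single_a in AUTOMATED_OK


def goes_to_manual_double_a(auto_single_a: str,
                            manual_single_a_passed: bool,
                            auto_double_a: str) -> bool:
    """A site reaches the second manual phase only if it cleared the first
    manual phase and its automated Double-A result is a pass or marginal pass."""
    return (
        goes_to_manual_single_a(auto_single_a)
        and manual_single_a_passed
        and auto_double_a in AUTOMATED_OK
    )
```

A marginal pass that an auditor overturns during manual inspection would simply arrive here as `manual_single_a_passed=False`.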

In addition to answering the standard set of questions contained in the assessment spreadsheet, the auditor can flag up specific issues or decisions for a second opinion. If necessary, after discussion within the auditing team, an auditor can fail a site if they encounter an accessibility issue which isn’t directly addressed by one of these questions, but which presents a real obstacle for end users. In the few cases where this has happened over the four years we’ve been using and developing this methodology, it has always been possible to relate such issues to a WCAG1.0 checkpoint, even if the questions we use in the assessment spreadsheet don’t address it directly.