Ferret is a copy-detection tool, created at the University of Hertfordshire by members of the Plagiarism Detection Group. Ferret locates duplicate text or code across multiple text documents or source files. The program is designed to detect copying (collusion) within a given set of files. Ferret works equally well with documents in natural language (such as English or German) and with source-code files in a wide range of programming languages.
Typical uses for Ferret include:
- document analysis, tracking changes to documents;
- software developers, looking for duplicate code to refactor;
- software evolution, studying how code has changed over time;
- teachers, looking for collusion or plagiarism in student work; and,
- tracking the amount of new material in the current version of a text or program.
Features of Ferret are:
- compares text documents containing natural language or computer language
- automatic conversion of standard word processor or pdf formats to text
- processing specialised for major programming languages
- choice of similarity measures to highlight individual or group similarity
- quick loading and comparison of documents, up to the computer's memory capacity
- display of all document comparisons, ranked by a similarity score
- detailed display of individual document comparisons, highlighting any text in common with the group or uniquely with the compared document
- save result table and comparisons to pdf or xml formats, for printing or further analysis
- display of unique trigrams per document/group
- template material, whose contents can be excluded from analysis
- display of engagement (overlap with template material) per document/group
Ferret is available in two implementations. The first is a standalone application, described on this page, and the second is a Ruby library suitable for use as a simple web application or to include in scripts for customised analysis. Instructions for using the Ruby library version of Ferret are available at: uhferret-gem.
For a program to unpack compressed files and extract text/scripts from certain kinds of documents, see Unpacker.
Installation and Setup
System requirements: A 32-bit computer with at least 256 MB of memory and one of the supported operating systems. Linux versions require glibc 2.15 or higher. (Software compiled on XUbuntu 12.10.)
Download and install Ferret using one of the links to the left or below, following the instructions in the folder or within the installer. If you use an installer, you should find a menu option for Ferret in your 'Office' applications menu. Check that your download is correct by comparing it against the following md5 checksums:
- Ferret 5.4 deb file for Ubuntu 12.10: md5sum 474843b8b0d114d7fa6386e8cc801f0a (Please ignore warnings about permissions.)
- Ferret 5.4 tgz file for generic Linux: md5sum 4e333f410e0efea594632c5c1d886ac7
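The checksum comparison can also be scripted. The following is a minimal sketch using Python's standard `hashlib` module; the helper names `md5sum` and `verify` and the file name in the usage note are illustrative assumptions, not part of Ferret.

```python
import hashlib


def md5sum(path, chunk_size=65536):
    """Return the md5 hex digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large downloads need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected):
    """Return True if the file's md5 digest matches the published value."""
    return md5sum(path) == expected.strip().lower()
```

For example, `verify("ferret-5.4.tgz", "4e333f410e0efea594632c5c1d886ac7")` should return `True` for an intact download of the generic Linux archive, where `ferret-5.4.tgz` stands for whatever name the file was saved under.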
Instructions for downloading and installing from source can be found on the github page.
Style sheet for displaying xml output in a web browser: uhferret.xsl.
Screenshots
Ferret is designed to be easy to use. The screen shots below show the typical stages in selecting, comparing and analysing a group of documents for copying. Help is available on most of the displays if you need further guidance.
[Screenshots: Select Files (on Linux GTK); Comparison Table (on Linux GTK)]
Files or folders can be dragged and dropped into the white area of the select files dialog for comparison, or alternatively files can be selected using the buttons on the dialog. The tab labelled 'template' can be used for adding files or folders to be treated as template (or provided) material.
The comparison table shows all pairs of files, ranked by their similarity, and provides access to the other displays through the buttons on the right of the display. Statistics on the number of files, number of comparisons, and the mean (average) similarity are displayed along the bottom. Files from the template material are labelled "TM:". A checkbox toggles the display between showing the filenames only and showing the complete path to each file.
Two further checkboxes change the calculation of the similarity measure:
- Remove common trigrams: calculates similarity only on the trigrams within the two documents alone.
- Ignore template material: ignores any trigrams in the template material when calculating the similarity.
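The effect of the two checkboxes can be sketched with plain set operations. This is only one possible reading, not Ferret's actual code: it assumes each document is represented as a set of trigrams, that similarity is the ratio of shared to total trigrams, and that "common" trigrams are those also found in other documents of the group. The function and parameter names are hypothetical.

```python
def overlap_ratio(a, b):
    """Shared trigrams divided by all trigrams in either set (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity(doc_a, doc_b, others=(), template=frozenset(),
               remove_common=False, ignore_template=False):
    """Similarity of two trigram sets under the two optional filters.

    others   -- trigram sets of the remaining documents in the group
    template -- trigram set of the template (provided) material
    """
    a, b = set(doc_a), set(doc_b)
    if ignore_template:
        # Assumption: template trigrams are dropped from both documents.
        a -= template
        b -= template
    if remove_common:
        # Assumption: trigrams seen in any other document are dropped,
        # leaving only material found in these two documents alone.
        common = set().union(*others)
        a -= common
        b -= common
    return overlap_ratio(a, b)
```

With both options off the function reduces to the plain ratio; switching either one on can only shrink the two sets, which is why the ranking of pairs may change when a checkbox is toggled.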
The tables of uniqueness or engagement can be displayed by clicking on their respective buttons.
[Two screenshots: Comparing Documents (on Linux GTK)]
The comparison of two documents forms the main analysis display of Ferret. The display shows the text in the two documents with shared trigrams highlighted: those in red are uniquely shared by the two documents, those in blue are additionally shared by other documents in the group, and those in green are shared with the template material. A list of the shared trigrams can be used to locate each trigram within the two documents.
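The colour scheme amounts to a three-way classification of the shared trigrams. The sketch below illustrates that idea under stated assumptions: documents are given as sets of trigrams, and a trigram found in both the template and other documents is treated as green. The source does not say which colour takes precedence in that case, and `classify_shared` is a hypothetical name.

```python
def classify_shared(doc_a, doc_b, others, template):
    """Map each trigram shared by doc_a and doc_b to a highlight colour.

    others   -- trigram sets of the other documents in the group
    template -- trigram set of the template material
    """
    in_others = set().union(*others)
    colours = {}
    for trigram in doc_a & doc_b:
        if trigram in template:
            colours[trigram] = "green"   # shared with the template material
        elif trigram in in_others:
            colours[trigram] = "blue"    # also shared by other documents
        else:
            colours[trigram] = "red"     # shared by these two documents only
    return colours
```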
[Screenshots: Uniqueness Table (on Linux GTK); Engagement Table (on Linux GTK)]
The table of unique trigrams shows the number of trigrams unique to each document, and gives a measure of how different each document is from the group. For teachers, this can be a useful measure of originality (or a warning to look for plagiarism).
The table of engagement shows the number of trigrams also in the template material, and gives a measure of how much each document or student has taken from the template material. For teachers, this can be a useful measure of engagement with provided materials.
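Both tables reduce to simple counts over trigram sets. The following is a rough sketch of how such counts could be computed, assuming each document is held as a set of trigrams keyed by file name; the function names are illustrative and the real application may count differently (for example, in how it treats template trigrams when judging uniqueness).

```python
from collections import Counter


def uniqueness(docs):
    """Count, per document, the trigrams found in no other document.

    docs -- dict mapping a document name to its set of trigrams
    """
    # Number of documents in which each trigram occurs.
    occurrences = Counter(t for trigrams in docs.values() for t in trigrams)
    return {name: sum(1 for t in trigrams if occurrences[t] == 1)
            for name, trigrams in docs.items()}


def engagement(docs, template):
    """Count, per document, the trigrams that also occur in the template."""
    return {name: len(trigrams & template)
            for name, trigrams in docs.items()}
```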
[Screenshot: Help Dialog (on Linux GTK)]
The Help dialog provides background information and help on interpreting and using each display in the application.
Similarity Measure
Ferret computes a similarity measure based on the trigrams found within each of the two documents under comparison; this measure is a number from 0 (no copying) to 1 (everything has been copied). This measure should not be taken as an absolute measure of the amount of copying. Instead, it is intended to indicate how much copying the current pair shows relative to the rest of the group. Pairs which appear at the top of the table of all similarity comparisons should be examined for possible copying, but the measure itself does not support any reliable conclusion.
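The text above does not give the formula, so the following is only a sketch of one measure with the stated properties: it assumes trigrams are sequences of three consecutive lower-cased words and that similarity is the number of trigrams the two documents share divided by the number of distinct trigrams in either (a Jaccard-style ratio). The tokenisation and the names `trigrams`, `similarity` and `rank_pairs` are assumptions for illustration.

```python
import re
from itertools import combinations


def trigrams(text):
    """Return the set of word trigrams in `text` (assumed tokenisation)."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}


def similarity(a, b):
    """Shared trigrams over all trigrams: 0.0 (none shared) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_pairs(docs):
    """Return (score, name_x, name_y) for every pair, highest score first.

    docs -- dict mapping a document name to its text
    """
    sets = {name: trigrams(text) for name, text in docs.items()}
    pairs = [(similarity(sets[x], sets[y]), x, y)
             for x, y in combinations(sorted(sets), 2)]
    return sorted(pairs, key=lambda pair: pair[0], reverse=True)
```

Because the score depends on document length and on how much material the whole group has in common, it makes sense to read it as a ranking aid, as described above, and not as a percentage of copied text.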
Supported Document Types
Ferret internally analyses the text within documents. For certain word-processed or pdf documents, Ferret can automatically extract the text. The following list gives the recognised document types and their file extensions.
- Text documents (.txt)
- Word processor formats (.doc, .docx, .rtf, .abw)
- Pdf documents (.pdf)
- Computer languages
  - ActionScript (.as, .actionscript)
  - C/C++ (.h, .c, .cpp)
  - C# (.cs)
  - Clojure (.clj)
  - Groovy (.groovy)
  - Haskell (.hs, .lhs)
  - Java (.java)
  - Lisp (.lisp, .lsp)
  - Lua (.lua)
  - PHP (.php)
  - Prolog (.pl)
  - Python (.py)
  - Racket (.rkt)
  - Ruby (.rb)
  - Scheme (.scm, .ss)
  - Visual Basic (.vb)
  - XML/HTML (.xml, .html)
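When preparing a folder of files for comparison, it can help to check in advance which of them fall under the list above. The lookup below is a minimal sketch built only from the extensions listed here; the category labels and the `document_type` helper are illustrative and say nothing about how Ferret itself dispatches on file type.

```python
from pathlib import Path

# Extensions copied from the list of supported document types above.
DOCUMENT_TYPES = {
    "text": {".txt"},
    "word processor": {".doc", ".docx", ".rtf", ".abw"},
    "pdf": {".pdf"},
    "code": {
        ".as", ".actionscript", ".h", ".c", ".cpp", ".cs", ".clj",
        ".groovy", ".hs", ".lhs", ".java", ".lisp", ".lsp", ".lua",
        ".php", ".pl", ".py", ".rkt", ".rb", ".scm", ".ss", ".vb",
        ".xml", ".html",
    },
}


def document_type(filename):
    """Return the category for `filename`, or None if it is not listed."""
    extension = Path(filename).suffix.lower()
    for kind, extensions in DOCUMENT_TYPES.items():
        if extension in extensions:
            return kind
    return None
```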
Credits
The original ideas and motivations for Ferret are due to Caroline Lyon and James Malcolm. Earlier versions of Ferret were created by Bob Dickerson, James Malcolm and Ruth Barrett. The red-blue-black display, the measures of uniqueness, engagement and group interactions, are all due to Pam Green.
The current version of Ferret was created in C++ by Peter Lane, under GNU/Linux. The graphical interface was developed using the wxWidgets cross-platform library. Text conversion is provided by Abiword and pdftotext.
Disclaimer
This software comes with ABSOLUTELY NO WARRANTY; users of this software use it at their own risk. Neither the University of Hertfordshire nor the individuals involved in any part of this software will accept any liability for any damage or harm caused to the data, computer or computer software by using this software.
Download
- Ferret 5.4 for Ubuntu
- Ferret 5.4 for generic Linux
- Rubygems for Ruby version
- Source for 5.4 on Github