Talk:Content similarity detection

Suggested improvements

There is a large body of academic research in this area.

This page needs a major rewrite to include references.

The structure is fine as an outline.

The following improvements should be made:

First, this section is related to plagiarism detection in natural language, e.g. English. It is not related to detection in other areas, e.g. computer source code, sheet music, diagrams.

Search engines - these are ineffective as they cannot find text in a private database, e.g. a protected forum or an electronic archive of research articles.

Detection software - there is terminology in the research literature to describe different categorisations of software, which should be used here.

Detection algorithms - there are many proposed algorithms and comparative reviews of them exist. There is no reason why one algorithm should be singled out for inclusion.

Source code plagiarism detection - different methods are used for detecting plagiarism in computer source code, and this is again a substantial research area that needs to be referenced. Source code detection engines were developed before those for natural language and informed the development of natural language engines.

There are detection methods, other than just looking for exact text, which can be used for plagiarism detection. For instance, the Glatt detection method, or software that looks for changes in writing style within a document.

There is also a popular research area on plagiarism prevention. This involves designing out opportunities for students to plagiarise by using new or improved methods of assessment (one, perhaps draconian, example is to replace all coursework with examinations).

List of free and commercial alternatives is a total spam magnet, it must go!

The subject line of my post speaks for itself. This needs to go! I think I might be bold and just remove it myself now. If you restore it, please respond on the talk page. --Jaysweet 17:17, 13 July 2007 (UTC)

Pending edits

Better introductions to free-text and software plagiarism can be written, adding references to literature on relevant algorithms. Current lists of software can be cleaned up, perhaps by turning them into tables to make them more amenable to comparison. Khondrion (talk) 20:02, 18 September 2008 (UTC)

Unless the links are internal (i.e. lead to other Wikipedia articles, such as Copyscape) or can be sourced to news coverage in a publication meeting WP:RS, they will probably be removed as non-notable. These articles attract a great deal of spam links. Flowanda | Talk 21:50, 25 September 2008 (UTC)

Notability of existing systems

Internal links to systems that are obscure outside academia should be avoided; only JPlag and MOSS stand any chance there. Short names such as 'SIM' or 'AC' are already very ambiguous and have little impact outside academia, so they are more confusing than useful. An alternative is to use titles like 'SID (source code plagiarism)' so that the link becomes more specific. This is what I've done with 'Moss (program)'. Khondrion (talk) 11:34, 30 September 2008 (UTC)

In that case, the others probably should be moved down into the external-links section (since there's no Wikipedia topic dealing with them). Tedickey (talk) 11:53, 30 September 2008 (UTC)

Related pages

There seems to be no page on general similarity detection, either for text (natural language) or for programs. The classification of similarity detection for source code could be expanded into a full article. Additionally, I can't find any page on normalized compression distance (NCD), which is very popular in general similarity detection, especially in bioinformatics. Khondrion (talk) 11:34, 30 September 2008 (UTC)
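
For anyone unfamiliar with NCD, here is a minimal sketch of the usual definition, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed size. The choice of zlib as the compressor and the function names compressed_size and ncd are assumptions made for illustration; they are not taken from any of the tools discussed above, and practical systems often use stronger compressors.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (illustrative compressor choice)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings.

    Values near 0 suggest the inputs share most of their content;
    values near 1 suggest they have little in common.
    """
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)
```

A document compared with itself should score close to 0, while two unrelated documents should score close to 1. Real compressors only approximate the ideal, so scores can fall slightly outside the 0 to 1 range.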