https://imagemagick.org/discourse-server/viewtopic.php?t=27037
http://lucene.472066.n3.nabble.com/Thread-Safety-td646195.html
Are AutoDetectParser and all the other parsers thread-safe?
Yes. Once initialized, a parser instance should be immutable and thus fully thread-safe.
Correct, they are just simple instances. Almost all the work is done in the parse() method.
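The reuse pattern this answer implies can be sketched with plain JDK classes. WordCountParser below is a hypothetical stand-in, not a Tika class: it has no mutable fields and does all of its work inside parse(), which is the property the answer attributes to Tika parsers. The sketch only shows one instance being shared across a thread pool; it does not prove anything about a particular Tika parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class SharedParserDemo {

    /** Hypothetical stand-in for a parser: no mutable fields, all work happens in parse(). */
    static final class WordCountParser {
        int parse(String document) {
            String trimmed = document.trim();
            return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
        }
    }

    /** Parses every document on a thread pool, reusing one parser instance for all tasks. */
    static List<Integer> parseAll(List<String> documents, int threads) {
        WordCountParser parser = new WordCountParser(); // single shared instance
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (String doc : documents) {
                Callable<Integer> task = () -> parser.parse(doc);
                futures.add(pool.submit(task));
            }
            List<Integer> counts = new ArrayList<>();
            for (Future<Integer> future : futures) {
                counts.add(future.get()); // results come back in submission order
            }
            return counts;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(parseAll(List.of("one two three", "", "single"), 4));
    }
}
```

With a real parser the shape would be the same: create the instance once, then call parse() from each worker with its own stream, handler, and metadata objects.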
To detect the language of a document, a language profile is constructed and compared with the profile of the known languages. The text set of these known languages is known as a corpus.
A corpus is a collection of texts of a written language that explains how the language is used in real situations.
The corpus is developed from books, transcripts, and other data resources like the Internet. The accuracy of the corpus depends upon the profiling algorithm we use to frame the corpus.
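To make the idea of building a profile and comparing it with known-language profiles concrete, here is a minimal sketch using character trigrams and cosine similarity. This is an illustration under stated assumptions, not Tika's actual algorithm: the class name, the choice of trigrams, the similarity measure, and the tiny sample corpus in main are all hypothetical.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Toy character-trigram profiles compared by cosine similarity (illustrative only). */
class TrigramProfileDemo {

    /** Counts overlapping 3-character sequences in the lower-cased text. */
    static Map<String, Integer> profile(String text) {
        String normalized = text.toLowerCase(Locale.ROOT);
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 3 <= normalized.length(); i++) {
            counts.merge(normalized.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    /** Cosine similarity of two profiles; 0.0 when they share no trigram. */
    static double similarity(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0;
        for (Map.Entry<String, Integer> entry : a.entrySet()) {
            int count = entry.getValue();
            int other = b.getOrDefault(entry.getKey(), 0);
            dot += (double) count * other;
        }
        double normA = norm(a);
        double normB = norm(b);
        return (normA == 0 || normB == 0) ? 0 : dot / (normA * normB);
    }

    private static double norm(Map<String, Integer> profile) {
        double sum = 0;
        for (int count : profile.values()) {
            sum += (double) count * count;
        }
        return Math.sqrt(sum);
    }

    /** Returns the corpus language whose profile is closest to the text, or "unknown". */
    static String closest(String text, Map<String, String> corpus) {
        Map<String, Integer> textProfile = profile(text);
        String best = "unknown";
        double bestScore = 0;
        for (Map.Entry<String, String> entry : corpus.entrySet()) {
            double score = similarity(textProfile, profile(entry.getValue()));
            if (score > bestScore) {
                bestScore = score;
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical one-sentence "corpus" per language; a real corpus is far larger.
        Map<String, String> corpus = Map.of(
                "en", "the cat and the dog are in the house and the garden",
                "sv", "katten och hunden är i huset och i trädgården");
        System.out.println(closest("the dog is in the garden", corpus));
    }
}
```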
A common way to detect a language is with dictionaries: the words in a given piece of text are matched against those in each language's dictionary.
A list of the most common words in a language is the simplest and most effective corpus for detecting that language, for example the articles a, an, and the in English.
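The dictionary approach can be sketched in a few lines of plain Java. The word lists below are hypothetical and far too small for real use; the class and method names are made up for illustration and are not part of Tika.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Toy dictionary-based language guesser; the word lists are illustrative only. */
class StopwordLanguageGuesser {

    // Hypothetical mini-dictionaries: a handful of very common words per language.
    private static final Map<String, Set<String>> COMMON_WORDS = new LinkedHashMap<>();
    static {
        COMMON_WORDS.put("en", Set.of("a", "an", "the", "and", "of", "is", "are", "in"));
        COMMON_WORDS.put("sv", Set.of("och", "är", "i", "en", "ett", "att", "det", "som"));
        COMMON_WORDS.put("de", Set.of("der", "die", "das", "und", "ist", "ein", "eine", "nicht"));
    }

    /** Returns the language code with the most dictionary hits, or "unknown" if nothing matches. */
    static String guess(String text) {
        // Split on anything that is not a Unicode letter so words with diacritics stay intact.
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+");
        String best = "unknown";
        int bestHits = 0;
        for (Map.Entry<String, Set<String>> entry : COMMON_WORDS.entrySet()) {
            int hits = 0;
            for (String token : tokens) {
                if (entry.getValue().contains(token)) {
                    hits++;
                }
            }
            if (hits > bestHits) {
                bestHits = hits;
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(guess("The quick brown fox is in the garden"));
        System.out.println(guess("Alla människor är födda fria och lika i värde och rättigheter."));
    }
}
```

Short texts with few common words give this kind of matcher little to work with, which is one reason profile-based detectors are used instead.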
Of the 184 languages standardized by ISO 639-1, Tika can detect 18.
https://issues.apache.org/jira/browse/TIKA-1723
The language-detector project at https://github.com/optimaize/language-detector is faster, has more languages (70 vs 13) and better accuracy than the built-in language detector.
This is a stab at integrating it, with some initial findings. There are a number of issues this raises, especially if Chris A. Mattmann moves forward with turning language detection into a pluggable extension point.
https://github.com/apache/tika/blob/master/tika-example/src/main/java/org/apache/tika/example/Language.java
LanguageDetector detector = new OptimaizeLangDetector().loadModels();
LanguageResult result = detector.detect("Alla människor är födda fria och lika i värde och rättigheter.");
https://www.tutorialspoint.com/tika/tika_language_detection.htm
http://stackoverflow.com/questions/7025915/what-is-the-best-language-detect-library-or-web-api-available-even-paid
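LanguageIdentifier is @Deprecated; use org.apache.tika.language.detect.LanguageDetector instead.
public String detectLangTika(String text) throws SystemException {
    // LanguageIdentifier is the deprecated API mentioned above
    LanguageIdentifier li = new LanguageIdentifier(text);
    if (li.isReasonablyCertain()) {
        return li.getLanguage();
    }
    // throw the declared exception type (assumes SystemException has a String constructor)
    throw new SystemException("Tika lang detection not reasonably certain");
}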
Tika IOException: mark/reset not supported
Therefore, the stream you provide to this method must support the optional mark/reset functionality. Decorate your resource stream with a BufferedInputStream.
// read audio data from whatever source (file/classloader/etc.)
InputStream audioSrc = getClass().getResourceAsStream("mySound.au");
// add buffer for mark/reset support
InputStream bufferedIn = new BufferedInputStream(audioSrc);
AudioInputStream audioStream = AudioSystem.getAudioInputStream(bufferedIn);
For Tika, the same wrapping applies before detection:
stream = new BufferedInputStream(stream);
tika.detect(stream);
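Why the wrapper helps can be shown with JDK classes alone. The sketch below is not Tika code: ensureMarkSupported and peek are hypothetical helpers, and peek only imitates what a type detector does, namely read a few header bytes and rewind so the next consumer sees the stream from the start.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.SequenceInputStream;

class MarkResetDemo {

    /** Wraps the stream in a BufferedInputStream only when it lacks mark/reset support. */
    static InputStream ensureMarkSupported(InputStream in) {
        return in.markSupported() ? in : new BufferedInputStream(in);
    }

    /** Reads up to n header bytes and rewinds, the way a detector peeks at a stream. */
    static byte[] peek(InputStream in, int n) throws IOException {
        if (!in.markSupported()) {
            throw new IOException("mark/reset not supported");
        }
        in.mark(n);
        try {
            return in.readNBytes(n);
        } finally {
            in.reset(); // rewind to the marked position
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF, 0x10, 0x20};
        // SequenceInputStream does not override markSupported(), so it reports false.
        InputStream raw = new SequenceInputStream(
                new ByteArrayInputStream(data), new ByteArrayInputStream(new byte[0]));
        InputStream stream = ensureMarkSupported(raw);
        byte[] header = peek(stream, 3);
        System.out.println("peeked " + header.length + " bytes; stream rewound: "
                + (stream.read() == 0xFF));
    }
}
```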
tika.parse(fileInputStream, metadata);
metadata.get(Metadata.IMAGE_WIDTH);
How to check that file content is really an image
You'll see that MediaType has a getType() and a getSubtype() method. What you are looking for is the type (i.e. "image" in "image/*"). The sub-type in this case would be "jpeg".
So your test should be:
if (mediaType.getType().equals("image")) {
    // Deal with image
}
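When only a MIME type string is at hand, for instance the String returned by tika.detect(stream), the same test can be written without a MediaType object. This is a minimal sketch; MimeTypeCheck and its methods are hypothetical helpers, not part of Tika.

```java
import java.util.Locale;

class MimeTypeCheck {

    /** Returns the top-level type of a MIME string, e.g. "image" for "image/jpeg". */
    static String topLevelType(String mimeType) {
        int slash = mimeType.indexOf('/');
        String type = slash < 0 ? mimeType : mimeType.substring(0, slash);
        return type.trim().toLowerCase(Locale.ROOT);
    }

    /** True when the MIME string belongs to the image/* family; null is treated as not an image. */
    static boolean isImage(String mimeType) {
        return mimeType != null && topLevelType(mimeType).equals("image");
    }

    public static void main(String[] args) {
        System.out.println(isImage("image/jpeg"));
        System.out.println(isImage("application/pdf"));
    }
}
```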
http://www.tutorialspoint.com/tika/tika_extracting_image_file.htm
http://www.jeroenreijn.com/2010/04/metadata-extraction-with-apache-tika.html
final Metadata metadata = new Metadata();
final ParseContext context = new ParseContext();
new JpegParser().parse(fileInputStream, new DefaultHandler(), metadata, context);
final String imageWidth =
MoreObjects.firstNonNull(metadata.get("tiff:ImageWidth"), metadata.get("Image Width"));
final String imageHeight =
MoreObjects.firstNonNull(metadata.get("tiff:ImageLength"), metadata.get("Image Height"));
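Note that Guava's MoreObjects.firstNonNull throws a NullPointerException when both arguments are null, so the lookup above fails for files that carry neither key. A null-tolerant variant might look like the sketch below. It uses a plain Map as a stand-in for Tika's Metadata; DimensionLookup, firstPresent, and toPixels are hypothetical names, and the "1024 pixels" value format is an assumption about how the raw Exif field may be reported.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

class DimensionLookup {

    /** Returns the value of the first key that is present with a non-null value. */
    static Optional<String> firstPresent(Map<String, String> metadata, String... keys) {
        for (String key : keys) {
            String value = metadata.get(key);
            if (value != null) {
                return Optional.of(value);
            }
        }
        return Optional.empty();
    }

    /** Parses the leading integer of a raw value such as "1024" or "1024 pixels" (assumed format). */
    static Optional<Integer> toPixels(String raw) {
        String firstToken = raw.trim().split("\\s+")[0];
        try {
            return Optional.of(Integer.parseInt(firstToken));
        } catch (NumberFormatException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        // Stand-in for the Metadata object filled by the parser; values are made up.
        Map<String, String> metadata = new HashMap<>();
        metadata.put("Image Width", "1024 pixels");

        Optional<Integer> width =
                firstPresent(metadata, "tiff:ImageWidth", "Image Width").flatMap(DimensionLookup::toPixels);
        Optional<Integer> height =
                firstPresent(metadata, "tiff:ImageLength", "Image Height").flatMap(DimensionLookup::toPixels);
        System.out.println("width=" + width + ", height=" + height);
    }
}
```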
The formal designations of the fields are:
PixelXDimension (often reported as Exif Image Width)
PixelYDimension (often reported as Exif Image Height)
http://superuser.com/questions/994666/what-fields-in-exif-files-provide-image-height-width-information
What fields in exif files provide image Height/Width information?
The relevant Exif tags are defined in the Exif 2.3 standard.