Man page of BULK_EXTRACTOR

BULK_EXTRACTOR

Section: User Commands (1)
Updated: MAY 2010

NAME

bulk_extractor - Scans a disk image for regular expressions and other content.

SYNOPSIS

bulk_extractor -o output_dir [options] image
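As an illustration of the synopsis, the following sketch assembles a command line in the documented order (output directory, options, then image). The helper names build_command and run_bulk_extractor, and the file names in the usage line, are hypothetical; only the -o, -j and -r options are taken from this page.

```python
import subprocess
from typing import List, Optional


def build_command(image: str, outdir: str,
                  threads: Optional[int] = None,
                  alert_list: Optional[str] = None) -> List[str]:
    """Build an argument list following: bulk_extractor -o output_dir [options] image."""
    cmd = ["bulk_extractor", "-o", outdir]
    if threads is not None:
        cmd += ["-j", str(threads)]      # -j n: use n threads for analysis
    if alert_list is not None:
        cmd += ["-r", alert_list]        # -r alert_list.txt: alert (red) list
    cmd.append(image)                    # the image comes last
    return cmd


def run_bulk_extractor(image: str, outdir: str, **options) -> None:
    """Run the command; requires bulk_extractor to be installed and on PATH."""
    subprocess.run(build_command(image, outdir, **options), check=True)


# Hypothetical file names, shown only to illustrate the argument order.
print(" ".join(build_command("disk.img", "out", threads=4)))
```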

DESCRIPTION

bulk_extractor scans a disk image (or any other file) for a large number of pre-defined regular expressions and other kinds of content. These items are called features. When it finds a feature, bulk_extractor writes it to an output file. Each line of the output file contains the byte offset at which the feature was found, a tab, and the feature itself. Features therefore cannot contain the end-of-line character.
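The line format described above (byte offset, tab, feature) can be read back with a few lines of Python. This is a minimal sketch, not part of bulk_extractor: the function name parse_features is hypothetical, and skipping blank lines and lines that begin with '#' (for example a stamped banner) is an assumption, as is treating every offset as a plain decimal integer.

```python
from typing import Iterable, Iterator, Tuple


def parse_features(lines: Iterable[str]) -> Iterator[Tuple[int, str]]:
    """Yield (offset, feature) pairs from feature-file lines."""
    for line in lines:
        line = line.rstrip("\n")
        # Assumption: blank lines and '#' lines are not feature records.
        if not line or line.startswith("#"):
            continue
        offset_text, tab, feature = line.partition("\t")
        if not tab:
            continue  # no tab: not an "offset<TAB>feature" record
        yield int(offset_text), feature


# Made-up sample lines in the documented format.
sample = [
    "# example banner line\n",
    "1024\tuser@example.com\n",
    "4096\thttp://example.com/\n",
]
for offset, feature in parse_features(sample):
    print(offset, feature)
```

Because the feature is everything after the first tab, a feature that itself contains a tab is kept intact.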

bulk_extractor includes native support for EnCase (.E01) and AFFLIB (.aff) files, if it is compiled and linked on a system containing those libraries.

bulk_extractor is multi-threaded. The -j option sets the number of analysis threads to run. Each thread writes its results into its own feature file. The files are then combined by the primary thread when all of the secondary threads complete.
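The pattern described above (one output file per thread, merged by the primary thread once the workers finish) can be sketched as follows. This is an illustration of the pattern only, not bulk_extractor's implementation: the worker is a toy scanner that records the offset of every '@' character, and the names worker, scan and the features_N.txt file names are hypothetical.

```python
import os
import tempfile
import threading


def worker(chunk_offset: int, chunk: str, out_path: str) -> None:
    """Toy scanner: write "offset<TAB>@" for every '@' found in this chunk."""
    with open(out_path, "w") as out:
        for i, ch in enumerate(chunk):
            if ch == "@":
                out.write(f"{chunk_offset + i}\t@\n")


def scan(data: str, n_threads: int, outdir: str) -> str:
    """Split data into n_threads chunks, scan them in parallel, merge results."""
    chunk_size = max(1, -(-len(data) // n_threads))  # ceiling division
    threads, paths = [], []
    for t in range(n_threads):
        start = t * chunk_size
        path = os.path.join(outdir, f"features_{t}.txt")  # one file per thread
        paths.append(path)
        th = threading.Thread(
            target=worker, args=(start, data[start:start + chunk_size], path))
        threads.append(th)
        th.start()
    for th in threads:
        th.join()  # wait for all secondary threads to complete
    merged = os.path.join(outdir, "features.txt")
    with open(merged, "w") as out:  # the primary thread combines the files
        for path in paths:
            with open(path) as part:
                out.write(part.read())
    return merged


with tempfile.TemporaryDirectory() as outdir:
    with open(scan("a@b.c x@y.z", 2, outdir)) as f:
        print(f.read(), end="")
```

Giving each thread its own file avoids locking on a shared output file; the cost is the extra merge step at the end.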

bulk_extractor is a two-phase program. In phase 1 the features are extracted. In phase 2 a histogram is created of relevant features.
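The second phase can be pictured as counting how often each extracted feature occurs. The sketch below assumes the features are already available as (offset, feature) pairs; the function name feature_histogram and the sample data are hypothetical, and the actual histogram output format of bulk_extractor is not shown here.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def feature_histogram(pairs: Iterable[Tuple[int, str]]) -> List[Tuple[str, int]]:
    """Return (feature, count) pairs, most frequent first."""
    counts = Counter(feature for _offset, feature in pairs)
    return counts.most_common()


# Made-up phase 1 results: the same address found at two offsets.
extracted = [
    (1024, "alice@example.com"),
    (2048, "bob@example.com"),
    (8192, "alice@example.com"),
]
for feature, count in feature_histogram(extracted):
    print(count, feature)
```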

bulk_extractor will also create a wordlist of all the words that are found in the disk image. This can be used as a dictionary for cracking encryption.

The options are as follows:

-o outdir

Specifies the output directory, which will be created by bulk_extractor.

-b bannerfile.txt

Read the contents of bannerfile.txt and stamp it at the beginning of each output file. This might be useful if you have some kind of privacy banner that needs to be stamped at the top of all of your files.

-r alert_list.txt

Specifies an alert list (or red list), which is a list of terms that, if found, will be specifically flagged in a special alert file that begins with the letters ALERT. The alert list may contain individual terms, which must be found in their entirety and are case-sensitive, or wildcards with standard Unix globbing (e.g. *@company.com). Globbed terms are case-insensitive.

-w stop_list.txt

Specifies a stop list (or white list), which is a list of terms that, if found, will be placed in a special stopped file (rather than in the main file). The stop list may also contain globbed terms.

-s context_stoplist.txt

Specifies a context-dependent stoplist, which is a list of tokens to be stopped (but only when they have a particular context).

-p path/format

Open a disk image and print the information found at path. The format specification may be r for raw output or h for hex output.

-q nn

Quiet mode. Prints only every nn-th status report.
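The matching rules given for the alert list under -r (plain terms match whole features and are case-sensitive; globbed terms are case-insensitive) can be sketched in Python. This is not bulk_extractor's code: the function names are hypothetical, and deciding that a term is a glob because it contains *, ? or [ is an assumption.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List, Set, Tuple

GLOB_CHARS = "*?["  # assumption: any of these marks a term as a glob


def load_alert_list(lines: Iterable[str]) -> Tuple[Set[str], List[str]]:
    """Split alert-list lines into exact terms and lowercased glob patterns."""
    exact, globs = set(), []
    for line in lines:
        term = line.strip()
        if not term:
            continue
        if any(c in term for c in GLOB_CHARS):
            globs.append(term.lower())   # globs are matched case-insensitively
        else:
            exact.add(term)              # plain terms stay case-sensitive
    return exact, globs


def is_alert(feature: str, exact: Set[str], globs: List[str]) -> bool:
    """True if the whole feature equals a plain term or matches a glob."""
    if feature in exact:
        return True
    lowered = feature.lower()
    return any(fnmatchcase(lowered, pattern) for pattern in globs)


exact, globs = load_alert_list(["Secret\n", "*@company.com\n"])
print(is_alert("Bob@COMPANY.com", exact, globs))
print(is_alert("secret", exact, globs))
```

Lowercasing both the pattern and the feature before calling fnmatchcase gives the same result on every platform, whereas fnmatch.fnmatch follows the operating system's case rules.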

These options are useful for tuning operation:

-g n

Specifies the size of the margin in bytes.

-j n

Use n threads for analysis.

-Wn1:n2

The scan_wordlist scanner should only extract words that are between n1 and n2 characters in length.
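The length filter set by -W can be illustrated with a short sketch. It is not the scan_wordlist implementation: the function name extract_words is hypothetical, a "word" is assumed to be a run of ASCII letters, and the bounds n1 and n2 are assumed to be inclusive.

```python
import re
from typing import List


def extract_words(data: bytes, n1: int, n2: int) -> List[str]:
    """Return the sorted, de-duplicated words whose length is in [n1, n2]."""
    words = set()
    # Assumption: a word is a maximal run of ASCII letters.
    for match in re.finditer(rb"[A-Za-z]+", data):
        word = match.group()
        if n1 <= len(word) <= n2:
            words.add(word.decode("ascii"))
    return sorted(words)


# Made-up bytes standing in for disk image content.
print(extract_words(b"hi there\x00password12 a hi", 2, 5))
```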

The following options are useful for debugging:

-V

Print the version number.

-c

Enable crash protection.

-R outdir

Restarts the program from where it left off for a particular directory.

-dN

Enable debugging level N.

-B nn

Set the dedup Bloom filter to nn bits.

-M nn

Specifies a maximum recursion depth of nn.

-z pagenum

Start on page number pagenum.

-Y offset

Start at offset offset.

HISTORY

bulk_extractor is based on a feature extractor and named entity recognizer developed for SBook in 1989. The feature extractor was repurposed for disk images in 2003. The stand-alone bulk_extractor program was rewritten in 2005 and publicly released in 2007. The multi-threaded bulk_extractor was released in May 2010.

AUTHOR

Simson Garfinkel <simsong@acm.org>

This document was created by man2html, using the manual pages.
Time: 07:34:21 GMT, September 13, 2011
