The Javadocs for the latest (development) version of Apache POI can be accessed online here, or built from a source code checkout by running the javadocs Ant task. The latest (development) Javadocs are generally updated every few weeks, so they may lag slightly behind the most recent development.
For recent releases, the Javadocs for the latest stable release of each family can also be browsed online:
++ For every release of Apache POI, the specific Javadocs for that + version are available with the release. +
++ Maven / Gradle / IDE users are able to fetch the javadocs for each + of the Apache POI jars from Maven Central (or your preferred Maven + mirror). These are made available with the javadoc classifier, + e.g. group: 'org.apache.poi', name: 'poi', version: '4.1.1', + classifier: 'javadoc' +
++ If you have downloaded the binary (bin) release, then you + can find the Javadocs within the download in the /docs/apidocs/ + folder. +
If you have downloaded the source (src) release, then you need to build your own copy. Run the javadocs Ant task to build the Javadocs; the build will tell you the output directory at the end (it varies slightly between POI versions).
++ A number of people are using POI for a variety of purposes. As with + any new API or technology, the first question people generally ask + is not "how can I" but rather "Who else is doing what I'm about to + do?" This is understandable with the abysmal success rate in the + software business. These case statements are meant to help create + confidence and understanding. +
++ We are actively seeking case studies for this page (after all it + just started). To submit a case study, either + + submit a patch for this page or email it to the + mailing list + (with [PATCH] prefixed subject, please). +
+
+ Andreas Reichel, Managing Consultant
+
+ Use Case for Apache POI in VBox Financial Reporting Software
+ Manticore Projects specializes in Financial Valuation,
+ Accounting, and Reporting under IFRS 9, IFRS 16, and IFRS 17. The software extensively leverages
+ Apache POI for importing, exporting, and visualizing data, making it a cornerstone of the solutions.
+
+ SQL Sheet Integration for Data Capture
+ The software uses and supports SQL Sheet to build
+ "Data Capture Sheets", allowing end-users to seamlessly upload structured data via Microsoft Excel
+ spreadsheets into applications. SQL Sheet, a JDBC driver
+ for XLS/XLSX files based on Apache POI, transforms worksheets into database tables, enabling access
+ through plain SQL and JDBC MetaData.
+
+ Streamlined Excel Exports for Controllers and Auditors
+ Within VBox applications, Apache POI enables interactive export of UI content, such as data tables, into
+ formatted Excel spreadsheets. This functionality provides financial controllers and auditors with easy
+ access to complex data and calculations in a familiar format.
+
+ ETL-VBox Report Builder for Regulatory Compliance
+
+ The ETL-VBox Report Builder uses Apache POI
+ to create spreadsheet-based form reports, a critical requirement for regulatory reporting. Regulatory
+ bodies often provide specific MS Excel templates with multiple sheets representing data forms and fields.
+ With Apache POI, the software visualizes these Excel templates directly in the UI, mimicking the Excel
+ experience. Non-technical users can drag and drop records or values from data cubes into the spreadsheet
+ interface. This "data to cell-range" mapping is stored and used to populate the workbook automatically,
+ ensuring reports are generated accurately and on time, such as during daily end-of-day processes.
+ One of the standout benefits of this approach is the platform-independent separation of report templates
+ (including corporate design styles, formulas, and charts) from the actual data. By leveraging Apache POI,
+ it bridges the gap between structured data and Excel's flexibility, delivering the best of both worlds for
+ end-users who love working in Excel.
+
+ Why Apache POI?
+ Apache POI has proven to be a high-performance and robust library. It is supported by comprehensive
+ documentation and an excellent community of developers. At
+ Manticore Projects, we are proud
+ contributors to this vibrant community and deeply value the collaboration that drives the evolution of this
+ indispensable tool.
+ By integrating Apache POI into our software, we empower users with intuitive and powerful features for
+ financial reporting, helping them meet their regulatory and operational needs with confidence.
+
+ This WriteExcel distribution package found at WriteExcel Utilities contains source, + documentation, examples, build tools and precompiled classes that wrap the Apache POI Excel interface. +
+
+ WriteExcel creates a Workbook file using a simple interface that uses formatted strings as the primary way
+ of passing information to the support methods which interpret the strings and issue the necessary POI method calls.
+ Access to the Workbook object allows the POI methods to be called directly for cases not handled by the interface.
+
+ An existing Workbook file can be used as a template source so that sheets can be copied and then left intact, modified and/or supplemented. +
++ The creation of Workbooks containing charts is supported by using an existing Workbook file as a template that contains one or more charts + and using WriteExcel to modify the data that the chart refers to. +
+
+ The ReadExcelFile component of the package can be used to selectively iterate across existing Workbooks (or Workbooks under construction) and create
+ Java objects with the selected data which can then be forwarded for further processing.
+
+ WriteExcel was used to produce the monthly reporting files for a church accounting system among other things.
+
+ Steve Pritchard
+ Rexcel Systems Inc.
+ July, 2019
+ Steve Pritchard Utilities
+
+ As a small startup there is no attendance management system in place. So they have a manual register where + they record attendance. There also is a biometric scanner to allow entries through the office gates, + which again maintains logs of entries. + Instead of establishing an attendance management system, they decided to make use of these biometric scanner logs and generate an + excel report instead. +
++ A blog post describes how + the startup uses Apache POI to generate reports about attendance of employees based on biometric scanner logs. +
++ A fully working solution can be found on Github. +
++ REWOO Scope is a modern and easy to use web-based enterprise content management system. It supports knowledge workers and managers in making the right decisions based upon all relevant information. +
The system uses Apache POI to extract information stored within Excel files and use it transparently within REWOO Scope. Thus, POI allows our customers to work in their standard office environment while also having all important information in the REWOO Scope system.
QuestionPro is an online service allowing businesses and individuals to create, deploy and do in-depth analysis of online surveys. The technology is built on open-source frameworks like Struts, Velocity, POI, Lucene ... the list goes on. The application is deployed on a Linux application cluster farm with a MySQL database.
++ There are quite a few competitors delivering similar solutions using Microsoft Technologies like asp and .net. One of the distinct advantages our competitors had over us was the ability to generate Excel Spreadsheets, Access Databases (MDB) etc. on the fly using the Component Object Model (COM) - since their servers were running IIS and they had access to the COM registry and such. +
++ QuestionPro's initial solution was to generate CSV files. This was easy however it was a cumbersome process for our clients to download the CSV files and then import them into Excel. Moreover, formatting information could not be preserved or captured using the CSV format. This is where POI came to our rescue. With a POI based solution, we could generate a full report with multiple sheets and all the analytical reports. To keep the solution scalable, we had a dedicated cluster for generating out the reports. +
++ + The Apache-POI project has helped QuestionPro compete with the other players in the marketplace with proprietary technology. It leveled the playing field with respect to reporting and data analysis solutions. It helped in opening doors into closed solutions like Microsoft's CDF. Today about 100 excel reports are generated daily, each with about 10-30 sheets in them. +
+ ++ Vivek Bhaskaran +
++ QuestionPro, Inc +
+ ++ POI In Action - http://www.questionpro.com/marketing/SurveyReport-289.xls +
+ ++ Sunshine Systems developed a + POI based reporting solution for a price optimization software package which + is used by major retail chains. +
The solution allowed the retailer's merchandise planners and managers to request markdown decision support reports and price change reports using a standard browser. The users could specify report type, report options, as well as company, division, and department filter criteria. Report generation took place in the multi-threaded application server and was capable of supporting many simultaneous report requests.
The reporting application collected business information from the price optimization application's Oracle database. The data was aggregated and summarized based upon the specific report type and filter criteria requested by the user. The final report was rendered as a Microsoft Excel spreadsheet using the POI HSSF API and was stored on the report database server for that specific user as a BLOB. Reports could be seamlessly and easily viewed using the same browser.
The retailers liked the solution because they had instantaneous access to critical business data through an extremely easy to use browser interface. They did not need to train the broader user community on all the complexities of the optimization application. Furthermore, the reports were generated in an Excel spreadsheet format, which everyone was familiar with and which also allowed further data analysis using standard Excel features.
+Rob Stevenson (rstevenson at sunshinesys dot com) +
++ The + Bank of Lithuania + reports financial statistical data to Excel format using the + Apache POI + project's + + HSSF API. The system is based on Oracle JServer and + utilizes a Java stored procedure that outputs to XLS format + using the HSSF API. - Arian Lashkov (alaskov at lbank.lt) +
Edwards and Kelcey Technology (http://www.ekcorp.com/) developed a Facility Management and Maintenance System for the Telecommunications industry based on Turbine and Velocity. Originally the invoicing was done with a simple CSV sheet which was then marked up by accounts and customized for each client. As growth has been consistent with the application, the requirement for invoices that need not be touched by hand increased. POI provided the solution to this issue, integrating easily and transparently into the system. POI HSSF was used to create the invoices directly from the server in Excel 97 format and now services over 150 unique invoices per month.
++ Cameron Riley (crileyNO@ SPAMekmail.com) +
ClickFind Inc. used the POI project's HSSF API to provide their medical research clients with an Excel export from their electronic data collection web service Data Collector 3.0. The POI team's assistance allowed ClickFind to give their clients a data format that requires less technical expertise than the XML format used by the Data Collector application. This was important to ClickFind as many of their current and potential clients are already using Excel in their day-to-day operations and in established procedures for handling their generated clinical data. - Jared Walker (jared.walker at clickfind.com)
+In addition to Change Management and Database Modelling, IKAN Software NV + (http://www.ikan.be/) develops and supports its own ETL + (Extract/Transform/Load) tools.
IKAN's latest product in this domain is called ETL4ALL (http://www.ikan.be/etl4all/). ETL4ALL is an open source tool allowing data transfer from and to virtually any data source. Users can combine and examine data stored in relational databases, XML databases, PDF files, EDI, CSV files, etc.
+ +It is obvious that Microsoft Excel files are also supported. + POI has been used to successfully implement this support in ETL4ALL.
++ On its ForecastWorks website + JM Lafferty Associates, Inc. produces dynamic on demand + financial analyses of companies and institutional funds. The pages produced are selected and exported + in several file formats including PPT and XLS. +
++ David Fisher (dfisher@jmlafferty.com) +
++ IDD have developed the iEXL product to + generate Excel spreadsheets directly on the Iseries/AS400 IBM I on Power platform. +
Professional spreadsheets are created via a menu system. Some basic programming is required for more complex options. When programming is required, it can be carried out using RPG, SQL, QUERY, JAVA, COBOL etc. In other words, it uses your existing staff's knowledge.
++ Design spreadsheets with: +
++ The product name is 'iEXL' and has been live on both European and North American systems for over four years. + It is being used in preference to more established commercial products which our clients have already purchased. + This is due to cost and ease of use. +
++ All spreadsheets can be archived if required so that historical spreadsheets can be retrieved. +
++ The system has benefits for all departments within an organisation. + Examples of this are accounts department for things such as aged trial balance, + distribution department for ASN’s, warehousing for stock figures, IS for security reporting etc. +
++ Clients have at this point (June 2012) created over 300 spreadsheets which in turn have generated over + 500,000 E-mails. iEXL has a menu driven email system. +
Due to the Apache-POI project, IDD have been able to create the iEXL product. This is a well-priced product which gives companies of all sizes access to a product that opens up their reporting capabilities.
++ Within the iEXLSOFTWARE.COM website you will find a full user manual, + installation instructions, a call log (Ticket) system and a downloadable 45 day trial version. +
++ Author: Mark.D.Golden +
++ Ugly Duckling focus on Software, Management and Finance. + We have recently been using Apache POI to create tools for the mortgage group of + ABN AMRO in the Netherlands. + During this project we created a number of what we call 'Robots' using the HSSF API. +
++ These robots run as services on the network and + help automate the processing of large amounts of data. Our Robots can be used to spot problems that + a human might not, and also to automate repetitive tasks. +
++ We found Apache POI to be extremely useful. We took the base API, wrapped it in a builder pattern and + thus created a DSL with a fluid interface. Throughout the project we enjoyed very much working with + Apache POI and found it to be very reliable. +
+Deutsche Bahn uses POI's HWPF component to process complex specification documents stored in the legacy Microsoft Word file format.
++ In a joint effort with other international partners, Deutsche Bahn Netz AG, + the owner of the German rail infrastructure, developed a novel software toolchain to facilitate the creation of an interoperable on-board component + for a pan-European train protection system. One part of this toolchain is a domain-specific specification processor which reads the relevant + requirements documents using Apache POI, enhances them and ultimately stores their contents as ReqIF. + Contrary to DOC, this XML-based file format allows for proper traceability and versioning in a multi-tenant environment. Thus, it lends itself much + better to the management and interchange of large sets of system requirements. The resulting ReqIF files are then consumed by the various tools in + the later stages of the software development process. +
++ Currently available, off-the-shelf software for requirement import performed very poorly on the original specification documents due to their + structural complexity and heterogeneous formatting. POI not only helped to create a superior solution thanks to its rich API. Because of its + open-source nature it also plays a key role in ensuring the maintainability of the resulting system which is expected to stay in operation for + many decades to come. +
++ POI has seen various enhancements for this challenging application. Most notably, these include the addition of extensive list numbering support, + a feature which is now part of Apache TIKA. Numerous smaller improvements, such as support for table cell background shadings, interpretation of + certain kinds of OfficeDrawings, and proper conversion of special characters, also helped to derive meaning from the input files. See + here for details. +
++ This work was funded by the German Federal Ministry of Education and Research (Grant No. 01IS12021) in the context of the ITEA2 project + openETCS. +
+The change log for POI 3.x and + older releases + can be found in the history section. +
The best way to learn about using Apache POI is to read through the feature documentation and other examples available online.
+To keep the features documentation focused on the APIs, there is little mention of some of the configuration + settings that can be enabled that may prove useful to users who have to handle very large documents or very + large throughput. +
These API methods allow you to configure the behavior of Apache POI for special needs, e.g. when processing excessively large files.

| Configuration Setting | Description |
|---|---|
| org.apache.poi.ooxml.POIXMLTypeLoader.DEFAULT_XML_OPTIONS | POI support for XSSF APIs relies heavily on XMLBeans. This instance can be configured, but take care if you change any of the config items. From POI 5.1.0, Doc Type parsing in the XML files embedded in xlsx/docx/pptx/etc files is disallowed by default. DEFAULT_XML_OPTIONS.setDisallowDocTypeDeclaration(false) will undo this change. |
| org.apache.poi.util.IOUtils.setByteArrayMaxOverride(int maxOverride) | If this value is set to > 0, IOUtils.safelyAllocate(long, int) will ignore the maximum record length parameter. This is designed to allow users to bypass the hard-coded maximum record lengths if they are willing to accept the risk of allocating memory up to the size specified. It also allows you to impose a lower limit than the default on very memory-constrained systems. Note: This is a per-allocation limit and does not allow you to limit the overall sum of allocations! Use -1 to use the limits specified per record type. |
| org.apache.poi.openxml4j.util.ZipSecureFile.setMinInflateRatio(double ratio) | Sets the ratio between deflated (compressed) and inflated (uncompressed) bytes used to detect a zip bomb. It defaults to 1% (= 0.01d), i.e. when the compression is better than 1% for any given read package part, the parsing will fail, indicating a zip bomb. |
| org.apache.poi.openxml4j.util.ZipSecureFile.setMaxEntrySize(long maxEntrySize) | Sets the maximum file size of a single zip entry. It defaults to 4GB, i.e. the 32-bit zip format maximum. This can be used to limit memory consumption and protect against security vulnerabilities when documents are provided by users. POI 5.1.0 removes the previous limit of 4GB on this setting. |
| org.apache.poi.openxml4j.util.ZipSecureFile.setMaxTextSize(long maxTextSize) | Sets the maximum number of characters of text that are extracted before an exception is thrown during extracting text from documents. This can be used to limit memory consumption and protect against security vulnerabilities when documents are provided by users. The default is approx. 10 million chars. Prior to POI 5.1.0, the max allowed was approx. 4 billion chars. |
| org.apache.poi.openxml4j.util.ZipInputStreamZipEntrySource.setThresholdBytesForTempFiles(int thresholdBytes) | Added in POI 5.1.0. Number of bytes at which a zip entry is regarded as too large for holding in memory and the data is put in a temp file instead. Defaults to -1, meaning temp files are not used and zip entries with more than 2GB of data after decompressing will fail; 0 means all zip entries are stored in temp files. A threshold like 50000000 (approx. 50 MB) is recommended. |
| org.apache.poi.openxml4j.util.ZipInputStreamZipEntrySource.setEncryptTempFiles(boolean encrypt) | Added in POI 5.1.0. Whether temp files should be encrypted (default false). Only affects temp files related to zip entries. |
| org.apache.poi.openxml4j.opc.ZipPackage.setUseTempFilePackageParts(boolean tempFilePackageParts) | Added in POI 5.1.0. Whether to save package part data in temp files to save memory (default=false). |
| org.apache.poi.openxml4j.opc.ZipPackage.setEncryptTempFilePackageParts(boolean encryptTempFiles) | Added in POI 5.1.0. Whether to encrypt package part temp files (default=false). |
| org.apache.poi.extractor.ExtractorFactory.setThreadPrefersEventExtractors(boolean preferEventExtractors) and org.apache.poi.extractor.ExtractorFactory.setAllThreadsPreferEventExtractors(Boolean preferEventExtractors) | When creating text extractors for documents, allows you to choose a different type of extractor which parses documents via an event-based parser. |
| Various classes: setMaxRecordLength(int length) | Allows you to override the default max record length for various classes which parse input data, e.g. XMLSlideShow, XSSFBParser, HSLFSlideShow, HWPFDocument, HSSFWorkbook, EmbeddedExtractor, StringUtil, ... This may be useful if you try to process very large files which otherwise trigger the excessive-memory-allocation prevention in Apache POI. |
| org.apache.poi.xslf.usermodel.XSLFPictureData.setMaxImageSize(int length) | Allows you to override the default max image size allowed for XSLF pictures. |
| org.apache.poi.xssf.usermodel.XSSFPictureData.setMaxImageSize(int length) | Allows you to override the default max image size allowed for XSSF pictures. |
| org.apache.poi.xwpf.usermodel.XWPFPictureData.setMaxImageSize(int length) | Allows you to override the default max image size allowed for XWPF pictures. |

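To make the setMinInflateRatio threshold more concrete, here is a minimal sketch of a ratio check of this kind in plain Java. It is not POI's implementation: the class `InflateRatioSketch`, the helper `looksLikeZipBomb` and the sample payloads are hypothetical, and only the 0.01 default is taken from the table above.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

class InflateRatioSketch {
    /** Default ratio quoted in the table above for ZipSecureFile.setMinInflateRatio. */
    static final double MIN_INFLATE_RATIO = 0.01d;

    /** Hypothetical helper: flags data whose compressed/uncompressed ratio is below the limit. */
    static boolean looksLikeZipBomb(long compressedBytes, long uncompressedBytes, double minRatio) {
        if (uncompressedBytes <= 0) {
            return false; // nothing inflated yet, so there is no ratio to judge
        }
        return ((double) compressedBytes / uncompressedBytes) < minRatio;
    }

    /** Compresses the input with java.util.zip.Deflater and returns the compressed size. */
    static long deflatedSize(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        while (!deflater.finished()) {
            out.write(buffer, 0, deflater.deflate(buffer));
        }
        deflater.end();
        return out.size();
    }

    public static void main(String[] args) {
        byte[] zeros = new byte[10_000_000]; // highly repetitive payload, compresses extremely well
        byte[] text = "Apache POI case studies".getBytes(StandardCharsets.UTF_8);
        System.out.println("zeros suspicious: "
                + looksLikeZipBomb(deflatedSize(zeros), zeros.length, MIN_INFLATE_RATIO));
        System.out.println("text suspicious:  "
                + looksLikeZipBomb(deflatedSize(text), text.length, MIN_INFLATE_RATIO));
    }
}
```

A legitimate but very repetitive document can trip such a check as well, which is presumably why the ratio is exposed as a setting rather than fixed.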
Apache POI supports some Java System Properties. +

| System property | Description |
|---|---|
| java.io.tmpdir | Apache POI uses the default mechanism of the JDK for specifying the location of temporary files. |
| org.apache.poi.hwpf.preserveBinTables and org.apache.poi.hwpf.preserveTextTable | Allows you to adjust how tables are handled when parsing Word documents via HWPF. |
| org.apache.poi.ss.ignoreMissingFontSystem | Added in POI 5.2.3. Instructs Apache POI to ignore some errors due to missing fonts and thus allows more functionality to work even when no fonts are installed. Note: Some functionality will still not be possible as it cannot use default values, e.g. rendering slides, drawing, ... |

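System properties such as these are typically passed on the command line (for example `-Dorg.apache.poi.ss.ignoreMissingFontSystem=true`) or set programmatically. The sketch below uses only the JDK; the class name `PoiSystemPropertiesSketch` is hypothetical, and it assumes the property is set before any POI code that reads it has run.

```java
class PoiSystemPropertiesSketch {
    /** Property name taken from the table above. */
    static final String IGNORE_MISSING_FONTS = "org.apache.poi.ss.ignoreMissingFontSystem";

    /** Sets the flag for the current JVM; call this early, before POI is used. */
    static void ignoreMissingFonts() {
        System.setProperty(IGNORE_MISSING_FONTS, "true");
    }

    public static void main(String[] args) {
        ignoreMissingFonts();
        // Boolean.getBoolean reads a system property and parses it as a boolean.
        System.out.println(IGNORE_MISSING_FONTS + "=" + Boolean.getBoolean(IGNORE_MISSING_FONTS));
        // java.io.tmpdir is best set at JVM start (-Djava.io.tmpdir=...), so it is only read here.
        System.out.println("java.io.tmpdir=" + System.getProperty("java.io.tmpdir"));
    }
}
```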
HDGF is the POI Project's pure Java implementation of the + Visio binary (VSD) file format. XDGF is the POI Project's + pure Java implementation of the Visio XML (VSDX) file format.
Currently, HDGF provides a low-level, read-only API for accessing Visio documents. It also provides a way to extract the textual content from a file.
At this time, there is no usermodel API or similar, only low-level access to the streams, chunks and chunk commands. Users are advised to check the unit tests to see how everything works. They are also well advised to read the documentation supplied with vsdump to get a feel for how Visio files are structured.
+To get a feel for the contents of a file, and to track down + where data of interest is stored, HDGF comes with + VSDDumper + to print out the contents of the file. Users should also make + use of + vsdump + to probe the structure of files.
Currently, HDGF is only able to read Visio files; it is not able to write them back out again. We believe the following are the steps that would need to be taken to implement it.
+The purpose of this document is to give a brief high level overview of the + HWPF document format. This document does not go into in-depth technical + detail and is only meant as a supplement to the Microsoft Word 97-2007 + Binary File Format freely available from + Microsoft.
+The OLE file format is not discussed in this document. It is assumed that + the reader has a working knowledge of the POIFS API.
+ +A Word file is made up of the document text and data structures + containing formatting information about the text. Of course, this is a + very simplified illustration. There are fields and macros and other + things that have not been considered. At this stage, HWPF is mainly + concerned with formatted text.
+The entry point for HWPF's reading of a Word file is the File Information + Block (FIB). This structure is the entry point for the locations and size + of a document's text and data structures. The FIB is located at the + beginning of the main stream.
The document's text is also located in the main stream. Its starting location is given as FIB.fcMin and its length is given in bytes by FIB.ccpText. These two values are not very useful in getting the text because of Unicode. There may be Unicode text intermingled with ASCII text. That brings us to the piece table.
The piece table is used to divide the text into non-Unicode and Unicode pieces. Its offset and size are given in FIB.fcClx and FIB.lcbClx respectively. The piece table may contain Property Modifiers (prm). These are for complex (fast-saved) files and are skipped. Each text piece contains offsets in the main stream that contain text for that piece. If the piece does not use Unicode (8-bit text), the file offset is masked with a certain bit. Then you have to unmask the bit and divide by 2 to get the real file offset.
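The unmasking step can be sketched in a few lines of plain Java. This assumes the flag is bit 30 (0x40000000) of the 32-bit offset stored for a piece, and that a set flag marks an 8-bit (non-Unicode) piece whose stored value must be halved, as described for the FcCompressed structure in Microsoft's format documentation. The class and method names are hypothetical and this is not HWPF's own code.

```java
class PieceOffsetSketch {
    /** Assumed flag bit: set when the piece holds 8-bit (non-Unicode) text. */
    static final int COMPRESSED_FLAG = 0x40000000;

    /** A piece is treated as Unicode when the flag bit is clear. */
    static boolean isUnicode(int rawFc) {
        return (rawFc & COMPRESSED_FLAG) == 0;
    }

    /** Returns the real offset in the main stream for the stored (possibly masked) value. */
    static int fileOffset(int rawFc) {
        if (isUnicode(rawFc)) {
            return rawFc; // stored as-is
        }
        return (rawFc & ~COMPRESSED_FLAG) / 2; // unmask the bit, then divide by 2
    }

    /** Number of bytes occupied by charCount characters of this piece. */
    static int byteLength(int rawFc, int charCount) {
        return isUnicode(rawFc) ? charCount * 2 : charCount;
    }

    public static void main(String[] args) {
        int eightBitPiece = COMPRESSED_FLAG | 0x1000; // hypothetical stored value
        int unicodePiece = 0x0800;                    // hypothetical stored value
        System.out.println(isUnicode(eightBitPiece) + " offset=" + fileOffset(eightBitPiece));
        System.out.println(isUnicode(unicodePiece) + " offset=" + fileOffset(unicodePiece));
    }
}
```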
+All text formatting is based on styles contained in the StyleSheet. + The StyleSheet is a data structure containing among other things, style + descriptions. Each style description can contain a paragraph style and + a character style or simply a character style. Each style description + is stored in a compressed version on file. Basically these are deltas + from another style.
+Eventually, you have to chain back to the nil style which is an + imaginary style with certain implied values.
+Paragraph and Character formatting properties for a document's text are + stored on file as deltas from some base style in the Stylesheet. The + deltas are used to create a complete uncompressed style in memory.
Uncompressed paragraph styles are represented by the Paragraph Properties (PAP) data structure. Uncompressed character styles are represented by the Character Properties (CHP) data structure. The styles for the document text are stored in compressed format in the corresponding Formatted Disk Pages (FKP). A compressed PAP is referred to as a PAPX and a compressed CHP is a CHPX. The FKP locations are stored in the bin table. There are separate bin tables for CHPXs and PAPXs. The bin tables' locations and sizes are stored in the FIB.
An FKP is a 512-byte OLE page. It contains the offsets of the beginning and end of each paragraph/character run in the main stream and the compressed properties for that interval. The compressed PAPX is based on its base style in the StyleSheet. The compressed CHPX is based on the enclosing paragraph's base style in the StyleSheet.
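As an illustration of how such a page can be read, the sketch below parses a character (CHPX) FKP in plain Java. The layout it assumes is the one usually described for Word 97 files and should be checked against the spec: the run count sits in the last byte, the page starts with run count + 1 four-byte offsets, one byte per run follows giving the position of its CHPX in 2-byte units, and the CHPX itself starts with a length byte. `ChpxFkpSketch` and `samplePage` are hypothetical names and the sample data is made up.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class ChpxFkpSketch {
    static final int PAGE_SIZE = 512;
    private final ByteBuffer page;

    ChpxFkpSketch(byte[] fkp) {
        if (fkp.length != PAGE_SIZE) {
            throw new IllegalArgumentException("an FKP is exactly 512 bytes");
        }
        page = ByteBuffer.wrap(fkp).order(ByteOrder.LITTLE_ENDIAN);
    }

    /** Number of character runs described by this page (assumed to be the last byte). */
    int runCount() { return page.get(PAGE_SIZE - 1) & 0xFF; }

    /** Main-stream offset where run i begins. */
    int runStart(int i) { return page.getInt(i * 4); }

    /** Main-stream offset where run i ends (the start of the next run). */
    int runEnd(int i) { return page.getInt((i + 1) * 4); }

    /** Compressed properties (grpprl) of run i; empty when the run has no delta. */
    byte[] grpprl(int i) {
        int pointerTable = (runCount() + 1) * 4;
        int chpxOffset = 2 * (page.get(pointerTable + i) & 0xFF);
        if (chpxOffset == 0) {
            return new byte[0];
        }
        byte[] out = new byte[page.get(chpxOffset) & 0xFF];
        for (int k = 0; k < out.length; k++) {
            out[k] = page.get(chpxOffset + 1 + k);
        }
        return out;
    }

    /** Builds a made-up page with two runs; only the second one carries a delta. */
    static byte[] samplePage() {
        ByteBuffer b = ByteBuffer.allocate(PAGE_SIZE).order(ByteOrder.LITTLE_ENDIAN);
        b.putInt(0, 1024).putInt(4, 1034).putInt(8, 1050); // run boundaries
        b.put(12, (byte) 0);    // run 0: no CHPX
        b.put(13, (byte) 250);  // run 1: CHPX at byte 2 * 250 = 500
        b.put(500, (byte) 3);   // CHPX length
        b.put(501, (byte) 0x35).put(502, (byte) 0x08).put(503, (byte) 1); // one 3-byte sprm
        b.put(PAGE_SIZE - 1, (byte) 2); // run count
        return b.array();
    }

    public static void main(String[] args) {
        ChpxFkpSketch fkp = new ChpxFkpSketch(samplePage());
        for (int i = 0; i < fkp.runCount(); i++) {
            System.out.println("run " + i + ": " + fkp.runStart(i) + ".." + fkp.runEnd(i)
                    + ", grpprl bytes=" + fkp.grpprl(i).length);
        }
    }
}
```

A paragraph (PAPX) FKP follows the same idea but uses a different per-run entry, so this sketch should not be applied to it unchanged.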
All compressed properties (CHPX, PAPX, SEPX) contain a grpprl. A grpprl is an array of sprms. A sprm defines a delta from some base property. There is a table of possible sprms in the Word 97 spec. Each sprm is a two-byte opcode followed by a parameter. The parameter size depends on the sprm. Each sprm describes an operation that should be performed on the base style. After every sprm in the grpprl has been applied to the base style, you will have the style for the paragraph, character run, section, etc.
HWPF is the name of our port of the Microsoft Word 97(-2007) file format to pure Java. It also provides limited read-only support for the older Word 6 and Word 95 file formats.
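The idea of applying a grpprl to a base style can be sketched as follows. This is a simplified illustration, not HWPF's parser: `GrpprlSketch` and `CharStyle` are hypothetical, only two sprms are interpreted, and the opcode values (0x0835 for bold, 0x4A43 for font size in half-points) and the rule that the top three bits of the opcode select the parameter size are assumptions taken from the published sprm tables.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class GrpprlSketch {
    /** Hypothetical, heavily reduced character style: bold flag and font size only. */
    static final class CharStyle {
        final boolean bold;
        final int halfPoints;
        CharStyle(boolean bold, int halfPoints) {
            this.bold = bold;
            this.halfPoints = halfPoints;
        }
    }

    static final int SPRM_C_F_BOLD = 0x0835; // assumed opcode: bold on/off
    static final int SPRM_C_HPS = 0x4A43;    // assumed opcode: font size in half-points

    /** Parameter size in bytes, derived from the top three bits; -1 means variable length. */
    static int operandSize(int opcode) {
        switch ((opcode >>> 13) & 0x7) {
            case 0: case 1: return 1;
            case 2: case 4: case 5: return 2;
            case 3: return 4;
            case 7: return 3;
            default: return -1;
        }
    }

    /** Applies every sprm in the grpprl to a copy of the base style. */
    static CharStyle apply(CharStyle base, byte[] grpprl) {
        boolean bold = base.bold;
        int halfPoints = base.halfPoints;
        ByteBuffer buf = ByteBuffer.wrap(grpprl).order(ByteOrder.LITTLE_ENDIAN);
        while (buf.remaining() >= 2) {
            int opcode = buf.getShort() & 0xFFFF;
            int size = operandSize(opcode);
            if (size < 0) {
                if (!buf.hasRemaining()) break;
                size = buf.get() & 0xFF; // simplification: a length byte precedes the data
            }
            if (buf.remaining() < size) break; // truncated input, stop quietly
            int start = buf.position();
            if (opcode == SPRM_C_F_BOLD) {
                int value = buf.get(start) & 0xFF;
                if (value == 0 || value == 1) bold = (value == 1); // style-relative values ignored
            } else if (opcode == SPRM_C_HPS) {
                halfPoints = buf.getShort(start) & 0xFFFF;
            } // every other sprm is skipped in this sketch
            buf.position(start + size);
        }
        return new CharStyle(bold, halfPoints);
    }

    public static void main(String[] args) {
        byte[] grpprl = {0x35, 0x08, 0x01, 0x43, 0x4A, 0x18, 0x00}; // bold on, size 24 half-points
        CharStyle result = apply(new CharStyle(false, 20), grpprl);
        System.out.println("bold=" + result.bold + ", halfPoints=" + result.halfPoints);
    }
}
```

Returning a new `CharStyle` instead of mutating the base mirrors the text above: the base style stays the shared reference point and each run gets its own expanded copy.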
+ +The partner to HWPF for the new Word 2007 .docx format is XWPF. + Whilst HWPF and XWPF provide similar features, there is not a common + interface across the two of them at this time.
+ +Both HWPF and XWPF could be described as "moderately functional". For some + use cases, especially around text extraction, support is very strong. For + others, support may be limited or incomplete, and it may be necessary to + dig down into low-level code. Error checking may be missing in places, + so it may be possible to accidentally generate invalid files. Enhancements + to fix such things are generally very well received!
+ +As detailed in the Components + Page, HWPF is contained within the poi-scratchpad-XXX.jar, while XWPF + is in the poi-ooxml-XXX.jar. You will need to ensure you include the appropriate + jars (and their dependencies!) in your classpath to use HWPF or XWPF.
+ +Please note that in version 3.12, due to a bug, you might need to include + poi-scratchpad-XXX.jar when using XWPF. This has been fixed again for the next + release as there should not be such a dependency.
Source in the org.apache.poi.hwpf.model tree is the Java representation of the internal Word format structure. This code is "internal" and should not be used by your code. Code in the org.apache.poi.hwpf.usermodel package is the actual public and (as much as possible) user-friendly API to access document parts. Source code in the org.apache.poi.hwpf.extractor tree is a wrapper around this to facilitate easy extraction of interesting things (e.g. the text), and the org.apache.poi.hwpf.converter package contains Word-to-HTML and Word-to-FO converters (the latter can be used to generate PDF from Word files when used with Apache FOP). There is also a small file-structure-dumping utility in the org.apache.poi.hwpf.dev package, primarily for development purposes.
The main entry point to HWPF is HWPFDocument. Currently it has a lot of references both to internal interfaces (the org.apache.poi.hwpf.model package) and to the public API (the org.apache.poi.hwpf.usermodel package). It is possible that it will be split into two different interfaces (like WordFile and WordDocument) in later versions.
+ ++ The main entry point to XWPF is XWPFDocument. From there, you can get the + paragraphs, pictures, tables, sections, headers etc. +
++ Currently, there are only a handful of example programs using HWPF and XWPF + available. They can be found in svn in the examples section, under + HWPF + and + XWPF. + Both HWPF and XWPF have fairly high levels of unit test coverage, which + provides examples of using the various areas of functionality of both + modules. These can be found in svn, under + HWPF + and + XWPF. + Contributions of more examples, whether inspired by the unit tests or + not, would be most welcomed! +
A .doc Word document, as handled by HWPF, can be considered as a single very long text buffer. The HWPF API provides "pointers" to document parts, like sections, paragraphs and character runs. Usually the user will iterate over the sections of the main document part, the paragraphs of each section and the character runs of each paragraph. Each such interface is a pointer to a subrange of the document text along with additional properties (and they all extend the same Range parent class). There are additional Range implementations like Table, TableRow, TableCell, etc. Some structures like Bookmark or Field can also provide subrange pointers.
Changing file content usually requires a lot of synchronized changes in those structures, like updating property boundaries, position handlers, etc. Because of that, the HWPF API should be considered not thread-safe. In addition, there is a "one pointer" rule for changing content: you should not use two different Range instances at one time. More precisely, if you change file content using some range pointer, all other range pointers except its parents become invalid. For example, if you obtain the overall range (1), a paragraph range (2) from the overall range and a character run range (3) from the paragraph range, and then change the text of the paragraph, the character run range is now invalid and should not be used, but the overall range pointer is still valid. Each time you obtain a range (pointer), a new instance is created. This means that if you obtained two range pointers and changed the document text using the first one, the second one becomes invalid.
+ +At the moment, XWPF covers many common use cases for reading and writing + .docx files. Whilst this is a great thing, it does mean that XWPF does + everything that the current POI committers need it to do, and so none of + the committers are actively adding new features.
If you need a feature in XWPF that isn't currently there, please do send in a patch to add the extra functionality! More details on contributing patches are available on the "Contribution to POI" page.
At the moment we unfortunately do not have anyone taking care of HWPF and fostering its development. What we need is someone to stand up, adopt this component and push it forward. Ryan Ackley, who put a lot of effort into HWPF, is no longer on board, so HWPF is an orphan child waiting to be adopted.
+ +If you are interested in becoming the new HWPF + pointman, you should look into the Microsoft Word internals. A good + starting point seems to be Ryan Ackley's overview. An introduction to the binary + file formats is available + from Microsoft, which has some good references and links. After that, + the full details on the word format are available from + Microsoft, + but the documentation can be a little hard to get into at first... Try reading the + overview first, and looking at the existing + code, then finally look up the documentation for specific missing features.
+ +As a first step you should familiarize yourself with the source code, + examples, test cases, and the HWPF patches available at Bugzilla (if any). Then you + should compile an overview of
+ +When you start coding, you will not yet have write access to the + SVN repository. Please submit your patches to Bugzilla and nag the dev list until someone commits + them. Besides the actual checking in of HWPF patches, current POI + committers will also do some minor reviews now and then of your source code + patches, test cases and documentation to help ensure software quality. But + most of the time you will be on your own. However, anyone offering useful + contributions over a period of time will be offered committership!
+ +Please do not forget to write JUnit test cases and documentation! + We won't accept code that doesn't come with test cases. And please + consider that other contributors should be able to understand your source + code easily. If you need any help getting started with JUnit test cases + for HWPF, please ask on the developers' mailing list! If you show that you + are prepared to stick at it you will most likely be given SVN commit + access. See "Contribution to POI" page + for more details and help getting started.
Of course we will help you as best as we can. However, presently there is no committer who is really familiar with the Word format, so you'll be mostly on your own. We are looking forward to you and your contributions! The honor and glory of becoming a POI committer are waiting!
+HWPF Milestones
| Milestones | Target Date | Owner |
|---|---|---|
| Read in a Word document with minimum formatting (no lists, tables, footnotes, endnotes, headers, footers) and write it back out with the result viewable in Word 97/2000 | 07/11/2003 | Ryan |
| Add support for Lists and Tables | 8/15/2003 | |
| HWPF 1.0-alpha release with documentation and examples | 8/18/2003 | Praveen/Ryan |
| Add support for Headers, Footers, endnotes, and footnotes | 8/31/2003 | ? |
| Add support for forms and mail merge | September/October 2003 | ? |
HWPF Task Lists
Read in a Word document with minimum formatting (no lists, tables, footnotes, endnotes, headers, footers) and write it back out with the result viewable in Word 97/2000

| Task | Target Date | Owner |
|---|---|---|
| Create classes to read and write low level data structures with test cases | 7/10/2003 | Ryan |
| Create classes to read and write FontTable and Font names with test case | 7/10/2003 | Praveen |
| Final test | 7/11/2003 | Ryan |
Develop a user-friendly API so it is fun and easy to read and write Word documents with Java.

| Task | Target Date | Owner |
|---|---|---|
| Develop a way for SPRMS to be compressed and uncompressed | | |
| Override CHPAbstractType with a concrete class that exposes attributes with human readable names | | |
| Override PAPAbstractType with a concrete class that exposes attributes with human readable names | | |
| Override SEPAbstractType with a concrete class that exposes attributes with human readable names | | |
| Override DOPAbstractType with a concrete class that exposes attributes with human readable names | | |
| Override TAPAbstractType with a concrete class that exposes attributes with human readable names | | |
| Override TCAbstractType with a concrete class that exposes attributes with human readable names | | |
| Develop a VerifyIntegrity class for testing so it is easy to determine if a Word Document is well-formed. | | |
| Develop general intuitive API to tie everything together | | |
Add support for lists and tables
| Task | Target Date | Owner |
|---|---|---|
| Add data structures for reading and writing list data with test cases. | | |
| Add data structures for reading and writing tables with test cases. | | |
HWPF 1.0-alpha release with documentation and examples
| Task | Target Date | Owner |
|---|---|---|
| Document the user model API | | |
| Document the low level classes | | |
| Come up with detailed How-To’s | | |
XWPF has a fairly stable core API, providing read and write access + to the main parts of a Word .docx file, but it isn't complete. For + some things, it may be necessary to dive down into the low level XMLBeans + objects to manipulate the ooxml structure. If you find yourself having + to do this, please consider sending in a patch to enhance that, see the + "Contribution to POI" page.
For basic text extraction, make use of
org.apache.poi.xwpf.extractor.XWPFWordExtractor. It accepts an
OPCPackage or an XWPFDocument. The getText()
method can be used to
get the text from all the paragraphs, along with tables, headers etc.
+
To get specific bits of text, first create an
org.apache.poi.xwpf.usermodel.XWPFDocument. Select the IBodyElement
of interest (Table, Paragraph etc), and from there get an XWPFRun.
Finally fetch the text and properties from that.
+
To get at the headers and footers of a Word document, first create an
org.apache.poi.xwpf.usermodel.XWPFDocument. Next, you need to create an
org.apache.poi.xwpf.model.XWPFHeaderFooterPolicy, passing it your
XWPFDocument. Finally, the XWPFHeaderFooterPolicy gives you access to the headers and
footers, including first / even / odd page ones if defined in your
document.
From an XWPFParagraph, it is possible to fetch the existing
XWPFRun elements that make up the text. To add new text,
the createRun() method will add a new XWPFRun
to the end of the list. insertNewRun(int) can instead be
used to add a new XWPFRun at a specific point in the
paragraph.
+
Once you have an XWPFRun, you can use the
setText(String) method to make changes to the text. To add
whitespace elements such as tabs and line breaks, it is necessary to use
methods like addTab() and addCarriageReturn().
+
For now, there are a limited number of XWPF examples in the + Examples Package. + Beyond those, the best source of additional examples is in the unit + tests. + Browse the XWPF unit tests. +
+HWPF is still in early development. It is in the + scratchpad section of the SVN. You will need to ensure you + either have a recent SVN checkout, or a recent SVN nightly build + (including the scratchpad jar!)
For basic text extraction, make use of
org.apache.poi.hwpf.extractor.WordExtractor. It accepts an input
stream or an HWPFDocument. The getText()
method can be used to
get the text from all the paragraphs, or getParagraphText()
can be used to fetch the text from each paragraph in turn. The other
option is getTextFromPieces(), which is very fast, but
tends to return things that aren't text from the page, so results may vary.
+
To get specific bits of text, first create a
+org.apache.poi.hwpf.HWPFDocument. Fetch the range
+with getRange(), then get paragraphs from that. You
+can then get text and other properties.
+
To get at the headers and footers of a Word document, first create an
org.apache.poi.hwpf.HWPFDocument. Next, you need to create an
org.apache.poi.hwpf.usermodel.HeaderStories, passing it your
HWPFDocument. Finally, the HeaderStories gives you access to the headers and
footers, including first / even / odd page ones if defined in your
document. Additionally, HeaderStories provides a method for removing
any macros in the text, which is helpful as many headers and footers
do end up with macros in them.
It is possible to change the text via
+ insertBefore() and insertAfter()
+ on a Range object (either a Range,
+ Paragraph or CharacterRun).
+ It is also possible to delete a Range.
+ This code will work in many, but not all cases, and patches to
+ improve it are gratefully received!
+
For now, the best source of additional examples is in the unit + tests. + Browse the HWPF unit tests. +
+HMEF is the POI Project's pure Java implementation of Microsoft's + TNEF (Transport Neutral Encoding Format), aka winmail.dat, + which is used by Outlook and Exchange in some situations.
+Currently, HMEF provides a read-only api for accessing common + message and attachment attributes, including the message body + and attachment files. In addition, it's possible to have + read-only access to all of the underlying TNEF and MAPI + attributes of the message and attachments.
+HMEF also provides a command line tool for extracting out + the message body and attachment files from a TNEF (winmail.dat) + file.
+Write support, both for saving changes and for creating new + files, is currently unavailable. Anyone interested in working + on these areas is advised to read the + Contribution Guidelines then + join the dev list!
+ +The class org.apache.poi.hmef.extractor.HMEFContentsExtractor + provides both command line and Java extraction. It allows the + saving of the message body (an RTF file), and all of the + attachment files, to a single directory as specified.
+ +From the command line, simply call the class specifying the + TNEF file to extract, and the directory to place the extracted + files into, eg:
+From Java, there are two method calls on the class, one to + extract the message body RTF to a file, and the other to extract + all the attachments to a directory. A typical use would be:
+To get at your attachments, simply call the + getAttachments() method on a HMEFMessage + instance, and you'll receive a list of all the attachments.
When you have an org.apache.poi.hmef.Attachment object, there are several helper methods available. These will all return the value of the appropriate underlying attachment attributes, or null if for some reason the attribute isn't present in your file.
+A org.apache.poi.hmef.HMEFMessage instance is created + from an InputStream of the underlying TNEF (winmail.dat) + file.
+From a HMEFMessage, there are three main methods of + interest to call:
+Both Messages and Attachments contain two kinds of attributes. + These are TNEFAttribute and MAPIAttribute.
TNEFAttribute is specific to TNEF files in terms of the available types and properties. In general, Attachments have a few more useful ones of these than Messages.
MAPIAttributes hold standard MAPI properties and values, and work in a similar way to those in HSMF (Outlook). There are typically many of these on both Messages and Attachments. Note - see limitations
Both HMEFMessage and Attachment support two different ways of getting to attributes of interest. Firstly, they support list getters, to return all attributes (either TNEF or MAPI). Secondly, they support specific getters by TNEF or MAPI property.
+To get a feel for the contents of a file, and to track down + where data of interest is stored, HMEF comes with + HMEFDumper + to print out the contents of the file.
+HMEF is currently a work-in-progress, and not everything + works yet. The current limitations are:
++ The file is made up of a number of POIFS streams. A typical + file will be made up as follows: +
If you make a change to the text of a file, but do not change how much text there is, then the CONTENTS stream will undergo a small change, and the Contents stream will undergo a large change.
+If you make a change to the text of a file, and change the + amount of text there is, then both the Contents and + the CONTENTS streams change.
+If you alter the size of a textbox, but make no text changes, + then both Contents and CONTENTS streams + change. There are no changes to the Escher streams.
+If you set the background colour of a textbox, but make + no changes to the text, (to finish off)
+First we have "CHNKINK ", followed by 24 bytes.
Next we have 20 sequences of 24 bytes each. If the first two bytes of a sequence are 0x1800, then that sequence entry exists, but if they are 0x0000 then the entry doesn't exist. If it does exist, we then have 4 bytes of upper case ASCII text, followed by three little endian shorts. The first of these seems to be the count of that type, the second is usually 1, the third is usually zero. Then we have another 4 bytes of upper case ASCII text, normally but not always the same as the first text. Finally, we have an unsigned little endian 32 bit offset to the start of the data for this, then an unsigned little endian 32 bit length of this section.
+Normally, the first sequence entry is for TEXT, and the text data + will start at 0x200. After that is normally two or three STSH entries + (so the first short has values 0, then 1, then 2). After that it + seems to vary.
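The header layout described above can be sketched in plain Java. This is only an illustration built from the description on this page, not POI's own parser: the class name `QuillHeaderSketch`, its fields, and the assumption that the 20 sequences begin directly after the 8 byte marker and the 24 bytes that follow it (offset 0x20) are all choices made for this example. The `sample()` method builds a synthetic stream so the sketch can run without a real Publisher file.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: follows the layout described in the text.
class QuillHeaderSketch {
    static final int SLOT_COUNT = 20;      // 20 sequences ...
    static final int SLOT_SIZE = 24;       // ... of 24 bytes each
    static final int FIRST_SLOT = 8 + 24;  // assumed: after "CHNKINK " and the next 24 bytes

    /** One sequence entry that is marked as existing. */
    static final class Entry {
        final String thingType; // first 4 ASCII characters, e.g. TEXT or STSH
        final int first;        // first short
        final int second;       // second short, usually 1
        final int third;        // third short, usually 0
        final String bitType;   // second 4 ASCII characters
        final long offset;      // unsigned 32 bit start of the data
        final long length;      // unsigned 32 bit length of the data

        Entry(String thingType, int first, int second, int third,
              String bitType, long offset, long length) {
            this.thingType = thingType;
            this.first = first;
            this.second = second;
            this.third = third;
            this.bitType = bitType;
            this.offset = offset;
            this.length = length;
        }
    }

    static List<Entry> parse(byte[] contents) {
        if (contents.length < FIRST_SLOT + SLOT_COUNT * SLOT_SIZE) {
            throw new IllegalArgumentException("Stream too short for the header");
        }
        String magic = new String(contents, 0, 8, StandardCharsets.US_ASCII);
        if (!"CHNKINK ".equals(magic)) {
            throw new IllegalArgumentException("Missing CHNKINK marker");
        }
        ByteBuffer buf = ByteBuffer.wrap(contents).order(ByteOrder.LITTLE_ENDIAN);
        List<Entry> entries = new ArrayList<>();
        for (int i = 0; i < SLOT_COUNT; i++) {
            int base = FIRST_SLOT + i * SLOT_SIZE;
            // The bytes 0x18 0x00 mark a sequence entry that exists
            if (contents[base] != 0x18 || contents[base + 1] != 0x00) {
                continue;
            }
            String thingType = new String(contents, base + 2, 4, StandardCharsets.US_ASCII);
            int first = buf.getShort(base + 6) & 0xFFFF;
            int second = buf.getShort(base + 8) & 0xFFFF;
            int third = buf.getShort(base + 10) & 0xFFFF;
            String bitType = new String(contents, base + 12, 4, StandardCharsets.US_ASCII);
            long offset = buf.getInt(base + 16) & 0xFFFFFFFFL;
            long length = buf.getInt(base + 20) & 0xFFFFFFFFL;
            entries.add(new Entry(thingType, first, second, third, bitType, offset, length));
        }
        return entries;
    }

    /** Builds a synthetic stream with a single TEXT entry, for demonstration only. */
    static byte[] sample() {
        byte[] data = new byte[0x200 + 10];
        ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
        buf.put("CHNKINK ".getBytes(StandardCharsets.US_ASCII));
        buf.position(FIRST_SLOT);
        buf.put((byte) 0x18).put((byte) 0x00);
        buf.put("TEXT".getBytes(StandardCharsets.US_ASCII));
        buf.putShort((short) 0).putShort((short) 1).putShort((short) 0);
        buf.put("TEXT".getBytes(StandardCharsets.US_ASCII));
        buf.putInt(0x200).putInt(10);
        return data;
    }

    public static void main(String[] args) {
        for (Entry e : parse(sample())) {
            System.out.println(e.thingType + "/" + e.bitType
                    + " offset=0x" + Long.toHexString(e.offset) + " length=" + e.length);
        }
    }
}
```

Reading the shorts and ints through a little endian `ByteBuffer` with absolute indexes keeps the offsets in the code directly comparable with the byte positions given in the description.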
+At 0x200 we have the text, stored as little endian 16 bit unicode.
+After the text comes all sorts of other stuff, presumably as + described by the sequences.
+For a contents stream of length 7168 / 0x1c00 bytes, the start + looks something like:
We think that the first 4 bytes of text describe the function of the data at the offset. The first short is then the count of that type, eg the 2nd will have 1. We think that the second 4 bytes of text describe the format of the data block at the offset. The format of the text block is easy, but we're still trying to figure out the others.
+ +This is very simple. All the text for the document is + stored in a single bit of the Quill CONTENTS. The text + is stored as little endian 16 bit unicode strings.
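As a small, hedged illustration of the text storage just described, the following stand-alone Java sketch decodes a byte range as little endian 16 bit text. The class and method names are made up for this example, and the offset and length would in practice come from the matching sequence entry in the header; here they are hard-coded for a synthetic buffer.

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch only: decodes a slice of a byte array as UTF-16LE text.
class QuillTextSketch {
    /** Decodes {@code length} bytes starting at {@code offset} as little endian 16 bit text. */
    static String decodeText(byte[] contents, int offset, int length) {
        return new String(contents, offset, length, StandardCharsets.UTF_16LE);
    }

    /** Builds a synthetic buffer with "Hello" stored at 0x200, for demonstration only. */
    static byte[] sample() {
        byte[] raw = "Hello".getBytes(StandardCharsets.UTF_16LE); // 10 bytes
        byte[] contents = new byte[0x200 + raw.length];
        System.arraycopy(raw, 0, contents, 0x200, raw.length);
        return contents;
    }

    public static void main(String[] args) {
        System.out.println(decodeText(sample(), 0x200, 10));
    }
}
```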
The first four bytes seem to hold the count of the entries in the bit, and the second four bytes seem to hold the type. There is then some pre-data, and then data for each of the entries, the exact format dependent on the type.
+Type 0 has 4 2 byte unsigned ints, then a pair of 2 byte + unsigned ints for each entry.
+Type 4 has 4 2 byte unsigned ints, then a pair of 4 byte + unsigned ints for each entry.
+Type 8 has 7 2 byte unsigned ints, then a pair of 4 byte + unsigned ints for each entry.
+Type 12 holds hyperlinks, and is very much more complex.
+ See org.apache.poi.hpbf.model.qcbits.QCPLCBit
+ for our best guess as to how the contents match up.
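The simpler types above (0, 4 and 8) can be sketched as follows. This is a stand-alone illustration of the description on this page, not a copy of QCPLCBit: the class name `PlcBitSketch`, its fields, and the reading of the first four bytes as an entry count are assumptions, and type 12 is deliberately left out.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Illustrative sketch only: reads the count, type, pre-data and value pairs.
class PlcBitSketch {
    final long entryCount;
    final long type;
    final int[] preData;   // the 2 byte unsigned values before the entries
    final long[][] pairs;  // one pair of unsigned values per entry

    PlcBitSketch(long entryCount, long type, int[] preData, long[][] pairs) {
        this.entryCount = entryCount;
        this.type = type;
        this.preData = preData;
        this.pairs = pairs;
    }

    static PlcBitSketch parse(byte[] bit) {
        ByteBuffer buf = ByteBuffer.wrap(bit).order(ByteOrder.LITTLE_ENDIAN);
        long count = buf.getInt() & 0xFFFFFFFFL;
        long type = buf.getInt() & 0xFFFFFFFFL;

        int preCount;  // number of 2 byte pre-data values
        boolean wide;  // true if each pair member is 4 bytes, false if 2 bytes
        if (type == 0) {
            preCount = 4;
            wide = false;
        } else if (type == 4) {
            preCount = 4;
            wide = true;
        } else if (type == 8) {
            preCount = 7;
            wide = true;
        } else {
            throw new IllegalArgumentException("Type " + type + " is not covered by this sketch");
        }

        int[] pre = new int[preCount];
        for (int i = 0; i < preCount; i++) {
            pre[i] = buf.getShort() & 0xFFFF;
        }

        // Guard against a count that does not fit the remaining data
        long needed = count * 2L * (wide ? 4 : 2);
        if (needed > buf.remaining()) {
            throw new IllegalArgumentException("Not enough data for " + count + " entries");
        }
        long[][] pairs = new long[(int) count][2];
        for (int i = 0; i < pairs.length; i++) {
            pairs[i][0] = read(buf, wide);
            pairs[i][1] = read(buf, wide);
        }
        return new PlcBitSketch(count, type, pre, pairs);
    }

    private static long read(ByteBuffer buf, boolean wide) {
        return wide ? (buf.getInt() & 0xFFFFFFFFL) : (buf.getShort() & 0xFFFFL);
    }

    /** Builds a synthetic type 0 bit with two entries, for demonstration only. */
    static byte[] sample() {
        ByteBuffer buf = ByteBuffer.allocate(8 + 4 * 2 + 2 * 2 * 2).order(ByteOrder.LITTLE_ENDIAN);
        buf.putInt(2).putInt(0);
        for (int i = 0; i < 4; i++) {
            buf.putShort((short) 0);
        }
        buf.putShort((short) 1).putShort((short) 2).putShort((short) 3).putShort((short) 4);
        return buf.array();
    }

    public static void main(String[] args) {
        PlcBitSketch bit = parse(sample());
        System.out.println("type=" + bit.type + " entries=" + bit.entryCount);
    }
}
```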
HPBF is the POI Project's pure Java implementation of the + Publisher file format.
+Currently, HPBF is in an early stage, whilst we try to + figure out the file format. So far, we have basic text + extraction support, and are able to read some parts within + the file. Writing is not yet supported, as we are unable + to make sense of the Contents stream, which we think has + lots of offsets to other parts of the file.
+Our initial aim is to produce a text extractor for the format + (now done), and be able to extract hyperlinks from within + the document (partly supported). Additional low level + code to process the file format may follow, if there + is demand and developer interest warrants it.
+Text Extraction is available via the + org.apache.poi.hpbf.extractor.PublisherTextExtractor + class.
+At this time, there is no usermodel api or similar. + There is only low level support for certain parts of + the file, but by no means all of it.
+Our current understanding of the file format is documented + here.
+As of 2017, we are unaware of a public format specification for + Microsoft Publisher .pub files. This format was not included in + the Microsoft Open Specifications Promise with the rest of the + Microsoft Office file formats. + As of 2009 and 2016, Microsoft had no plans to document the .pub file format. + If this changes in the future, perhaps we will see a spec published + on the Microsoft Office File Format Open Specification Technical Documentation. +
+ +This HOW-TO is organized in four sections. You should read them + sequentially because the later sections build upon the earlier ones.
+ +If all you are interested in is getting the textual content of
+ all the document properties, such as for full text indexing, then
+ take a look at
+ org.apache.poi.hpsf.extractor.HPSFPropertiesExtractor. However,
+ if you want full access to the properties, please read on!
The first thing you should understand is that a Microsoft Office file is not one large bunch of bytes but has an internal filesystem structure with files and directories. You can access these files and directories using the POI filesystem (POIFS). A file or document in a POI filesystem is also called a stream. The properties of, say, an Excel document are stored apart from the actual spreadsheet data in separate streams. The good news is that this separation makes the properties independent of the concrete Microsoft Office file. In the following text we will always say "POI filesystem" instead of "Microsoft Office file" because a POI filesystem is not necessarily created by or for a Microsoft Office application, because it is shorter, and because we want to avoid the name of That Redmond Company.
+ +The following example shows how to read the "title" property. Reading
+ other properties is similar. Consider the API documentation of the class
+ org.apache.poi.hpsf.SummaryInformation to learn which methods
+ are available.
The standard properties this section focuses on can be found in a + document called \005SummaryInformation located in the root of the + POI filesystem. The notation \005 in the document's name means + the character with a decimal value of 5. In order to read the "title" + property, an application has to perform the following steps:
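1. Open the document \005SummaryInformation located in the root of the POI filesystem.
2. Create an instance of the class SummaryInformation from that document.
3. Call the SummaryInformation instance's getTitle() method.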
+ Sounds easy, doesn't it? Here are the steps in detail.
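Before going into the steps, a brief aside on the \005 notation used for the stream name: in Java source code the same name can be written with an octal escape. The stand-alone sketch below only demonstrates the string itself; the class and constant names are made up for this example and are not part of POI.

```java
// Illustrative sketch only: shows how the \005 notation maps onto a Java string.
class StreamNameSketch {
    // \005 is an octal escape for the character with the decimal value 5
    static final String SUMMARY_INFORMATION = "\005SummaryInformation";

    public static void main(String[] args) {
        // The first character has the numeric value 5
        System.out.println((int) SUMMARY_INFORMATION.charAt(0));
        // The Unicode escape for the same character gives an equal string
        System.out.println(SUMMARY_INFORMATION.equals("\u0005SummaryInformation"));
    }
}
```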
+ + +An application that wants to open a document in a POI filesystem + (POIFS) proceeds as shown by the following code fragment. The full + source code of the sample application is available in the + examples section of the POI source tree as + ReadTitle.java.
+ +The first interesting statement is
+ +It creates a
+ org.apache.poi.poifs.eventfilesystem.POIFSReader instance
+ which we shall need to read the POI filesystem. Before the application
+ actually opens the POI filesystem we have to tell the
+ POIFSReader which documents we are interested in. In this
+ case the application should do something with the document
+ \005SummaryInformation.
This method call registers a
+ org.apache.poi.poifs.eventfilesystem.POIFSReaderListener
+ with the POIFSReader. The POIFSReaderListener
+ interface specifies the method processPOIFSReaderEvent()
+ which processes a document. The class
+ MyPOIFSReaderListener implements the
+ POIFSReaderListener and thus the
+ processPOIFSReaderEvent() method. The eventing POI
+ filesystem calls this method when it finds the
+ \005SummaryInformation document. In the sample application
+ MyPOIFSReaderListener is a static class in the
+ ReadTitle.java source file.
Now everything is prepared and reading the POI filesystem can + start:
+ +The following source code fragment shows the
+ MyPOIFSReaderListener class and how it retrieves the
+ title.
The line
+ +declares a SummaryInformation variable and initializes it
+ with null. We need an instance of this class to access the
+ title. The instance is created in a try block:
The expression event.getStream() returns the input stream
+ containing the bytes of the property set stream named
+ \005SummaryInformation. This stream is passed into the
+ create method of the factory class
+ org.apache.poi.hpsf.PropertySetFactory which returns
+ a org.apache.poi.hpsf.PropertySet instance. It is more or
+ less safe to cast this result to SummaryInformation, a
+ convenience class with methods like getTitle(),
+ getAuthor() etc.
The PropertySetFactory.create() method may throw all sorts
+ of exceptions. We'll deal with them in the next sections. For now we just
+ catch all exceptions and throw a RuntimeException
+ containing the message text of the origin exception.
If all goes well, the sample application retrieves the title and prints + it to the standard output. As you can see you must be prepared for the + case that the POI filesystem does not have a title.
Please note that a POI filesystem does not necessarily contain the
\005SummaryInformation stream. The documents created by the
Microsoft Office suite have one, as far as I know. However, an Excel
spreadsheet exported from StarOffice 5.2 won't have a
\005SummaryInformation stream. In this case the application
won't throw an exception but simply does not call the
processPOIFSReaderEvent method. You have been warned!
A couple of additional standard properties are not + contained in the \005SummaryInformation stream explained + above. Examples for such properties are a document's category or the + number of multimedia clips in a PowerPoint presentation. Microsoft has + invented an additional stream named + \005DocumentSummaryInformation to hold these properties. With two + minor exceptions you can proceed exactly as described above to read the + properties stored in \005DocumentSummaryInformation:
+ +SummaryInformation by
+ DocumentSummaryInformation.And of course you cannot call getTitle() because
+ DocumentSummaryInformation has different query methods,
+ e.g. getCategory. See the Javadoc API documentation for the
+ details.
In the previous section the application simply caught all exceptions and was in no way interested in any details. However, a real application will likely want to know what went wrong and act appropriately. Besides any I/O exceptions there are three HPSF- or POI-specific exceptions you should know about:
+ +NoPropertySetStreamException:PropertySet instance from a stream that is not a
+ property set stream. (SummaryInformation and
+ DocumentSummaryInformation are subclasses of
+ PropertySet.) A faulty property set stream counts as not
+ being a property set stream at all. An application should be prepared to
+ deal with this case even if it opens streams named
+ \005SummaryInformation or
+ \005DocumentSummaryInformation. These are just names. A
+ stream's name by itself does not ensure that the stream contains the
+ expected contents and that this contents is correct.
+ UnexpectedPropertySetTypeExceptionSummaryInformation or
+ DocumentSummaryInformation) but the provided property
+ set is not of that type.MarkUnsupportedExceptionInputStream.mark(int) operation. The POI filesystem uses
+ the DocumentInputStream class which does support this
+ operation, so you are safe here. However, if you read a property set
+ stream from another kind of input stream things may be
+ different.Many Microsoft Office documents contain embedded
+ objects, for example an Excel sheet within a Word
+ document. Embedded objects may have property sets of their own. An
+ application can open these property set streams as described above. The
+ only difference is that they are not located in the POI filesystem's root
+ but in a nested directory instead. Just register a
+ POIFSReaderListener for the property set streams you are
+ interested in.
As explained above, standard properties are located in the summary
+ information and document summary information streams of typical POI
+ filesystems. You have already learned about the classes
+ SummaryInformation and
+ DocumentSummaryInformation and their get...()
+ methods for reading standard properties. These classes also provide
+ set...() methods for writing properties.
After setting properties in SummaryInformation or
+ DocumentSummaryInformation you have to write them to a disk
+ file. The following sample program shows how you can
The complete source code of this program is available as + ModifyDocumentSummaryInformation.java in the examples + section of the POI source tree.
+ +set...() methods of the class
+ SummaryInformation.The first step is to read the POI filesystem into memory:
+ +The code snippet above assumes that the variable
+ poiFilesystem holds the name of a disk file. It reads the
+ file from an input stream and creates a POIFSFileSystem
+ object in memory. After having read the file, the input stream should be
+ closed as shown.
In order to read the document summary information stream the application
+ must open the element \005DocumentSummaryInformation in the POI
+ filesystem's root directory. However, the POI filesystem does not
+ necessarily contain a document summary information stream, and the
+ application should be able to deal with that situation. The following
+ code does so by creating a new DocumentSummaryInformation if
+ there is none in the POI filesystem:
In the source code above the statement
+ +gets hold of the POI filesystem's root directory as a
+ DirectoryEntry. The getEntry() method of this
+ class is used to access a file or directory entry in a directory. However,
+ if the file to be opened does not exist, a
+ FileNotFoundException will be thrown. Therefore opening the
+ document summary information entry should be done in a try
+ block:
DocumentSummaryInformation.DEFAULT_STREAM_NAME represents
+ the string "\005DocumentSummaryInformation", i.e. the standard name of a
+ document summary information stream. If this stream exists, the
+ getEntry() method returns a DocumentEntry. To
+ read the DocumentEntry's contents, create a
+ DocumentInputStream:
Up to this point we have used POI's POIFS component. Now HPSF enters the + stage. A property set is created from the input stream's data:
+ +If the data really constitutes a property set, a
+ PropertySet object is created. Otherwise a
+ NoPropertySetStreamException is thrown. After having read the
+ data from the input stream the latter should be closed.
Since we know - or at least hope - that the stream named
"\005DocumentSummaryInformation" is not just any property set but really
contains the document summary information, we try to create a new
DocumentSummaryInformation from the property set. If the
stream is not a document summary information stream, the sample application
fails with an UnexpectedPropertySetTypeException.
If the POI document does not contain a document summary information
+ stream, we can create a new one in the catch clause. The
+ PropertySetFactory's method
+ newDocumentSummaryInformation() establishes a new and empty
+ DocumentSummaryInformation instance:
Whether we read the document summary information from the POI filesystem
+ or created it from scratch, in either case we now have a
+ DocumentSummaryInformation instance we can write to. Writing
+ is quite simple, as the following line of code shows:
This statement sets the "category" property to "POI example". Any + former "category" value will be lost. If there hasn't been a "category" + property yet, a new one will be created.
+ +DocumentSummaryInformation of course has methods to set the
+ other standard properties, too - look into the API documentation to see
+ all of them.
Once all properties are set as needed, they should be stored into the
+ file on disk. The first step is to write the
+ DocumentSummaryInformation into the POI filesystem:
The DocumentSummaryInformation's write()
+ method takes two parameters: The first is the DirectoryEntry
+ in the POI filesystem, the second is the name of the stream to create in
+ the directory. If this stream already exists, it will be overwritten.
Still the POI filesystem is a data structure in memory only and must be + written to a disk file to make it permanent. The following lines write + back the POI filesystem to the file it was read from before. Please note + that in production-quality code you should never write directly to the + origin file, because in case of an error everything would be lost. Here it + is done this way to keep the example short.
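For the production-quality case mentioned above, one common pattern is to write to a temporary file next to the original and only replace the original once the write has completed. The following stand-alone sketch shows that pattern with plain java.nio; it is not taken from the POI examples. The `StreamWriter` callback is a hypothetical stand-in for "write the POI filesystem to this stream" (with POI, something along the lines of passing the output stream to the filesystem's `writeFilesystem` method), and `demo()` uses ordinary bytes in place of a real document.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Illustrative sketch only: replace a file without writing into it directly.
class SafeWriteSketch {
    /** Hypothetical callback that writes the new file contents to a stream. */
    interface StreamWriter {
        void writeTo(OutputStream out) throws IOException;
    }

    static void replaceFile(Path target, StreamWriter writer) throws IOException {
        Path dir = target.toAbsolutePath().getParent();
        // The temporary file lives in the same directory as the target
        Path tmp = Files.createTempFile(dir, "poifs", ".tmp");
        try {
            try (OutputStream out = Files.newOutputStream(tmp)) {
                writer.writeTo(out);
            }
            // Only touch the original once the new copy is complete
            Files.move(tmp, target, StandardCopyOption.REPLACE_EXISTING);
        } finally {
            // No-op after a successful move; cleans up if writing failed
            Files.deleteIfExists(tmp);
        }
    }

    /** Replaces a small demo file and returns its new contents. */
    static String demo() {
        try {
            Path dir = Files.createTempDirectory("safe-write-demo");
            Path target = dir.resolve("document.bin");
            Files.write(target, "old".getBytes(StandardCharsets.UTF_8));
            replaceFile(target, out -> out.write("new".getBytes(StandardCharsets.UTF_8)));
            return new String(Files.readAllBytes(target), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

If the callback throws, the original file is left untouched and the temporary file is removed, which is the property the warning above is asking for.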
If you compare the source code excerpts above with the file containing the full source code, you will notice that I left out some of the following lines of code. They deal with the special topic of custom properties.
Custom properties are properties that users can define themselves. Using, for example, Microsoft Word, they can define these extra properties and give each of them a name, a type and a value. The custom properties are stored in the document summary information along with the standard properties.
+ +The source code example shows how to retrieve the custom properties
+ as a whole from a DocumentSummaryInformation instance using
+ the getCustomProperties() method. The result is a
+ CustomProperties instance or null if no
+ user-defined properties exist.
Since CustomProperties implements the Map
+ interface you can read and write properties with the usual
+ Map methods. However, CustomProperties poses
+ some restrictions on the types of keys and values.
String,
+ Boolean, Long, Integer,
+ Short, or java.util.Date.The CustomProperties class has been designed for easy
+ access using just keys and values. The underlying Microsoft-specific
+ custom properties data structure is more complicated. However, it does
+ not provide noteworthy additional benefits. It is possible to have
+ multiple properties with the same name or properties without a
+ name at all. When reading custom properties from a document summary
+ information stream, the CustomProperties class ignores
+ properties without a name and keeps only the "last" (whatever that means)
+ of those properties having the same name. You can find out whether a
+ CustomProperties instance dropped any properties with the
+ isPure() method.
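The reduction described above (unnamed properties are ignored, only one property per name survives, and a flag records whether anything was dropped) can be pictured with a stand-alone sketch. This is not the CustomProperties implementation and uses no POI classes: `RawProperty`, `from()` and the `pure` field are hypothetical names, and "last" is taken here to mean last in list order, which is an assumption.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: models the "ignore unnamed, keep last" behaviour.
class CustomPropertiesSketch {
    /** Hypothetical raw property as it might appear in the stream. */
    static final class RawProperty {
        final String name;  // may be null for a property without a name
        final Object value;

        RawProperty(String name, Object value) {
            this.name = name;
            this.value = value;
        }
    }

    final Map<String, Object> values = new LinkedHashMap<>();
    boolean pure = true; // becomes false as soon as a property is dropped

    static CustomPropertiesSketch from(List<RawProperty> raw) {
        CustomPropertiesSketch result = new CustomPropertiesSketch();
        for (RawProperty p : raw) {
            if (p.name == null) {
                result.pure = false; // unnamed properties are ignored
                continue;
            }
            if (result.values.containsKey(p.name)) {
                result.pure = false; // an earlier property with this name is replaced
            }
            result.values.put(p.name, p.value);
        }
        return result;
    }

    /** Sample input: one unnamed property and two sharing a name. */
    static List<RawProperty> sample() {
        return Arrays.asList(
                new RawProperty(null, 1),
                new RawProperty("Project", "alpha"),
                new RawProperty("Project", "beta"));
    }

    public static void main(String[] args) {
        CustomPropertiesSketch props = from(sample());
        System.out.println(props.values + " pure=" + props.pure);
    }
}
```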
You can read and write the full spectrum of custom properties with + HPSF's low-level methods. They are explained in the next section.
Now comes the real hardcore stuff. As mentioned above,
+ SummaryInformation and
+ DocumentSummaryInformation are just special cases of the
+ general concept of a property set. This concept says that a
+ property set consists of properties and that each
+ property is an entity with an ID, a
+ type, and a value.
Okay, that was still rather easy. However, to make things more + complicated, Microsoft in its infinite wisdom decided that a property set + shalt be broken into one or more sections. Each section + holds a bunch of properties. But since that's still not complicated + enough, a section may have an optional dictionary that + maps property IDs to property names - we'll explain + later what that means.
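To make the structure just described concrete, here is a plain Java model of it: a property set holds sections, a section holds properties and an optional dictionary, and a property has an ID, a type and a value. The class names (`PropSet`, `Sect`, `Prop`) are deliberately different from HPSF's real classes, and the numbers in `sample()` are arbitrary values chosen for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: a toy model of the property set concept.
class PropertySetModelSketch {
    /** A property: an entity with an ID, a type, and a value. */
    static final class Prop {
        final long id;
        final long type;
        final Object value;

        Prop(long id, long type, Object value) {
            this.id = id;
            this.type = type;
            this.value = value;
        }
    }

    /** A section: a bunch of properties plus an optional dictionary. */
    static final class Sect {
        final List<Prop> properties = new ArrayList<>();
        // Maps property IDs to property names; left empty if there is no dictionary
        final Map<Long, String> dictionary = new HashMap<>();

        /** Returns the property's name, or null if the dictionary has no entry. */
        String nameOf(Prop p) {
            return dictionary.get(p.id);
        }
    }

    /** A property set: one or more sections. */
    static final class PropSet {
        final List<Sect> sections = new ArrayList<>();
    }

    /** Builds a one-section set with one named property (arbitrary numbers). */
    static PropSet sample() {
        Sect section = new Sect();
        section.properties.add(new Prop(2, 30, "POI example"));
        section.dictionary.put(2L, "Project");
        PropSet set = new PropSet();
        set.sections.add(section);
        return set;
    }

    public static void main(String[] args) {
        for (Sect s : sample().sections) {
            for (Prop p : s.properties) {
                System.out.println(p.id + " (" + s.nameOf(p) + ") = " + p.value);
            }
        }
    }
}
```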
+ +The procedure to get to the properties is the following:
+ +PropertySetFactory class to
+ create a PropertySet object from a property set stream. If
+ you don't know whether an input stream is a property set stream, just
+ try to call PropertySetFactory.create(java.io.InputStream):
+ You'll either get a PropertySet instance returned or an
+ exception is thrown.PropertySet's method getSections()
+ to get the sections contained in the property set. Each section is
+ an instance of the Section class.F29F85E0-4FF9-1068-AB-91-08-00-2B-27-B3-D9. You can
+ get the format ID with Section.getFormatID().Section can be retrieved
+ with Section.getProperties(). The result is an array of
+ Property instances.Property
+ class has methods to retrieve them.Let's have a look at a sample Java application that dumps all property + set streams contained in a POI file system. The full source code of this + program can be found as ReadCustomPropertySets.java in the + examples area of the POI source code tree. Here are the key + sections:
+ +The most important package the application needs is
+ org.apache.poi.hpsf.*. This package contains the HPSF
+ classes. Most classes named below are from the HPSF package. Of course we
+ also need the POIFS event file system's classes and java.io.*
+ since we are dealing with POI I/O. From the java.util package
+ we use the List and Iterator class. The class
+ org.apache.poi.util.HexDump provides a methods to dump byte
+ arrays as nicely formatted strings.
The POIFSReader is set up in a way that the listener
+ MyPOIFSReaderListener is called on every file in the POI file
+ system.
The listener class tries to create a PropertySet from each
+ stream using the PropertySetFactory.create() method:
Creating the PropertySet is done in a try
+ block, because not every stream in the POI file system contains a property
+ set. If it is some other file, the
+ PropertySetFactory.create() throws a
+ NoPropertySetStreamException, which is caught and
+ logged. Then the program continues with the next stream. However, all
+ other types of exceptions cause the program to terminate by throwing a
+ runtime exception. If all went well, we can print the name of the property
+ set stream.
The next step is to print the number of sections followed by the + sections themselves:
+ +The PropertySet's method getSectionCount()
+ returns the number of sections.
To retrieve the sections, use the getSections()
+ method. This method returns a java.util.List containing
+ instances of the Section class in their proper order.
The sample code shows a loop that retrieves the Section
+ objects one by one and prints some information about each one. Here is
+ the complete body of the loop:
The first method called on the Section instance is
+ getFormatID(). As explained above, the format ID of the
+ first section in a property set determines the type of the property
+ set. Its type is ClassID which is essentially a sequence of
+ 16 bytes. A real application using its own type of a custom property set
+ should have defined a unique format ID and, when reading a property set
+ stream, should check the format ID is equal to that unique format ID. The
+ sample program just prints the format ID it finds in a section:
As you can see, the getFormatID() method returns a
+ ClassID object. An array containing the bytes can be
+ retrieved with ClassID.getBytes(). In order to get a nicely
+ formatted printout, the sample program uses the hex() helper
+ method which in turn uses the POI utility class HexDump in
+ the org.apache.poi.util package. Another helper method is
+ out() which just saves typing
+ System.out.println().
Before getting the properties, it is possible to find out how many
+ properties are available in the section via the
+ Section.getPropertyCount(). The sample application uses this
+ method to print the number of properties to the standard output:
Now its time to get to the properties themselves. You can retrieve a
+ section's properties with the method
+ Section.getProperties():
As you can see the result is an array of Property
+ objects. This class has three methods to retrieve a property's ID, its
+ type, and its value. The following code snippet shows how to call
+ them:
The output of the sample program might look like the following. It
+ shows the summary information and the document summary information
+ property sets of a Microsoft Word document. However, unlike the first and
+ second section of this HOW-TO the application does not have any code
+ which is specific to the SummaryInformation and
+ DocumentSummaryInformation classes.
There are some interesting items to note:
+ +Properties in the same section are distinguished by their IDs. This is + similar to variables in a programming language like Java, which are + distinguished by their names. But unlike variable names, property IDs are + simple integral numbers. There is another similarity, however. Just like + a Java variable has a certain scope (e.g. a member variables in a class), + a property ID also has its scope of validity: the section.
+ +Two property IDs in sections with different section format IDs + don't have the same meaning even though their IDs might be equal. For + example, ID 4 in the first (and only) section of a summary + information property set denotes the document's author, while ID 4 in the + first section of the document summary information property set means the + document's byte count. The sample output above does not show a property + with an ID of 4 in the first section of the document summary information + property set. That means that the document does not have a byte + count. However, there is a property with an ID of 4 in the + second section: This is a user-defined property ID - we'll get + to that topic in a minute.
+ +So, how can you find out what the meaning of a certain property ID in
+ the summary information and the document summary information property set
+ is? The standard property sets as such don't have any hints about the
+ meanings of their property IDs. For example, the summary
+ information property set does not tell you that the property ID 4 stands
+ for the document's author. This is external knowledge. Microsoft defined
+ standard meanings for some of the property IDs in the summary information
+ and the document summary information property sets. As a help to the Java
+ and POI programmer, the class PropertyIDMap in the
+ org.apache.poi.hpsf.wellknown package defines constants
+ for the "well-known" property IDs. For example, there is the
+ definition
These definitions allow you to use symbolic names instead of + numbers.
+ +In order to provide support for the other way, too, - i.e. to map
+ property IDs to property names - the class PropertyIDMap
+ defines two static methods:
+ getSummaryInformationProperties() and
+ getDocumentSummaryInformationProperties(). Both return
+ java.util.Map objects which map property IDs to
+ strings. Such a string gives a hint about the property's meaning. For
+ example,
+ PropertyIDMap.getSummaryInformationProperties().get(4)
+ returns the string "PID_AUTHOR". An application could use this string as
+ a key to a localized string which is displayed to the user, e.g. "Author"
+ in English or "Verfasser" in German. HPSF might provide such
+ language-dependent ("localized") mappings in a later release.
Usually you won't have to deal with those two maps. Instead you should
+ call the Section.getPIDString(int) method. It returns the
+ string associated with the specified property ID in the context of the
+ Section object.
Above you learned that property IDs have a meaning in the scope of a + section only. However, there are two exceptions to the rule: The property + IDs 0 and 1 have a fixed meaning in all sections:
+ +| Property ID | +Meaning | +
|---|---|
| 0 | +The property's value is a dictionary, i.e. a + mapping from property IDs to strings. | +
| 1 | +The property's value is the number of a codepage, + i.e. a mapping from character codes to characters. All strings in the + section containing this property must be interpreted using this + codepage. Typical property values are 1252 (8-bit "western" characters, + ISO-8859-1), 1200 (16-bit Unicode characters, UFT-16), or 65001 (8-bit + Unicode characters, UFT-8). | +
A property is nothing without its value. It is stored in a property set
+ stream as a sequence of bytes. You must know the property's
+ type in order to properly interpret those bytes and
+ reasonably handle the value. A property's type is one of the so-called
+ Microsoft-defined "variant types". When you call
+ Property.getType() you'll get a long value
+ denoting the property's variant type. The class
+ Variant in the org.apache.poi.hpsf package
+ holds most of those long values as named constants. For
+ example, the constant VT_I4 = 3 means a signed integer value
+ of four bytes. Examples of other types are VT_LPSTR = 30
+ meaning a null-terminated string of 8-bit characters, VT_LPWSTR =
+ 31 which means a null-terminated Unicode string, or VT_BOOL
+ = 11 denoting a boolean value.
In most cases you won't need a property's type because HPSF does all + the work for you.
+When an application wants to retrieve a property's value and calls
+ Property.getValue(), HPSF has to interpret the bytes making
+ up the value according to the property's type. The type determines how
+ many bytes the value consists of and what
+ to do with them. For example, if the type is VT_I4, HPSF
+ knows that the value is four bytes long and that these bytes
+ comprise a signed integer value in the little-endian format. This is
+ quite different from e.g. a type of VT_LPWSTR. In this case
+ HPSF has to scan the value bytes for a Unicode null character and collect
+ everything from the beginning to that null character as a Unicode
+ string.
The good new is that HPSF does another job for you, too: It maps the + variant type to an adequate Java type.
+ +| Variant type: | +Java type: | +
|---|---|
| VT_I2 | +java.lang.Integer | +
| VT_I4 | +java.lang.Long | +
| VT_FILETIME | +java.util.Date | +
| VT_LPSTR | +java.lang.String | +
| VT_LPWSTR | +java.lang.String | +
| VT_CF | +byte[] | +
| VT_BOOL | +java.lang.Boolean | +
The bad news is that there are still a couple of variant types HPSF + does not yet support. If it encounters one of these types it + returns the property's value as a byte array and leaves it to be + interpreted by the application.
+ +An application retrieves a property's value by calling the
+ Property.getValue() method. This method's return type is the
+ abstract Object class. The getValue() method
+ looks up the property's variant type, reads the property's value bytes,
+ creates an instance of an adequate Java type, assigns it the property's
+ value and returns it. Primitive types like int or
+ long will be returned as the corresponding class,
+ e.g. Integer or Long.
The property with ID 0 has a very special meaning: It is a + dictionary mapping property IDs to property names. We + have seen already that the meanings of standard properties in the + summary information and the document summary information property sets + have been defined by Microsoft. The advantage is that the labels of + properties like "Author" or "Title" don't have to be stored in the + property set. However, a user can define custom fields in, say, Microsoft + Word. For each field the user has to specify a name, a type, and a + value.
+ +The names of the custom-defined fields (i.e. the property names) are + stored in the document summary information second section's + dictionary. The dictionary is a map which associates + property IDs with property names.
+ +The method Section.getPIDString(int) not only returns with
+ the well-known property names of the summary information and document
+ summary information property sets, but with self-defined properties,
+ too. It should also work with self-defined properties in self-defined
+ sections.
The property with ID 1 holds the number of the codepage which was used + to encode the strings in this section. If this property is not available + in a section, the platform's default character encoding will be + used. This works fine as long as the document being read has been written + on a platform with the same default character encoding. However, if you + receive a document from another region of the world and the codepage is + undefined, you are in trouble.
+ +HPSF's codepage support is only as good as the character encoding + support of the Java Virtual Machine (JVM) the application runs on. If + HPSF encounters a codepage number it assumes that the JVM has a character + encoding with a corresponding name. For example, if the codepage is 1252, + HPSF uses the character encoding "cp1252" to read or write strings. If + the JVM does not have that character encoding installed or if the + codepage number is illegal, an UnsupportedEncodingException will be + thrown. This works quite well with Java 2 Standard Edition (J2SE) + versions since 1.4. However, under J2SE 1.3 or lower you are out of + luck. You should install a newer J2SE version to process codepages with + HPSF.
+ +There are some exceptions to the rule saying that a character + encoding's name is derived from the codepage number by prepending the + string "cp" to it. In these cases the codepage number is mapped to a + well-known character encoding name. Here are a few examples:
+ +More of these mappings between codepage and character encoding name are
+ hard-coded in the classes org.apache.poi.hpsf.Constants and
+ org.apache.poi.hpsf.VariantSupport. Probably there will be a
+ need to add more mappings. The HPSF author will appreciate any hints.
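<p>The naming rule and its exceptions can be sketched as follows. This is a simplified illustration that covers only the three exceptions listed above, not the mapping code HPSF actually uses. The class and method names are hypothetical.</p>

```java
import java.util.Map;

// Hypothetical helper class, not part of HPSF.
class CodepageSketch {

    // The exceptions from the list above; every other codepage gets the "cp" prefix.
    private static final Map<Integer, String> SPECIAL_NAMES = Map.of(
            932, "SJIS",
            1200, "UTF-16",
            65001, "UTF-8");

    /** Derives a character encoding name from a codepage number. */
    static String codepageToEncoding(int codepage) {
        String special = SPECIAL_NAMES.get(codepage);
        return special != null ? special : "cp" + codepage;
    }

    public static void main(String[] args) {
        System.out.println(codepageToEncoding(1252));  // cp1252
        System.out.println(codepageToEncoding(65001)); // UTF-8
    }
}
```

<p>Whether the resulting name is usable still depends on the JVM: looking it up may fail if the corresponding character encoding is not installed.</p>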
Writing properties is possible at a high level and at a low level:
+ +HPSF's writing capabilities come with the classes
+ PropertySet, Section,
+ Property, and some helper classes.
When you are going to write a property set stream your application has + to perform the following steps:
+ +PropertySet instance.Section. You can either retrieve
+ the one that is always present in a new PropertySet,
+ or you have to create a new Section and add it to
+ the PropertySet.
+ Section fields as you like.Property objects as you need. Set
+ each property's ID, type, and value. Add the
+ Property objects to the Section.
+ Sections if you need them.PropertySet.toInputStream() and write it to a POIFS
+ document.Writing properties is introduced by an artificial but simple example: a + program creating a new document (aka POI file system) which contains only + a single document: a summary information property set stream. The latter + will hold the document's title only. This is artificial in that it does + not contain any Word, Excel or other kind of useful application document + data. A document containing just a property set is without any practical + use. However, it is perfectly fine for an example because it make it very + simple and easy to understand, and you will get used to writing + properties in real applications quickly.
+ +The application expects the name of the POI file system to be written + on the command line. The title property it writes is "Sample title".
+ +Here's the application's source code. You can also find it in the + "examples" section of the POI source code distribution. Explanations are + following below.
+ +The application first checks that there is exactly one single argument
+ on the command line: the name of the file to write. If this single
+ argument is present, the application stores it in the
+ fileName variable. It will be used in the end when the POI
+ file system is written to a disk file.
Let's create a property set now. We cannot use the
+ PropertySet class, because it is read-only. It does not have
+ a constructor creating an empty property set, and it does not have any
+ methods to modify its contents, i.e. to write sections containing
+ properties into it.
The class to use is PropertySet. The sample application calls its no-args
+ constructor in order to establish an empty property set:
As said, we have an empty property set now. Later we will put some + contents into it.
+ +The PropertySet created by the no-args constructor
+ is not really empty: It contains a single section without properties. We
+ can either retrieve that section and fill it with properties or we can
+ replace it by another section. We can also add further sections to the
+ property set. The sample application decides to retrieve the section
+ being already there:
The getSections() method returns the property set's
+ sections as a list, i.e. an instance of
+ java.util.List. Calling get(0) returns the
+ list's first (or zeroth, if you prefer) element.
The alternative to retrieving the Section being
+ already there would have been to create a new
+ Section like this:
The Section the sample application retrieved from
+ the PropertySet is still empty. It contains no
+ properties and does not have a format ID. As you have read above the format ID of the first section in a
+ property set determines the property set's type. Since our property set
+ should become a SummaryInformation property set we have to set the format
+ ID of its first (and only) section to
+ F29F85E0-4FF9-1068-AB-91-08-00-2B-27-B3-D9. However, you
+ won't have to remember that ID: HPSF has it defined as the well-known
+ constant SectionIDMap.SUMMARY_INFORMATION_ID. The sample
+ application writes it to the section using the
+ setFormatID(byte[]) method:
A Property object must have an ID, a type, and a
+ value (see above for details). The class
+ provides methods to set these attributes:
The Property class has a constructor which you can
+ use to pass in all three attributes in a single call. See the Javadoc API
+ documentation for details!
The sample property set is complete now. We have a
+ PropertySet containing a Section
+ containing a Property. Of course we could have added
+ more sections to the property set and more properties to the sections but
+ we wanted to keep things simple.
The property set has to be written to a POI file system. The following + statement creates it.
+ +Writing the property set includes the step of converting it into a
+ sequence of bytes. The PropertySet class has the
+ method toInputStream() for this purpose. It returns the
+ bytes making up the property set stream as an
+ InputStream:
If you'd read from this input stream you'd receive all the property
+ set's bytes. However, it is very likely that you'll never do
+ that. Instead you'll pass the input stream to the
+ POIFSFileSystem.createDocument() method, like this:
Besides the InputStream createDocument()
+ takes a second parameter: the name of the document to be created. For a
+ SummaryInformation property set stream the default name is available as
+ the constant SummaryInformation.DEFAULT_STREAM_NAME.
The last step is to write the POI file system to a disk file:
+ +There are still some aspects of HSPF left which are not covered by this + HOW-TO. You should dig into the Javadoc API documentation to learn + further details. Since you've struggled through this document up to this + point, you are well prepared.
+Microsoft applications like "Word", "Excel" or "Powerpoint" let the user + describe a document by properties like "title", "category" and so on. The + application itself adds further information: last author, creation date + etc. These document properties are stored in property set + streams. A property set stream is a separate document within a + POI filesystem. HPSF is POI's pure-Java + implementation to read and write property sets.
+ +The HPSF HOWTO describes what a Java + application should do to read a property set using HPSF, how to retrieve + the information it needs, and how to write properties into the + document.
+ +HPSF supports OLE2 property set streams in general, and is not limited to + the special case of document properties in the Microsoft Office files + mentioned above. The HPSF description + describes the internal structure of property set streams. A separate + document explains the internal of thumbnail + images.
+A Microsoft Office document is internally organized like a filesystem
+ with directories and files. Microsoft calls these files
+ streams. A document can have properties attached to it,
+ like author, title, number of words etc. These metadata are not stored in
+ the main stream of, say, a Word document, but instead in a dedicated
+ stream with a special format. Usually this stream's name is
+ \005SummaryInformation, where \005 represents
+ the character with a decimal value of 5.
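<p>In Java source code the leading character can be written with a Unicode escape. The sketch below builds the stream name and renders it back in the <code>\005</code> notation used here. The class and its members are hypothetical helpers for illustration only.</p>

```java
// Hypothetical helper class for illustration only.
class StreamNameSketch {

    // The first character has the decimal value 5; the rest is plain text.
    static final String SUMMARY_INFORMATION_NAME = "\u0005SummaryInformation";

    /** Renders control characters in backslash-octal notation, e.g. \005. */
    static String printable(String name) {
        StringBuilder sb = new StringBuilder();
        for (char c : name.toCharArray()) {
            if (c < 0x20) {
                sb.append(String.format("\\%03o", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println((int) SUMMARY_INFORMATION_NAME.charAt(0)); // 5
        System.out.println(printable(SUMMARY_INFORMATION_NAME));      // \005SummaryInformation
    }
}
```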
A single piece of information in the stream is called a + property, for example the document title. Each property + has an integral ID (e.g. 2 for title), a + type (telling that the title is a string of bytes) and a + value (what this is should be obvious). A stream + containing properties is called a + property set stream.
+ +This document describes the internal structure of a property set stream, + i.e. the HPSF. It does + not describe how a Microsoft Office document is organized internally and + how to retrieve a stream from it. See the POIFS documentation for that kind of + stuff.
+ +The HPSF is not only used in the Summary
+ Information stream in the top-level document of a Microsoft Office
+ document. Often there is also a property set stream named
+ \005DocumentSummaryInformation with additional properties.
+ Embedded documents may have their own property set streams. You cannot
+ tell by a stream's name whether it is a property set stream or not.
+ Instead you have to open the stream and look at its bytes.
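<p>A first plausibility check on those bytes can be sketched as follows. It relies on the header layout described later in this document: a Word <code>0xFFFE</code> followed by a Word <code>0x0000</code>, both stored in little endian order. Passing the check is necessary but not sufficient, and the class and method names are hypothetical.</p>

```java
// Hypothetical helper class, not part of HPSF.
class PropertySetSignatureSketch {

    /**
     * Returns false if the stream cannot be a property set stream.
     * A true result only means the first four bytes match the expected header start.
     */
    static boolean mightBePropertySet(byte[] streamStart) {
        return streamStart.length >= 4
                && (streamStart[0] & 0xFF) == 0xFE  // low byte of the Word 0xFFFE
                && (streamStart[1] & 0xFF) == 0xFF  // high byte of the Word 0xFFFE
                && streamStart[2] == 0              // Word 0x0000
                && streamStart[3] == 0;
    }

    public static void main(String[] args) {
        System.out.println(mightBePropertySet(new byte[] {(byte) 0xFE, (byte) 0xFF, 0, 0})); // true
        System.out.println(mightBePropertySet(new byte[] {'P', 'K', 3, 4}));                 // false
    }
}
```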
Before delving into the details of the property set stream format we + have to have a short look at data types. Integral values are stored in the + so-called little endian format. In this format the bytes + that make out an integral value are stored in the "wrong" order. For + example, the decimal value 4660 is 0x1234 in the hexadecimal notation. If + you think this should be represented by a byte 0x12 followed by another + byte 0x34, you are right. This is called the big endian + format. In the little endian format, however, this order is reversed and + the low-value byte comes first: 0x3412. +
+ +The following table gives an overview about some important data + types:
+ +| Name | +Length | +Example (Big Endian) | +Example (Little Endian) | +
|---|---|---|---|
| Bytes | +1 byte | +0x12 |
+ 0x12 |
+
| Word | +2 bytes | +0x1234 |
+ 0x3412 |
+
| DWord | +4 bytes | +0x12345678 |
+ 0x78563412 |
+
| ClassID + A sequence of one DWord, two Words and eight Bytes |
+
+ 16 bytes | + +0xE0859FF2F94F6810AB9108002B27B3D9 resp.
+ E0859FF2-F94F-6810-AB-91-08-00-2B-27-B3-D9 |
+
+ 0xF29F85E04FF91068AB9108002B27B3D9 resp.
+ F29F85E0-4FF9-1068-AB-91-08-00-2B-27-B3-D9 |
+
| + | + | The ClassID examples are given here in two different notations. The + second notation without the "0x" at the beginning and with dashes + inside shows the internal grouping into one DWord, two Words and eight + Bytes. | +Watch out: Microsoft documentation and tools show class IDs
+ a little bit differently like
+ F29F85E0-4FF9-1068-AB91-08002B27B3D9.
+ However, that representation is (intentionally?) misleading with
+ respect to endianness.</td>
+
A property set stream consists of three main parts:
+ +The first bytes in a property set stream is the header. + It has a fixed length and looks like this:
+ +| Offset | +Type | +Contents | +Remarks | +
|---|---|---|---|
| 0 | +Word | +0xFFFE |
+ If the first four bytes of a stream do not contain these values, the + stream is not a property set stream. | +
| 2 | +Word | +0x0000 |
+ + |
| 4 | +DWord | +Denotes the operating system and the OS version under which this
+ stream was created. The operating system ID is in the DWord's higher
+ word (after little endian decoding): 0x0000 for Win16,
+ 0x0001 for Macintosh and 0x0002 for Win32 -
+ that's all. The reader is most likely aware of the fact that there are
+ some more operating systems. However, Microsoft does not seem to
+ know. |
+ + |
| 8 | +ClassID | +0x00000000000000000000000000000000 |
+ Most property set streams have this value but this is not + required. | +
| 24 | +DWord | +0x01000000 or greater |
+ Section count. This field's value should be equal to 1 or greater. + Microsoft claims that this is a "reserved" field, but it seems to tell + how many sections (see below) are following in the stream. This would + really make sense because otherwise you could not know where and how + far you should read section data. | +
Following the header is the section list. This is an array of pairs each + consisting of a section format ID and an offset. This array has as many + pairs of ClassID and and DWord fields as the section count field in the + header says. The Summary Information stream contains a single section, the + Document Summary Information stream contains two.
+ +| Type | +Contents | +Remarks | +
|---|---|---|
| ClassID | +Section format ID | +0xF29F85E04FF91068AB9108002B27B3D9 for the single section
+ in the Summary Information stream.+ + 0xD5CDD5022E9C101B939708002B2CF9AE for the first
+ section in the Document Summary Information stream. |
+
| DWord | +Offset | +The number of bytes between the beginning of the stream and the + beginning of the section within the stream. | +
| ClassID | +Section format ID | +... | +
| DWord | +Offset | +... | +
| ... | +... | +... | +
A section is divided into three parts: the section header (with the + section length and the number of properties in the section), the + properties list (with type and offset of each property), and the + properties themselves. Here are the details:
+ +| + | Type | +Contents | +Remarks | +
|---|---|---|---|
| Section header | + +DWord | +Length | +The length of the section in bytes. | +
| + | DWord | +Property count | +The number of properties in the section. | +
| Properties list | + +DWord | +Property ID | +The property ID tells what the property means. For example, an ID of
+ 0x0002 in the Summary Information stands for the document's
+ title. See the Property IDs
+ chapter below for more details. |
+
| + | DWord | +Offset | +The number of bytes between the beginning of the section and the + property. | +
| + | ... | +... | +... | +
| Properties | + +DWord | +Property type ("variant") | +This is the property's data type, e.g. an integer value, a byte + string or a Unicode string. See the + Property Types chapter + for details! | +
| + | Field length depends on the property type + ("variant") | +Property value | +This field's length depends on the property's type. These are the
+ bytes that make up the DWord, the byte string or some other data of
+ fixed or variable length. + + The property value's length is always stored in an area which is a + multiple of 4 in length. If the property is shorter, e.g. a byte + string of 13 bytes, the remaining bytes are padded with 0x00
+ bytes. |
+
| + | ... | +... | +... | +
As seen above, a section holds a property list: an array with property + IDs and offsets. The property ID gives each property a meaning. For + example, in the Summary Information stream the property ID 2 says that + this property is the document's title.
+ +If you want to know a property ID's meaning, it is not sufficient to + know the ID itself. You must also know the + section format ID. For example, in the Document Summary + Information stream the property ID 2 means not the document's title but + its category. Due to Microsoft's infinite wisdom the section format ID is + not part of the section. Thus if you have only a section without the + stream it is in, you cannot make any sense of the properties because you + do not know what they mean.
+ +So each section format ID has its own name space of property IDs. + Microsoft defined some "well-known" property IDs for the Summary + Information and the Document Summary Information streams. You can extend + them by your own additional IDs. This will be described below.
+ +The Summary Information stream has a single section with a section
+ format ID of 0xF29F85E04FF91068AB9108002B27B3D9. The following
+ table defines the meaning of its property IDs. Each row associates a
+ property ID with a name and an ID string. (The property
+ type is just for informational purposes given here. As we have
+ seen above, the type is always given along with the value.)
The property name is a readable string which could be + displayed to the user. However, this string is useful only for users who + understand English. The property name does not help with other + languages.
+ +The property ID string is about the same but looks more + technically and is nothing a user should bother with. You could the ID + string and map it to an appropriate display string in a particular + language. Of course you could do that with the property ID as well and + with less overhead, but people (including software developers) tend to be + better in remembering symbolic constants than remembering numbers.
+ +| Property ID | +Property Name | +Property ID String | +Property Type | +
|---|---|---|---|
| 2 | +Title | +PID_TITLE | +VT_LPSTR | +
| 3 | +Subject | +PID_SUBJECT | +VT_LPSTR | +
| 4 | +Author | +PID_AUTHOR | +VT_LPSTR | +
| 5 | +Keywords | +PID_KEYWORDS | +VT_LPSTR | +
| 6 | +Comments | +PID_COMMENTS | +VT_LPSTR | +
| 7 | +Template | +PID_TEMPLATE | +VT_LPSTR | +
| 8 | +Last Saved By | +PID_LASTAUTHOR | +VT_LPSTR | +
| 9 | +Revision Number | +PID_REVNUMBER | +VT_LPSTR | +
| 10 | +Total Editing Time | +PID_EDITTIME | +VT_FILETIME | +
| 11 | +Last Printed | +PID_LASTPRINTED | +VT_FILETIME | +
| 12 | +Create Time/Date | +PID_CREATE_DTM | +VT_FILETIME | +
| 13 | +Last Saved Time/Date | +PID_LASTSAVE_DTM | +VT_FILETIME | +
| 14 | +Number of Pages | +PID_PAGECOUNT | +VT_I4 | +
| 15 | +Number of Words | +PID_WORDCOUNT | +VT_I4 | +
| 16 | +Number of Characters | +PID_CHARCOUNT | +VT_I4 | +
| 17 | +Thumbnail | +PID_THUMBNAIL | +VT_CF | +
| 18 | +Name of Creating Application | +PID_APPNAME | +VT_LPSTR | +
| 19 | +Security | +PID_SECURITY | +VT_I4 | +
The Document Summary Information stream has two sections with a section
+ format ID of 0xD5CDD5022E9C101B939708002B2CF9AE for the first
+ one. The following table defines the meaning of the property IDs in the
+ first section. See the preceding section for interpreting the table.
| Property ID | +Property name | +Property ID string | +VT type | +
|---|---|---|---|
| 0 | +Dictionary | +PID_DICTIONARY | +[Special format] | +
| 1 | +Code page | +PID_CODEPAGE | +VT_I2 | +
| 2 | +Category | +PID_CATEGORY | +VT_LPSTR | +
| 3 | +PresentationTarget | +PID_PRESFORMAT | +VT_LPSTR | +
| 4 | +Bytes | +PID_BYTECOUNT | +VT_I4 | +
| 5 | +Lines | +PID_LINECOUNT | +VT_I4 | +
| 6 | +Paragraphs | +PID_PARCOUNT | +VT_I4 | +
| 7 | +Slides | +PID_SLIDECOUNT | +VT_I4 | +
| 8 | +Notes | +PID_NOTECOUNT | +VT_I4 | +
| 9 | +HiddenSlides | +PID_HIDDENCOUNT | +VT_I4 | +
| 10 | +MMClips | +PID_MMCLIPCOUNT | +VT_I4 | +
| 11 | +ScaleCrop | +PID_SCALE | +VT_BOOL | +
| 12 | +HeadingPairs | +PID_HEADINGPAIR | +VT_VARIANT | VT_VECTOR | +
| 13 | +TitlesofParts | +PID_DOCPARTS | +VT_LPSTR | VT_VECTOR | +
| 14 | +Manager | +PID_MANAGER | +VT_LPSTR | +
| 15 | +Company | +PID_COMPANY | +VT_LPSTR | +
| 16 | +LinksUpTo Date | +PID_LINKSDIRTY | +VT_BOOL | +
A property consists of a DWord type field followed by the + property value. The property type is an integer value and tells how the + data byte following it are to be interpreted. In the Microsoft world it is + also known as the variant.
+ +The Usage column says where a variant type may occur. Not all + of them are allowed in a property set but just those marked with a [P]. + [V] - may appear in a VARIANT, [T] - may + appear in a TYPEDESC, [P] - may appear in an OLE property + set, [S] - may appear in a Safe Array.
+ +| Variant ID | +Variant Type | +Usage | +Description | +
|---|---|---|---|
| 0 | +VT_EMPTY | +[V] [P] | +nothing | +
| 1 | +VT_NULL | +[V] [P] | +SQL style Null | +
| 2 | +VT_I2 | +[V] [T] [P] [S] | +2 byte signed int | +
| 3 | +VT_I4 | +[V] [T] [P] [S] | +4 byte signed int | +
| 4 | +VT_R4 | +[V] [T] [P] [S] | +4 byte real | +
| 5 | +VT_R8 | +[V] [T] [P] [S] | +8 byte real | +
| 6 | +VT_CY | +[V] [T] [P] [S] | +currency | +
| 7 | +VT_DATE | +[V] [T] [P] [S] | +date | +
| 8 | +VT_BSTR | +[V] [T] [P] [S] | +OLE Automation string | +
| 9 | +VT_DISPATCH | +[V] [T] [P] [S] | +IDispatch * | +
| 10 | +VT_ERROR | +[V] [T] [S] | +SCODE | +
| 11 | +VT_BOOL | +[V] [T] [P] [S] | +True=-1, False=0 | +
| 12 | +VT_VARIANT | +[V] [T] [P] [S] | +VARIANT * | +
| 13 | +VT_UNKNOWN | +[V] [T] [S] | +IUnknown * | +
| 14 | +VT_DECIMAL | +[V] [T] [S] | +16 byte fixed point | +
| 16 | +VT_I1 | +[T] | +signed char | +
| 17 | +VT_UI1 | +[V] [T] [P] [S] | +unsigned char | +
| 18 | +VT_UI2 | +[T] [P] | +unsigned short | +
| 19 | +VT_UI4 | +[T] [P] | +unsigned int | +
| 20 | +VT_I8 | +[T] [P] | +signed 64-bit int | +
| 21 | +VT_UI8 | +[T] [P] | +unsigned 64-bit int | +
| 22 | +VT_INT | +[T] | +signed machine int | +
| 23 | +VT_UINT | +[T] | +unsigned machine int | +
| 24 | +VT_VOID | +[T] | +C style void | +
| 25 | +VT_HRESULT | +[T] | +Standard return type | +
| 26 | +VT_PTR | +[T] | +pointer type | +
| 27 | +VT_SAFEARRAY | +[T] | +(use VT_ARRAY in VARIANT) | +
| 28 | +VT_CARRAY | +[T] | +C style array | +
| 29 | +VT_USERDEFINED | +[T] | +user defined type | +
| 30 | +VT_LPSTR | +[T] [P] | +null terminated string | +
| 31 | +VT_LPWSTR | +[T] [P] | +wide null terminated string | +
| 64 | +VT_FILETIME | +[P] | +FILETIME | +
| 65 | +VT_BLOB | +[P] | +Length prefixed bytes | +
| 66 | +VT_STREAM | +[P] | +Name of the stream follows | +
| 67 | +VT_STORAGE | +[P] | +Name of the storage follows | +
| 68 | +VT_STREAMED_OBJECT | +[P] | +Stream contains an object | +
| 69 | +VT_STORED_OBJECT | +[P] | +Storage contains an object | +
| 70 | +VT_BLOB_OBJECT | +[P] | +Blob contains an object | +
| 71 | +VT_CF | +[P] | +Clipboard format | +
| 72 | +VT_CLSID | +[P] | +A Class ID | +
| 0x1000 | +VT_VECTOR | +[P] | +simple counted array | +
| 0x2000 | +VT_ARRAY | +[V] | +SAFEARRAY* | +
| 0x4000 | +VT_BYREF | +[V] | +void* for local use | +
| 0x8000 | +VT_RESERVED | +||
| 0xFFFF | +VT_ILLEGAL | +||
| 0xFFF | +VT_ILLEGALMASKED | +||
| 0xFFF | +VT_TYPEMASK | +
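The way the type field and the flag values in the table fit together can be sketched in a few lines of Java. This is only an illustration under stated assumptions: it assumes little-endian byte order, which this section does not state, and it handles only VT_I2 and VT_I4. The class and method names are hypothetical and are not part of the HPSF API.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper for illustration only; not part of the HPSF API.
final class PropertyTypeSketch {
    // Values taken from the variant type table above.
    static final int VT_I2 = 2;
    static final int VT_I4 = 3;
    static final int VT_VECTOR = 0x1000;
    static final int VT_TYPEMASK = 0xFFF;

    /** Reads the DWord type field and, for VT_I2 and VT_I4, the value that follows it. */
    static String describe(byte[] property) {
        // Assumption: all fields are stored in little-endian byte order.
        ByteBuffer buf = ByteBuffer.wrap(property).order(ByteOrder.LITTLE_ENDIAN);
        int type = buf.getInt();
        // Strip flag bits such as VT_VECTOR to obtain the base variant type.
        int baseType = type & VT_TYPEMASK;
        if ((type & VT_VECTOR) != 0) {
            return "vector of variant type " + baseType;
        }
        switch (baseType) {
            case VT_I2:
                return "VT_I2 value " + buf.getShort();
            case VT_I4:
                return "VT_I4 value " + buf.getInt();
            default:
                return "unhandled variant type " + baseType;
        }
    }

    public static void main(String[] args) {
        // Type field 3 (VT_I4) followed by the 4-byte value 42.
        byte[] property = {3, 0, 0, 0, 42, 0, 0, 0};
        System.out.println(describe(property));
    }
}
```

A type field of 0x101E, for example, combines VT_VECTOR (0x1000) with VT_LPSTR (30), which matches the "VT_LPSTR | VT_VECTOR" notation used in the property ID tables.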
What a dictionary is good for is explained in the HPSF HOW-TO. This chapter explains how it is + organized internally.
+ +The dictionary has a simple header consisting of a single UInt value. It + tells how many entries the dictionary comprises:
+ +| Name | +Data type | +Description | +
|---|---|---|
| nrEntries | +UInt | +Number of dictionary entries | +
The dictionary entries follow the header. Each one looks like this:
+ +| Name | +Data type | +Description | +
|---|---|---|
| key | +UInt | +The unique number of this property, i.e. the PID | +
| length | +UInt | +The length of the property name associated with the key | +
| value | +String | +The property's name, terminated with a 0x00 character | +
The entries are not aligned, i.e. each one follows its predecessor + without any gap or fill characters.
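The header and entry layout described above can be read with a short loop. The following Java sketch is an illustration under several assumptions that the tables do not spell out: UInt is taken to be a 4-byte little-endian value, the length field is taken to count the bytes of the name including the terminating 0x00, and names are decoded as single-byte ISO-8859-1 characters. The class and method names are hypothetical and are not part of the HPSF API.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of the HPSF API.
final class DictionarySketch {

    /** Parses a dictionary: a UInt entry count followed by unaligned (key, length, value) entries. */
    static Map<Long, String> readDictionary(byte[] data) {
        // Assumption: UInt fields are 4 bytes wide and little-endian.
        ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
        long nrEntries = Integer.toUnsignedLong(buf.getInt());
        Map<Long, String> dictionary = new LinkedHashMap<>();
        for (long i = 0; i < nrEntries; i++) {
            long key = Integer.toUnsignedLong(buf.getInt());
            // Assumption: length counts the name's bytes, including the 0x00 terminator.
            int length = buf.getInt();
            byte[] raw = new byte[length];
            buf.get(raw);
            // Drop the terminating 0x00 before decoding the name.
            int end = length;
            while (end > 0 && raw[end - 1] == 0) {
                end--;
            }
            // Assumption: a single-byte code page; real files may use another encoding.
            dictionary.put(key, new String(raw, 0, end, StandardCharsets.ISO_8859_1));
        }
        return dictionary;
    }

    public static void main(String[] args) {
        byte[] data = {
            2, 0, 0, 0,                              // nrEntries = 2
            2, 0, 0, 0, 4, 0, 0, 0, 'F', 'o', 'o', 0, // key 2, length 4, "Foo"
            3, 0, 0, 0, 4, 0, 0, 0, 'B', 'a', 'r', 0  // key 3, length 4, "Bar"
        };
        System.out.println(readDictionary(data));
    }
}
```

Because the entries are not aligned, the loop simply continues reading at the byte after each terminator; no padding is skipped.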
In order to assemble the HPSF description I used only information publicly available on the Internet. The references given below have been very helpful. If you have any amendments or corrections, please let us know! Thank you!
+ +VT_ types is in
+ Variant
+ Type Definitions.FILETIME? The answer can be found
+ under https://www.vbapi.com/ref/f/filetime.html or
+ https://www.cs.rpi.edu/courses/fall01/os/FILETIME.html.
+ In short: The FILETIME structure holds a date and time associated
+ with a file. The structure identifies a 64-bit integer specifying the
+ number of 100-nanosecond intervals which have passed since January 1,
+ 1601. This 64-bit value is split into the two dwords stored in the
+ structure.Thumbnail information is stored as a VT_CF, or Thumbnail Variant. The + Thumbnail Variant is used to store various types of information in a + clipboard. The VT_CF can store information in formats for the Macintosh or + Windows clipboard.
+ +There are many types of data that can be copied to the clipboard, but the + only types of information needed for thumbnail manipulation are the image + formats.
+ +The VT_CF structure looks like this:
| Element: | +Clipboard Size | +Clipboard Format Tag | +Clipboard Data | +
|---|---|---|---|
| Size: | +32 bit unsigned integer (DWord) | +32 bit signed integer (DWord) | +variable length (byte array) | +
The Clipboard Size refers to the size (in bytes) of Clipboard Data + (variable size) plus the Clipboard Format (four bytes).
+ +Clipboard Format Tag has four possible values:
+ +| Value | +Identifier | +Description | +
|---|---|---|
-1L |
+ CFTAG_WINDOWS |
+ a built-in Windows© clipboard format value | +
-2L |
+ CFTAG_MACINTOSH |
+ a Macintosh clipboard format value | +
-3L |
+ CFTAG_FMTID |
+ a format identifier (FMTID). This is rarely used. | +
0L |
+ CFTAG_NODATA |
+ No data. This is rarely used. | +
Windows clipboard data has four image formats for thumbnails:
+ +| Value | +Identifier | +Description | +
|---|---|---|
| 3 | +CF_METAFILEPICT |
+ Windows metafile format - recommended | +
| 8 | +CF_DIB |
+ Device Independent Bitmap | +
| 14 | +CF_ENHMETAFILE |
+ Enhanced Windows metafile format | +
| 2 | +CF_BITMAP |
+ Bitmap - Obsolete - Use CF_DIB instead |
+
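The VT_CF layout and the two value tables above can be tied together in a small Java sketch that inspects the start of a thumbnail property. It is an illustration under assumptions: little-endian byte order is assumed, and the Windows clipboard format value is assumed to be the first DWord of the Clipboard Data for every Windows format. The class and method names are hypothetical and are not part of the HPSF API.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper for illustration only; not part of the HPSF API.
final class ThumbnailHeaderSketch {
    // Clipboard Format Tag values from the table above.
    static final int CFTAG_WINDOWS = -1;
    static final int CFTAG_MACINTOSH = -2;
    static final int CFTAG_FMTID = -3;
    static final int CFTAG_NODATA = 0;

    /** Maps a Windows clipboard format value to its identifier. */
    static String windowsFormatName(int format) {
        switch (format) {
            case 2:  return "CF_BITMAP";
            case 3:  return "CF_METAFILEPICT";
            case 8:  return "CF_DIB";
            case 14: return "CF_ENHMETAFILE";
            default: return "unknown format " + format;
        }
    }

    /** Reads Clipboard Size and Clipboard Format Tag, then names the format. */
    static String describe(byte[] vtCf) {
        // Assumption: all fields are stored in little-endian byte order.
        ByteBuffer buf = ByteBuffer.wrap(vtCf).order(ByteOrder.LITTLE_ENDIAN);
        long clipboardSize = Integer.toUnsignedLong(buf.getInt());
        int tag = buf.getInt();
        switch (tag) {
            case CFTAG_WINDOWS:
                // Assumption: the format value is the first DWord of the Clipboard Data.
                return "Windows " + windowsFormatName(buf.getInt())
                        + ", clipboard size " + clipboardSize + " bytes";
            case CFTAG_MACINTOSH:
                return "Macintosh clipboard format";
            case CFTAG_FMTID:
                return "format identifier (FMTID)";
            case CFTAG_NODATA:
                return "no data";
            default:
                return "unknown tag " + tag;
        }
    }

    public static void main(String[] args) {
        // Size 8 = 4 bytes of format tag + 4 bytes of data; tag -1; format 3.
        byte[] vtCf = {8, 0, 0, 0, (byte) 0xFF, (byte) 0xFF, (byte) 0xFF, (byte) 0xFF, 3, 0, 0, 0};
        System.out.println(describe(vtCf));
    }
}
```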
The most common format for thumbnails on the Windows platform is the Windows metafile format. The clipboard places an extra header in front of the standard Windows Metafile Format data.
+ +The Clipboard Data byte array looks like this when an image is stored in + Windows' Clipboard WMF format.
+ +| Identifier | +CF_METAFILEPICT | +mm | +width | +height | +handle | +WMF data | +
|---|---|---|---|---|---|---|
| Size | +32 bit unsigned int | +16 bit unsigned(?) int | +16 bit unsigned(?) int | +16 bit unsigned(?) int | +16 bit unsigned(?) int | +byte array - variable length | +
| Description | +Clipboard WMF | +Mapping Mode | +Image Width | +Image Height | +handle to the WMF data array in memory, or 0 | +standard WMF byte stream | +
FIXME: Describe the Device Independent Bitmap + format!
+FIXME: Describe the Macintosh clipboard formats!
The following functionalities should be added to HPSF:
+ +org.apache.poi.hpsf.wellknown to ease
+ localizations. This would be useful for mapping standard property IDs to
+ localized strings. Example: The property ID 4 could be mapped to "Author"
+ in English or "Verfasser" in German.
+ java.awt.Image example code in the Thumbnail HOW-TO.
+ HSMF is the POI Project's pure Java implementation of the Outlook MSG format.
At this time, it provides low-level read access to all of the file, along with a user-facing way to get at the common textual content of MSG files.
+There is an example MSG textual renderer, which shows how to access the + common parts such as sender, subject, message body and examples. This is + in the + HSMF examples area + of SVN. You may also wish to look at the unit tests for more use guides.
+ +The Apache POI project is the master project for developing pure + Java ports of file formats based on Microsoft's OLE 2 Compound + Document Format. OLE 2 Compound Document Format is used by + Microsoft Office Documents, as well as by programs using MFC + property sets to serialize their document objects. +
+Apache POI is also the master project for developing pure + Java ports of file formats based on Office Open XML (ooxml). + OOXML is part of an ECMA / ISO standardisation effort. This + documentation is quite large, but you can normally find the bit you + need without too much effort! + ECMA-376 standard is here, + and is also under the + Microsoft OSP. +
+ + ++ POIFS is the oldest and most stable part of POI. It is our port of the OLE 2 Compound Document Format to + pure Java. It supports both read and write functionality. All of our components for the binary (non-XML) + Microsoft Office formats ultimately rely on it by + definition. Please see the POIFS project page for more information. +
++ HSSF is our port of the Microsoft Excel 97 (-2003) file format (BIFF8) to pure + Java. XSSF is our port of the Microsoft Excel XML (2007+) file format (OOXML) to + pure Java. SS is a package that provides common support for both formats with a common API. + They both support read and write capability. Please see + the HSSF+XSSF project page for more + information. +
++ HWPF is our port of the Microsoft Word 97 (-2003) file format to pure + Java. It supports read, and limited write capabilities. It also provides + simple text extraction support for the older Word 6 and Word 95 formats. + Please see the HWPF project page for more + information. This component remains in early stages of + development. It can already read and write simple files. +
++ We are also working on the XWPF for the WordprocessingML (2007+) format from the + OOXML specification. This provides read and write support for simpler + files, along with text extraction capabilities. +
++ HSLF is our port of the Microsoft PowerPoint 97(-2003) file format to pure + Java. It supports read and write capabilities. Please see the HSLF project page for more + information. +
++ We are also working on the XSLF for the PresentationML (2007+) format from the + OOXML specification. +
HPSF is our port of the OLE 2 property set format to pure Java. Property sets are mostly used to store a document's properties (title, author, date of last modification etc.), but they can be used for application-specific purposes as well.
++ HPSF supports both reading and writing of properties. +
++ Please see the HPSF project + page for more information. +
++ HDGF is our port of the Microsoft Visio 97(-2003) file format to pure + Java. It currently only supports reading at a very low level, and + simple text extraction. Please see the HDGF / Diagram project page for more + information. +
++ XDGF is our port of the Microsoft Visio XML (.vsdx) file format to pure + Java. It has slightly more support than HDGF. Please see the XDGF / Diagram project page for more + information. +
++ HPBF is our port of the Microsoft Publisher 98(-2007) file format to pure + Java. It currently only supports reading at a low level for around + half of the file parts, and simple text extraction. Please see the HPBF project page for more + information. +
++ HMEF is our port of the Microsoft TNEF (Transport Neutral Encoding + Format) file format to pure Java. TNEF is sometimes used by Outlook + for encoding the message, and will typically come through as + winmail.dat. HMEF currently only supports reading at a low level, but + we hope to add text and attachment extraction. Please see the HMEF project page for more + information. +
HSMF is our port of the Microsoft Outlook message file format to pure Java. It currently only supports some of the textual content of MSG files, and some attachments. Further support and documentation are coming in slowly. For now, users are advised to consult the unit tests for example use. Please see the HSMF project page for more information.
++ Microsoft has recently added the Outlook file format to its OSP. More information + is now available making implementing this API an easier task. +
++ The Apache POI distribution consists of support for many document file formats. This support is provided + in several Jar files. Not all of the Jars are needed for every format. The following tables + show the relationships between POI components, Maven repository tags, and the project's Jar files. +
+| Component | +Application type | +Maven artifactId | +Notes | +
|---|---|---|---|
| POIFS | +OLE2 Filesystem | +poi | +Required to work with OLE2 / POIFS based files | +
| HPSF | +OLE2 Property Sets | +poi | ++ |
| HSSF | +Excel XLS | +poi | +For HSSF only, if common SS is needed see below | +
| HSLF | +PowerPoint PPT | +poi-scratchpad | ++ |
| HWPF | +Word DOC | +poi-scratchpad | ++ |
| HDGF | +Visio VSD | +poi-scratchpad | ++ |
| HPBF | +Publisher PUB | +poi-scratchpad | ++ |
| HSMF | +Outlook MSG | +poi-scratchpad | ++ |
| DDF | +Escher common drawings | +poi | ++ |
| HWMF | +WMF drawings | +poi-scratchpad | ++ |
| OpenXML4J | +OOXML | +poi-ooxml plus either poi-ooxml-lite or + poi-ooxml-full |
+ See notes below for differences between these options | +
| XSSF | +Excel XLSX | +poi-ooxml | ++ |
| XSLF | +PowerPoint PPTX | +poi-ooxml | ++ |
| XWPF | +Word DOCX | +poi-ooxml | ++ |
| XDGF | +Visio VSDX | +poi-ooxml | ++ |
| Common SL | +PowerPoint PPT and PPTX | +poi-scratchpad and poi-ooxml | +SL code is in the core POI jar, but implementations are in poi-scratchpad + and poi-ooxml. | +
| Common SS | +Excel XLS and XLSX | +poi-ooxml | +WorkbookFactory and friends all require poi-ooxml, not just core poi | +
+ This table maps artifacts into the jar file name. "version-yyyymmdd" is + the POI version stamp. You can see what the latest stamp is on the + downloads page. +
+| Maven artifactId | +Prerequisites | +JAR | +
|---|---|---|
| poi | +log4j 2.x, + commons-codec, + commons-collections, + commons-math3, + commons-io + | +poi-version-yyyymmdd.jar | +
| poi-scratchpad | +poi | +poi-scratchpad-version-yyyymmdd.jar | +
| poi-ooxml | +poi,
+ poi-ooxml-lite,
+ commons-compress,
+ SparseBitSet + For SVG support: + batik-all, + xml-apis-ext, + xmlgraphics-commons + For PDF support: + pdfbox, + fontbox, + rototor graphics2d + |
+ poi-ooxml-version-yyyymmdd.jar | +
| poi-ooxml-lite | +xmlbeans | +poi-ooxml-lite-version-yyyymmdd.jar | +
| poi-examples | +poi, + poi-scratchpad, + poi-ooxml + | +poi-examples-version-yyyymmdd.jar | +
| poi-ooxml-full (known as ooxml-schemas) | +xmlbeans + For signing: + bcpkix-jdk18on, + bcprov-jdk18on, + xmlsec, + slf4j-api + |
+ poi-ooxml-full-version-yyyymmdd.jar | +
+
poi-ooxml requires poi-ooxml-lite. This is a substantially smaller version of the poi-ooxml-full jar (ooxml-schemas-1.4.jar for POI 4.0.0, ooxml-schemas-1.3.jar for POI 3.14 up to POI 3.17, ooxml-schemas-1.1.jar for POI 3.7 up to POI 3.13, ooxml-schemas-1.0.jar for POI 3.5 and 3.6). The larger poi-ooxml-full (formerly ooxml-schemas) jar is normally only required for features that are not fully implemented in poi-ooxml. There used to also be an ooxml-security jar, which contained all of the classes relating to encryption and signing. POI 5 no longer needs this jar. The equivalent classes are now in poi-ooxml-full and poi-ooxml-lite. This JAR was ooxml-security-1.1.jar for POI 3.14 and POI 4. ooxml-security-1.0.jar was used prior to that.
++ The OOXML jars require a stax implementation, but now that Apache + POI requires Java 8, that dependency is provided by the JRE and no additional + stax jars are required. The OOXML jars used to require DOM4J, but + the code has now been changed to use JAXP and no additional dom4j + jars are required. By the way, look at this FAQ + if you have problems when using a non-Oracle JDK. +
++ The ooxml schemas jars are compiled with Apache XMLBeans. + It is recommended that you use the XMLBeans version that was used to build the POI OOXML schemas. + It may be possible to use newer XMLBeans jars but there are no guarantees, especially if the XMLBeans version + numbers differ a lot. +
++ Small sample programs using the POI API are available in the + src/examples + (viewvc) directory of the source distribution. +
++ All of the examples are included in POI distributions as a poi-examples artifact. +
++ POI can be run on most languages that run on the JVM. For code examples, + see Running POI on other JVM languages +
++ Besides the "official" components outlined above there is some further + software distributed with POI. This is called "contributed" software. It + is not explicitly recommended or even maintained by the POI team, but + it might still be useful to you. +
++ See POI Ruby Bindings and other code in the + poi-contrib module +
++ Logging in POI is used primarily as a debugging mechanism, not a normal runtime + logging system. Logging at levels noisier than WARN is ONLY for autopsy type debugging, and should + NEVER be enabled on a production system. +
++ Since version 5.1.0 Apache POI uses Apache Log4j v2 directly. +
++ Apache POI only depends on log4j-api and allows choosing which logging framework to use. log4j-core is + just one of many options. + If you want to continue to use another SLF4J compatible logging framework, you can deploy the + log4j-to-slf4j jar to + facilitate this. +
+
+ POI tries to name loggers after the canonical name of the containing class. For example,
+ org.apache.poi.poifs.filesystem.POIFSFileSystem. Use your logging framework's typical
+ mechanisms for activating and deactivating logging for specific loggers.
+
+ All loggers are named org.apache.poi.*, so rules applied to org.apache.poi
+ will affect all POI loggers.
+
+ Capturing POI logs using Log4j 2 Core is as simple as including the
+ log4j-core JAR in
+ your project. POI also has dependencies on libraries that make use of the SLF4J and Apache Commons
+ Logging APIs. Gather logs from these dependencies by adding the
+ Commons Logging Bridge and the
+ SLF4J Binding to your
+ project.
+
+ The simplest configuration is to capture all POI logs at the same level as your application. You might
+ want to collect all messages INFO and higher, and are OK with capturing POI messages as well.
+
+ A more recommended configuration is to capture only messages from loggers you opt in to. For example,
+ you might want to capture all messages from com.example.myapplication at INFO
+ but only POI messages at WARN or more severe.
+
Another strategy you may decide to use is to capture all messages except those coming from POI.
If your main aim is just to get rid of the scary log message from Log4j that says 'ERROR StatusLogger Log4j2 could not find a logging implementation.', then one option is to enable the SimpleLogger using a system property.
++ -Dlog4j2.loggerContextFactory=org.apache.logging.log4j.simple.SimpleLoggerContextFactory +
++ If you want to continue to use another SLF4J compatible logging framework, you can deploy the + log4j-to-slf4j jar + and the intended slf4j-bridges to facilitate this. +
++ See https://www.slf4j.org/ for more details about using SLF4J. +
++ Capturing POI logs using Logback requires adding the + Log4j to SLF4J Adapter to + your project, along with the standard Logback dependencies. POI also has dependencies on libraries that + make use of the SLF4J and Apache Commons Logging APIs. Gather logs from these dependencies by adding the + Commons Logging Bridge to your project. +
+ +
+ The simplest configuration is to capture all POI logs at the same level as your application. You might
+ want to collect all messages INFO and higher, and are OK with capturing POI messages as well.
+
+ A more recommended configuration is to capture only messages from loggers you opt in to. For example,
+ you might want to capture all messages from com.example.myapplication at INFO
+ but only POI messages at WARN or more severe.
+
Another strategy you may decide to use is to capture all messages except those coming from POI.
++ POI 5.0.0 switched to using SLF4J for logging. If you want + to enable logging, please read up on the various SLF4J compatible logging frameworks. + Apache Log4j v2 is a good choice. + Logback is also widely used. +
Prior to POI 5.0.0, POI used a custom logging framework that lets you configure where logs are sent.
++ Logging in POI 3 and 4 is used only as a debugging mechanism, not as a normal runtime + logging system. Logging at level debug/info is ONLY for debugging, and should + NEVER be enabled on a production system. +
++ The framework is extensible so that you can send log messages to any logging framework + that your application uses. +
++ A number of default logging implementations are supported by POI out-of-the-box and can be selected via a + system property. +
++ By default, logging is disabled in POI 3 and 4. Sometimes, it might be useful + to enable logging to see some debug messages printed out which can + help in analyzing problems. +
++ You can select the logging framework by setting the system property org.apache.poi.util.POILogger during application startup or by calling System.setProperty(): +
++ Note: You need to call setProperty() before any POI functionality is invoked as the logger is only initialized during startup. +
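A minimal sketch of the programmatic variant for POI 3 and 4 is shown here. The property name and the logger class name are the ones given in this section; the surrounding class and method names are hypothetical. The sketch only sets the system property and does not itself touch any POI class.

```java
// Hypothetical application bootstrap for POI 3 and 4; the class name is illustrative.
final class LegacyPoiLoggingSketch {
    static final String LOGGER_PROPERTY = "org.apache.poi.util.POILogger";
    static final String CONSOLE_LOGGER = "org.apache.poi.util.SystemOutLogger";

    /** Selects the console logger. Must run before any POI class is used. */
    static void enableConsoleLogging() {
        System.setProperty(LOGGER_PROPERTY, CONSOLE_LOGGER);
    }

    public static void main(String[] args) {
        enableConsoleLogging();
        // Only after this point should the application call into POI.
        System.out.println(LOGGER_PROPERTY + "=" + System.getProperty(LOGGER_PROPERTY));
    }
}
```

The same effect can be had without code changes by passing `-Dorg.apache.poi.util.POILogger=org.apache.poi.util.SystemOutLogger` on the JVM command line, which avoids any ordering concerns at startup.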
++ The following logger implementations are provided by POI 3 and 4: +
+| Class | +Type | +
|---|---|
| org.apache.poi.util.SystemOutLogger | +Sends log output to the system console | +
| org.apache.poi.util.NullLogger | +Default logger, does not log anything | +
| org.apache.poi.util.CommonsLogger | +Allows the use of Apache Commons Logging for logging. This can use JDK1.4 logging, + log4j, logkit, etc. The log4j dependency was removed in POI 5.0.0, so you will need to include this dependency yourself if you need it. | +
| org.apache.poi.util.DummyPOILogger | +Simple logger which will keep all log-lines in memory for later analysis (this class is not in the jar, just in the test source). + Used primarily for testing. Note: this may cause a memory leak if used in production application! | +
+ You can send logs to other logging frameworks by implementing the interface org.apache.poi.util.POILogger. +
+
+ Every class uses a POILogger to log, and gets it using a static method
+ of the POILogFactory .
+
+ Each class in POI can log using a POILogger, which is an abstract class.
+ We decided to make our own logging facade because:
OpenXML4J is the POI Project's pure Java implementation of the Open Packaging Conventions (OPC) defined in + ECMA-376.
+Every OpenXML file comprises a collection of byte streams called parts, combined into a container called a package. + POI OpenXML4J provides a physical implementation of the OPC that uses the Zip file format.
+OpenXML4J was originally developed by + openxml4j.org, + and was contributed to Apache POI in 2008. The original code is available at + https://sourceforge.net/projects/openxml4j/. + Thanks to the support and guidance of Julien Chable
++ Apache POI can be used with any + JVM language + that can import Java jar files such as Jython, Groovy, Scala, Kotlin, and JRuby. +
+ +If you use POI in a different language (Kotlin, JRuby, ...) and would like to share a Hello POI! example, + please share it.
+Please let us know if you use POI in an environment not listed here
+There are several websites that have examples of using Apache POI in Jython projects: + python.org, + jython.org, and many others. +
+The POI library can now be compiled as a Ruby extension, allowing the API to be called from + Ruby language programs. Ruby users can therefore read and write OLE2 documents, such as Excel files + with ease +
The bindings are generated by compiling POI with gcj, and generating the Ruby wrapper using SWIG. The aim is to keep the POI API as-is. However, where Java standard library objects are used, an effort is made to transform them smoothly into Ruby objects. Therefore, where the POI API takes an OutputStream, you can pass an IO object. Where POI works with a java.util.Date or java.util.Calendar object, you can work with a Ruby Time object.
+The bindings have been developed with GCC 3.4.3 and Ruby 1.8.2. You are unlikely to get correct results with + versions of GCC prior to 3.4 or versions of Ruby prior to 1.8. To compile the Ruby extension, you must have + GCC (compiled with java language support), Ruby development headers, and SWIG. To run, you will need Ruby (obviously!) and + libgcj , presumably from the same version of GCC with which you compiled. +
++ The POI-Ruby module sits under the POI Subversion + (viewvc). Running make + inside that directory will create a loadable ruby extension poi4r.so in the release subdirectory. Tests + are in the tests/ subdirectory, and should be run from the poi-ruby directory. Please read the tests to figure out the usage. +
Note that the makefile, though designed to work across Linux/OS X/Cygwin, has been tested only on Linux. There are likely to be issues on other platforms; fixes gratefully accepted!
A version of poi4r.so is available here (broken link). It has been compiled on a Linux box with GCC 3.4.3 and Ruby 1.8.2. It dynamically links to libgcj. No guarantees about working on any other box.
+The following ruby code shows some of the things you can do with POI in Ruby
+The tc_base_tests.rb file in the tests sub directory of the source distribution + contains examples of simple uses of the API. The quick guide is the best + place to learn HSSF API use. (Note however that none of the Drawing features are implemented in the Ruby binding.) + See also the POI API documentation for more details. +
+This document describes the design of the POIFS system. It is organized as follows:
+This document is written as part of an iterative process. As that process is not yet complete, neither is + this document. +
+The design of POIFS is not dependent on the code written for the proof-of-concept prototype POIFS + package. +
As usual, the primary considerations in the design of POIFS involve the classic space-time tradeoff. In this case, the main consideration has to involve minimizing the memory footprint of POIFS. POIFS may be called upon to create relatively large documents, and in a web application server, it may be called upon to create several documents simultaneously, and it will likely co-exist with other Serializer systems, competing with those other systems for space on the server.
+We've addressed the risk of being too slow through a proof-of-concept prototype. This prototype for POIFS + involved reading an existing file, decomposing it into its constituent documents, composing a new POIFS + from the constituent documents, and writing the POIFS file back to disk and verifying that the output + file, while not necessarily a byte-for-byte image of the input file, could be read by the application + that generated the input file. This prototype proved to be quite fast, reading, decomposing, and + re-generating a large (300K) file in 2 to 2.5 seconds. +
+While the POIFS format allows great flexibility in laying out the documents and the other internal data + structures, the layout of the filesystem will be kept as simple as possible. +
+The design of the POIFS is broken down into two parts: discussion of the classes and + interfaces, and discussion of how these classes and interfaces will be used to + convert an appropriate Java InputStream (such as an XML stream) to a POIFS output stream containing an + HSSF document. +
++ Classes and Interfaces +
+The classes and interfaces used in the POIFS are broken down as follows:
+| Package | +Contents | +
|---|---|
| + net.sourceforge.poi.poifs.storage + | +Block classes and interfaces | +
| + net.sourceforge.poi.poifs.property + | +Property classes and interfaces | +
| + net.sourceforge.poi.poifs.filesystem + | +Filesystem classes and interfaces | +
| + net.sourceforge.poi.util + | +Utility classes and interfaces | +
The block classes and interfaces are shown in the following class diagram.
+
+
+
| Class/Interface | +Description | +
|---|---|
| BATBlock | +The BATBlock class represents a single big block containing 128
+ BAT entries. Its _fields array is used to
+ read and write the BAT entries into the _data array.
+ Its createBATBlocks method is used to create an array of BATBlock
+ instances from an array of int BAT entries.
+ + Its calculateStorageRequirements method calculates the number of BAT blocks
+ necessary to hold the specified number of BAT entries.
+ |
+
| BigBlock | +The BigBlock class is an abstract class representing the common big block
+ of 512 bytes. It implements BlockWritable, trivially delegating
+ the writeBlocks method of BlockWritable to its own abstract writeData
+ method.
+ |
+
| BlockWritable | +The BlockWritable interface defines a single method,
+ writeBlocks, that is used to write an implementation's block data to an
+ OutputStream.
+ |
+
| DocumentBlock | +The DocumentBlock class is used by a
+ Document
+ to hold its raw data. It also retains the number of bytes read, as this is used by the
+ Document class to determine the total size of the data, and is also used internally to
+ determine whether the block was filled by the
+ InputStream
+ or not.
+ + The DocumentBlock constructor is passed an InputStream from which
+ to fill its _data array.
+ + The size method returns the number of bytes read (_bytes_read)
+ when the instance was constructed.
+ + The partiallyRead method returns true if the _data array was not
+ completely filled, which may be interpreted by the Document as having reached the end of
+ file point.Typical use of the DocumentBlock class is like this: + + |
+
| HeaderBlock | +The HeaderBlock class is used to contain the data found in a POIFS header.
+ + Its IntegerField members are used to read and write the + appropriate entries into the + _data
+ array.Its + setBATBlocks
+ ,
+ setPropertyStart
+ , and
+ setXBATStart
+ methods are used to set the appropriate fields in the
+ _data
+ array.The + calculateXBATStorageRequirements
+ method is used to determine how many XBAT blocks are necessary to accommodate the specified
+ number of BAT blocks.
+ |
+
| PropertyBlock | +The PropertyBlock class is used to contain
+ Property
+ instances for the
+ PropertyTable
+ class. It contains an array, _properties of 4 Property instances, which
+ together comprise the 512 bytes of a BigBlock.
+ + The createPropertyBlockArray method is used to convert a
+ List
+ of Property instances into an array of PropertyBlock instances. The number of Property
+ instances is rounded up to a multiple of 4 by creating empty anonymous inner class
+ extensions of Property.
+ |
+
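The DocumentBlock behavior described in the table above (fill a 512-byte block from an InputStream, report the number of bytes read, and flag a partial read as the end of the data) can be sketched as follows. This is a stand-in written for illustration, not the POIFS implementation and not the design's original usage snippet: the `Block` class, the `readAll` method, and the choice to skip a trailing empty block are all assumptions.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the design described above; not the POIFS classes.
final class DocumentBlockSketch {
    static final int BIG_BLOCK_SIZE = 512;

    /** Mimics the described DocumentBlock: one big block filled from a stream. */
    static final class Block {
        private final byte[] data = new byte[BIG_BLOCK_SIZE];
        private final int bytesRead;

        Block(InputStream in) {
            int total = 0;
            try {
                // read() may return fewer bytes than requested, so loop until full or EOF.
                while (total < BIG_BLOCK_SIZE) {
                    int n = in.read(data, total, BIG_BLOCK_SIZE - total);
                    if (n < 0) {
                        break;
                    }
                    total += n;
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            bytesRead = total;
        }

        /** Number of bytes read when the block was constructed. */
        int size() {
            return bytesRead;
        }

        /** True if the block was not completely filled, i.e. the stream ended. */
        boolean partiallyRead() {
            return bytesRead != BIG_BLOCK_SIZE;
        }
    }

    /** Reads blocks until one is only partially filled. */
    static List<Block> readAll(InputStream in) {
        List<Block> blocks = new ArrayList<>();
        while (true) {
            Block block = new Block(in);
            // Assumption: a trailing block with no data at all is not kept.
            if (block.size() > 0) {
                blocks.add(block);
            }
            if (block.partiallyRead()) {
                break;
            }
        }
        return blocks;
    }

    public static void main(String[] args) {
        List<Block> blocks = readAll(new ByteArrayInputStream(new byte[1000]));
        int total = 0;
        for (Block block : blocks) {
            total += block.size();
        }
        System.out.println(blocks.size() + " blocks, " + total + " bytes");
    }
}
```

Summing `size()` over the blocks gives the total document size, which is the reason the design has each block retain its byte count.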
The property classes and interfaces are shown in the following class diagram. +
+
+
+
| Class/Interface | +Description | +
|---|---|
| Directory | +The Directory interface is implemented by the
+ RootProperty
+ class. It is not strictly necessary for the initial POIFS implementation, but when the POIFS
+ supports directory elements, this interface
+ will be more widely implemented, and so is included in the design at this point to ease the
+ eventual support of directory elements. Its methods are a getter/setter pair, + getChildren
+ , returning an Iterator of
+ Property
+ instances; and
+ addChild
+ , which will allow the caller to add another Property instance to the Directory's children.
+ |
+
| DocumentProperty | +The DocumentProperty class is a trivial extension of
+ Property
+ and is used by Document to keep track of its associated entry in
+ the
+ PropertyTable. Its constructor takes a name and the + document size, on the assumption that the Document will not create a DocumentProperty until + after it has created the storage for the document data and therefore knows how much data + there is. + |
+
| File | +The File interface specifies the behavior of reading and writing the next + and previous child fields of a Property. + | +
| Property | +The Property class is an abstract class that defines the basic data
+ structure of an element of the
+ Property Table. Its ByteField, + ShortField, and + IntegerField + members are used to read and write data into the appropriate locations in the + _raw_data
+ array.The + _index
+ member is used to hold a Property instance's index in the List of Property
+ instances maintained by PropertyTable, which is used to
+ populate the child property of parent
+ Directory
+ properties and the next property and previous property of sibling
+ File
+ properties.The + _name
+ ,
+ _next_file
+ , and
+ _previous_file
+ members are used to help fill the appropriate fields of the _raw_data array.Setters are + provided for some of the fields (name, property type, node color, child property, size, + index, start block), as well as a few getters (index, child property). The + preWrite
+ method is abstract and is used by the owning PropertyTable to iterate through its Property
+ instances and prepare each for writing.The + shouldUseSmallBlocks
+ method returns true if the Property's size is sufficiently small - how small is none of the
+ caller's business.
+ |
+
| PropertyBlock | +See the description in PropertyBlock. + | +
| PropertyTable | +The PropertyTable class holds all of the
+ DocumentProperty
+ instances and the
+ RootProperty
+ instance for a
+ Filesystem
+ instance. It maintains a + List
+ of its
+ Property
+ instances (
+ _properties
+ ), and when prepared to write its data by a call to
+ preWrite
+ , it gets and holds an array of
+ PropertyBlock
+ instances (
+ _blocks). It also maintains its start block in its + _start_block
+ member. It has a method, + getRoot
+ , to get the RootProperty, returning it as an implementation of
+ Directory, and a method to add a Property,
+ addProperty
+ , and a method to get its start block,
+ getStartBlock
+ .
+ |
+
| RootProperty | +The RootProperty class acts as the Directory for
+ all of the
+ DocumentProperty
+ instances. As such, it is more of a pure directory
+ entry
+ than a proper root entry
+ in the Property Table, but the initial
+ POIFS implementation does not warrant the additional complexity of a full-blown root entry,
+ and so it is not modeled in this design. It maintains a + List
+ of its children,
+ _children
+ , in order to perform its directory-oriented duties.
+ |
+
The property classes and interfaces are shown in the following class diagram. +
+
+
+
| Class/Interface | +Description | +
|---|---|
| Filesystem | +The Filesystem class is the top-level class that manages the creation of a
+ POIFS document. It maintains a + PropertyTable + instance in its + _property_table
+ member, a
+ HeaderBlock
+ instance in its
+ _header_block
+ member, and a List of its
+ Document
+ instances in its
+ _documents
+ member. It provides a method for a client to create a document ( + createDocument
+ ), and a method to write the Filesystem to an
+ OutputStream
+ (
+ writeFilesystem
+ ).
+ |
+
| BATBlock | +See the description in + BATBlock + | +
| BATManaged | +The BATManaged interface defines common behavior for objects whose location
+ in the written file is managed by the Block Allocation
+ Table. It defines methods to get a count of the implementation's + BigBlock + instances ( + countBlocks
+ ), and to set an implementation's start block (
+ setStartBlock
+ ).
+ |
+
| BlockAllocationTable | +The BlockAllocationTable is an implementation of the
+ POIFS Block Allocation Table. It is only created when the
+ Filesystem
+ is about to be written to an
+ OutputStream. It contains an IntList of block + numbers for all of the + BATManaged + implementations owned by the Filesystem, + _entries
+ , which is filled by calls to
+ allocateSpace
+ . It fills its array, + _blocks
+ , of
+ BATBlock
+ instances when its
+ createBATBlocks
+ method is called. This method has to take into account its own storage requirements, as well
+ as those of the XBAT blocks, and so calls
+ BATBlock.calculateStorageRequirements
+ and
+ HeaderBlock.calculateXBATStorageRequirements
+ repeatedly until the counts returned by those methods stabilize. The + countBlocks
+ method returns the number of BATBlock instances created by the preceding call to
+ createBATBlocks.
+ |
+
| BlockWritable | +See the description in + BlockWritable + | +
| Document | +The Document class is used to contain a document, such as an HSSF workbook.
+ It has its own + DocumentProperty + ( + _property
+ ) and stores its data in a collection of
+ DocumentBlock
+ instances (
+ _blocks
+ ). It has a method, + getDocumentProperty
+ , to get its DocumentProperty.
+ |
+
| DocumentBlock | +See the description in + DocumentBlock + | +
| DocumentProperty | +See the description in + DocumentProperty + | +
| HeaderBlock | +See the description in + HeaderBlock + | +
| PropertyTable | +See the description in + PropertyTable + | +
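The stabilization loop described for createBATBlocks in the table above can be sketched as follows. This is a minimal illustration and not the POIFS implementation: the class and method names are hypothetical, and the constants (128 entries per BAT block, 109 BAT block indices in the header, 127 BAT block references per XBAT block) are the figures given in the file format description elsewhere in this documentation. The loop recomputes both counts until neither changes, because every BAT or XBAT block added is itself a block that needs a BAT entry.

```java
final class BatSizing {
    static final int ENTRIES_PER_BAT_BLOCK = 128;
    static final int HEADER_BAT_SLOTS = 109;
    static final int BAT_REFS_PER_XBAT_BLOCK = 127;

    /** XBAT blocks needed to index the BAT blocks that do not fit in the header. */
    static int xbatBlocksFor(int batBlocks) {
        int overflow = Math.max(0, batBlocks - HEADER_BAT_SLOTS);
        return (overflow + BAT_REFS_PER_XBAT_BLOCK - 1) / BAT_REFS_PER_XBAT_BLOCK;
    }

    /** BAT blocks needed to cover dataBlocks plus the BAT and XBAT blocks themselves. */
    static int batBlocksFor(int dataBlocks) {
        int bat = 0;
        int xbat = 0;
        while (true) {
            int total = dataBlocks + bat + xbat;
            int newBat = (total + ENTRIES_PER_BAT_BLOCK - 1) / ENTRIES_PER_BAT_BLOCK;
            int newXbat = xbatBlocksFor(newBat);
            if (newBat == bat && newXbat == xbat) {
                return bat; // counts have stabilized
            }
            bat = newBat;
            xbat = newXbat;
        }
    }
}
```

For example, 128 data blocks need two BAT blocks under these assumptions: one BAT block would cover the data, but the BAT block itself is a 129th block that must also be recorded.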
The utility classes and interfaces are shown in the following class diagram. +
+
+
+
| Class/Interface | +Description | +
|---|---|
| BitField | +The BitField class is used primarily by HSSF code to manage bit-mapped + fields of HSSF records. It is not likely to be used in the POIFS code itself and is only + included here for the sake of complete documentation of the POI utility classes. + | +
| ByteField | +The ByteField class is an implementation of
+ FixedField
+ for the purpose of managing reading and writing to a byte-wide field in an array of
+ bytes.
+ |
+
| FixedField | +The FixedField interface defines a set of methods for reading a field from
+ an array of
+ bytes
+ or from an
+ InputStream, and for writing a field to an array of
+ bytes. Implementations typically require an offset in their constructors that,
+ for the purposes of reading and writing to an array of
+ bytes, makes sure that the correct bytes in the array are read or
+ written.
+ |
+
| HexDump | +The HexDump class is a debugging class that can be used to dump an array of
+ bytes
+ to an OutputStream. The static method
+ dump
+ takes an array of bytes, a long offset that is used to label the
+ output, an open
+ OutputStream, and an
+ int
+ index that specifies the starting index within the array of
+ bytes. The data is displayed 16 bytes per line, with each byte displayed in + hexadecimal format and again in printable form, if possible (a byte is considered printable + if its value is in the range of 32 ... 126). Here is an example of a small array of + bytes
+ with an offset of 0x110:
+ + |
+
| IntegerField | +The IntegerField class is an implementation of
+ FixedField
+ for the purpose of managing reading and writing to an integer-wide field in an array
+ of bytes.
+ |
+
| IntList | +The IntList class is a work-around for functionality missing in Java (see
+
+ https://developer.java.sun.com/developer/bugParade/bugs/4487555.html
+
+ for details); it is a simple growable array of ints that gets around the
+ requirement of wrapping and unwrapping ints in
+ Integer
+ instances in order to use the
+ java.util.List
+ interface.
+ + IntList + mimics the functionality of the + java.util.List
+ interface as much as possible.
+ |
+
| LittleEndian | +The LittleEndian class provides a set of static methods for reading and
+ writing
+ shorts,
+ ints, longs, and doubles in and out of
+ byte
+ arrays, and out of
+ InputStreams, preserving the Intel byte ordering and encoding of these values.
+ |
+
| LittleEndianConsts | +The
+ LittleEndianConsts
+ interface defines the width of a
+ short, int,
+ long, and
+ double
+ as stored by Intel processors.
+ |
+
| LongField | +The LongField class is an implementation of
+ FixedField
+ for the purpose of managing reading and writing to a long-wide field in an array of
+ bytes.
+ |
+
| ShortField | +The ShortField class is an implementation of
+ FixedField
+ for the purpose of managing reading and writing to a short-wide field in an array of
+ bytes.
+ |
+
| ShortList | +The ShortList class is a work-around for functionality missing in Java (see
+
+ https://developer.java.sun.com/developer/bugParade/bugs/4487555.html
+
+ for details); it is a simple growable array of shorts that gets around the
+ requirement of wrapping and unwrapping shorts in
+ Short
+ instances in order to use the
+ java.util.List
+ interface.
+ + ShortList + mimics the functionality of the + java.util.List
+ interface as much as possible.
+ |
+
| StringUtil | +The StringUtil class manages the processing of Unicode strings. + | +
This section describes the scenarios of how the POIFS classes and interfaces will be used to convert an + appropriate XML stream to a POIFS output stream containing an HSSF document. +
+It is broken down as suggested by the following scenario diagram: +
+
+
+
| Step | +Description | +
|---|---|
| 1 | ++ The Filesystem is created by the client application. + + | +
| 2 | +The client application tells the Filesystem to create a document,
+ providing an
+ InputStream
+ and the name of the document. This may be repeated several times.
+ |
+
| 3 | +
+ The client application asks the Filesystem to write its data to
+ an OutputStream.
+
+ |
+
Initialization of the POIFS system is shown in the following scenario diagram: +
+
+
+
| Step | +Description | +
|---|---|
| 1 | +The + Filesystem + object, which is created for each request to convert an appropriate XML stream to a POIFS + output stream containing an HSSF document, creates its + PropertyTable. + | +
| 2 | +The
+ PropertyTable
+ creates its
+ RootProperty
+ instance, making the RootProperty the first
+ Property
+ in its List of Property instances.
+ |
+
| 3 | +The + Filesystem + creates its + HeaderBlock + instance. It should be noted that the decision to create the HeaderBlock at Filesystem + initialization is arbitrary; creation of the HeaderBlock could easily and harmlessly be + postponed to the appropriate moment in + writing the filesystem. + | +
Creating and adding a document to a POIFS system is shown in the following scenario diagram: +
+
+
+
| Step | +Description | +
|---|---|
| 1 | +The
+ Filesystem
+ instance creates a new
+ Document
+ instance. It will store the newly created Document in a
+ List
+ of
+ BATManaged
+ instances.
+ |
+
| 2 | +The Document reads data from the provided
+ InputStream, storing the data in
+ DocumentBlock
+ instances. It keeps track of the byte count as it reads the data.
+ |
+
| 3 | +The Document creates a + DocumentProperty + to keep track of its property data. The byte count is stored in the newly created + DocumentProperty instance. + | +
| 4 | +The + Filesystem + requests the newly created + DocumentProperty + from the newly created + Document + instance. + | +
| 5 | +The
+ Filesystem
+ sends the newly created
+ DocumentProperty
+ to the Filesystem's
+ PropertyTable
+ so that the PropertyTable can add the DocumentProperty to its
+ List
+ of
+ Property
+ instances.
+ |
+
| 6 | +The Filesystem gets the + RootProperty + from its PropertyTable. + | +
| 7 | +The Filesystem adds the newly created + DocumentProperty + to the RootProperty. + | +
Although typical deployment of the POIFS system will only entail adding a single + Document + (the workbook) to the Filesystem, there is nothing in the design to + prevent multiple Documents from being added to the Filesystem. This flexibility can be employed to + write summary information document(s) in addition to the workbook. +
+Writing the filesystem is shown in the following scenario diagram: +
+
+
+
| Step | +Description | +|
|---|---|---|
| 1 | +The Filesystem adds the
+ PropertyTable
+ to its List of
+ BATManaged
+ instances and calls the PropertyTable's
+ preWrite
+ method. The action taken by the PropertyTable is shown in
+ the PropertyTable preWrite scenario diagram.
+ |
+ |
| 2 | +The + Filesystem + creates the BlockAllocationTable. + | +|
| 3 | +The Filesystem gets the block count from the + BATManaged + instance. + | +These three steps are repeated for each
+ BATManaged
+ instance in the Filesystem's
+ List
+ of BATManaged instances (i.e., the Documents, in order of their
+ addition to the Filesystem, followed by the PropertyTable).
+ |
+
| 4 | +The + Filesystem + sends the block count to the + BlockAllocationTable, which adds the appropriate entries to its + IntList + of entries, returning the starting block for the newly added entries. + | +
| 5 | +The + Filesystem + gives the start block number to the + BATManaged + instance. If the BATManaged instance is a Document, it sets the + start block field in its + DocumentProperty. + | +|
| 6 | +The + Filesystem + tells the + BlockAllocationTable + to create its BATBlocks. + | +
| 7 | +The + Filesystem + gives the BAT information to the HeaderBlock so that it can set + its BAT fields and, if necessary, create XBAT blocks. + | +|
| 8 | +If the filesystem is unusually large (over 7MB), the + HeaderBlock + will create XBAT blocks to contain the BAT data that it cannot hold directly. In this case, + the + Filesystem + tells the HeaderBlock where those additional blocks will be stored. + | +|
| 9 | +The + Filesystem + gives the + PropertyTable + start block to the HeaderBlock. + | +|
| 10 | +The
+ Filesystem
+ tells the
+ BlockWritable
+ instance to write its blocks to the provided
+ OutputStream. This step is repeated for each BlockWritable instance, in + this order: + +
|
+ |
+
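Steps 3 through 5 above can be pictured with a small sketch. It assumes that each call reserves consecutive blocks, that the chain is terminated with the end-of-chain marker -2 given in the file format description, and that a request for zero blocks returns that marker; none of these details is confirmed by the design text. The class SimpleBat is hypothetical and uses a java.util.List in place of the IntList described earlier.

```java
import java.util.ArrayList;
import java.util.List;

final class SimpleBat {
    static final int END_OF_CHAIN = -2;

    private final List<Integer> entries = new ArrayList<>();

    /**
     * Reserves blockCount consecutive blocks and returns the index of the
     * first one (assumption: END_OF_CHAIN is returned when nothing is needed).
     */
    int allocateSpace(int blockCount) {
        if (blockCount <= 0) {
            return END_OF_CHAIN;
        }
        int start = entries.size();
        for (int i = 0; i < blockCount - 1; i++) {
            entries.add(start + i + 1); // each block points to the next one
        }
        entries.add(END_OF_CHAIN);      // the last block terminates the chain
        return start;
    }

    List<Integer> entries() {
        return entries;
    }
}
```

Allocating three blocks and then two blocks returns start blocks 0 and 3, leaving the entries 1, 2, -2, 4, -2.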
+
| Step | +Description | +
|---|---|
| 1 | +The
+ PropertyTable
+ calls
+ setIndex
+ for each of its
+ Property
+ instances, so that each Property now knows its index within the PropertyTable's List
+ of Property instances.
+ |
+
| 2 | +The + PropertyTable + requests the + PropertyBlock + class to create an array of + PropertyBlock + instances. + | +
| 3 | + +The
+ PropertyBlock
+ calculates the number of empty
+ Property
+ instances it needs to create and creates them. The algorithm for the number to create is:
+ + |
+
| 4 | +The
+ PropertyBlock
+ creates the required number of
+ PropertyBlock
+ instances from the
+ List
+ of
+ Property
+ instances, including the newly created empty
+ Property
+ instances.
+ |
+
| 5 | +The
+ PropertyTable
+ calls
+ preWrite
+ on each of its
+ Property
+ instances. For
+ DocumentProperty
+ instances, this call is a no-op. For the RootProperty, the
+ action taken is shown in the RootProperty preWrite scenario
+ diagram.
+ |
+
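One way to compute the padding needed in step 3 is sketched below. It assumes, as the file format description states, that properties are 128-byte records stored in 512-byte blocks, so four properties fill one block. The class and method names are hypothetical and are not taken from the POIFS source.

```java
final class PropertyPadding {
    static final int PROPERTIES_PER_BLOCK = 512 / 128; // four 128-byte records per block

    /** Number of blocks needed to hold propertyCount properties. */
    static int blockCount(int propertyCount) {
        return (propertyCount + PROPERTIES_PER_BLOCK - 1) / PROPERTIES_PER_BLOCK;
    }

    /** Number of empty properties required to fill the last block completely. */
    static int emptyPropertiesNeeded(int propertyCount) {
        return blockCount(propertyCount) * PROPERTIES_PER_BLOCK - propertyCount;
    }
}
```

With five properties, two blocks are needed and three empty properties pad the second block.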
+
+
| Step | +Description | +|
|---|---|---|
| 1 | +The
+ RootProperty
+ sets its child property with the index of the child Property that is
+ first in its List of children.
+ |
+ |
| 2 | +The
+ RootProperty
+ sets its child's next property field with the index of the child's next sibling in the
+ RootProperty's
+ List
+ of children. If the child is the last in the
+ List, its next property field is set to -1.
+ |
+ These two steps are repeated for each File in
+ the
+ RootProperty's
+ List
+ of children.
+ |
+
| 3 | +The
+ RootProperty
+ sets its child's previous property field with a value of
+ -1.
+ |
+ |
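The three steps above can be sketched as a single pass over the list of children. The Node class is a hypothetical stand-in for the navigation fields of a Property, and the code is an illustration of the described behavior, not the POIFS implementation.

```java
import java.util.List;

final class SiblingLinker {
    static final int NO_INDEX = -1;

    /** Hypothetical stand-in for a property's index and navigation fields. */
    static final class Node {
        final int index;
        int child = NO_INDEX;
        int next = NO_INDEX;
        int previous = NO_INDEX;

        Node(int index) {
            this.index = index;
        }
    }

    static void link(Node root, List<Node> children) {
        // Step 1: the root points at the first child, if there is one.
        root.child = children.isEmpty() ? NO_INDEX : children.get(0).index;
        for (int i = 0; i < children.size(); i++) {
            Node current = children.get(i);
            // Step 2: next sibling's index, or -1 for the last child.
            current.next = (i + 1 < children.size()) ? children.get(i + 1).index : NO_INDEX;
            // Step 3: the previous property field is always -1.
            current.previous = NO_INDEX;
        }
    }
}
```

The result is a simple forward chain: the children form a singly linked list reachable from the root's child property.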
It is possible for one OLE 2 based document to have other + OLE 2 documents embedded in it. For example, an Excel file + may have a Word document and a PowerPoint slideshow + embedded as part of it.
+Normally, these other documents are stored in subdirectories + of the OLE 2 (POIFS) filesystem. The exact location of the + embedded documents will vary depending on the type of the + master document, and the exact directory names will differ + each time. To figure out exactly which directory to look + in, you will either need to process the appropriate OLE 2 + linking entry in the master document, or simply iterate + over all the directories in the filesystem.
+As a general rule, the subdirectories contain the same + OLE 2 entries that you would find at the root of the + filesystem if the document were not embedded.
+ +Excel normally stores embedded files in subdirectories + of the filesystem root. Typically these subdirectories + are named starting with MBD, with 8 hex characters following.
+Word normally stores embedded files in subdirectories + of the ObjectPool directory, itself a subdirectory of the + filesystem root. Typically these subdirectories are named + starting with an underscore, followed by 10 digits.
+PowerPoint does not normally store embedded files
+ in the OLE2 layer. Instead, they are held within records
+ of the main PowerPoint file.
+
See the HSLF Tutorial
+ for how to retrieve embedded OLE objects from a presentation.
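When iterating over all the directories, the typical Excel and Word naming conventions described above can be used as a first filter. The sketch below only tests directory names against those conventions; the class and method names are hypothetical, and since the text says the names are merely typical, a match is a hint rather than proof that a directory holds an embedded document.

```java
import java.util.regex.Pattern;

final class EmbeddedDirNames {
    // "MBD" followed by 8 hex characters (typical for Excel).
    private static final Pattern EXCEL = Pattern.compile("MBD[0-9A-Fa-f]{8}");
    // An underscore followed by 10 digits (typical for Word, inside ObjectPool).
    private static final Pattern WORD = Pattern.compile("_[0-9]{10}");

    static boolean looksLikeExcelEmbedding(String directoryName) {
        return EXCEL.matcher(directoryName).matches();
    }

    static boolean looksLikeWordEmbedding(String directoryName) {
        return WORD.matcher(directoryName).matches();
    }
}
```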
POIFS provides a simple tool for listing the contents of + OLE2 files. This lets you see what your POIFS file + contains, and hence whether it has any embedded documents in it, + and where.
+The tool to use is org.apache.poi.poifs.dev.POIFSLister. + This tool may be run from the command line, and takes a filename + as its parameter. It will print out all the directories and + files contained within the POIFS file.
+All of the POIDocument classes (HSSFWorkbook, HSLFSlideShow, + HWPFDocument and HDGFDiagram) can either be opened from + a POIFSFileSystem, or from a specific directory within a + POIFSFileSystem. So, to open embedded files, simply locate the + appropriate DirectoryNode that represents the subdirectory + of interest, and pass this, together with the overall POIFSFileSystem, to + the constructor.
+If you want to extract the textual contents of the embedded file, + then open the appropriate POIDocument, and then pass this to + the extractor class, instead of simply passing the POIFSFileSystem + to the extractor.
+POIFS file systems are essentially normal files stored on a + Java-compatible platform's native file system. They are + typically identified by names ending in a four character + extension noting what type of data they contain. For + example, a file ending in ".xls" would likely + contain spreadsheet data, and a file ending in + ".doc" would probably contain a word processing + document. POIFS file systems are called "file + system", because they contain multiple embedded files + in a manner similar to traditional file systems. Along + functional lines, it would be more accurate to call these + POIFS archives. For the remainder of this document it is + referred to as a file system in order to avoid confusion + with the "files" it contains.
+POIFS file systems are compatible with those document + formats used by a well-known software company's popular + office productivity suite and programs outputting + compatible data. Because the POIFS file system does not + provide compression, encryption or any other worthwhile + feature, it's not a good choice unless you require + interoperability with these programs.
+The POIFS file system does not encode the documents + themselves. For example, if you had a word processor file + with the extension ".doc", you would actually + have a POIFS file system with a document file archived + inside of that file system.
+Note - this document is a good overview and explanation of + the file format, but for the very nitty-gritty details, + you should refer to + [MS-CFB].pdf + in the (now public) Microsoft Documentation.
+This document utilizes the numeric types as described by + the Java Language Specification, which can be found at + https://java.sun.com. In + short:
+The Java Language Specification spells out a number of + other types that are not referred to by this document.
+Where this document makes references to "endian
+ conversion" it is referring to the byte order of
+ stored numbers. Numbers in "little-endian order"
+ are stored with the least significant byte first. In
+ order to properly read a short, for example, you'd read two
+ bytes and then shift the second byte 8 bits to the left
+ before combining it with the first byte using a bitwise
+ OR. The following code illustrates this
+ method:
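A minimal sketch of the technique follows. The class name is illustrative, and the masking with 0xFF is needed because Java bytes are signed and would otherwise be sign-extended when widened to int.

```java
final class LittleEndianDemo {
    /** Reads a little-endian 16-bit value starting at the given offset. */
    static short getShort(byte[] data, int offset) {
        int low = data[offset] & 0xFF;      // least significant byte is stored first
        int high = data[offset + 1] & 0xFF; // mask to undo sign extension
        return (short) ((high << 8) | low);
    }
}
```

For example, the byte pair 0x3B 0x00 decodes to 0x003B, and 0xFE 0xFF decodes to -2.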
This is a walkthrough of a POIFS file system and how it is + put together. It is not intended to give a concise + description but to give a "big picture" of the + general structure and how it's interpreted.
+A POIFS file system begins with a header. This header + identifies locations in the file by function and provides a + sanity check identifying a file as a POIFS file system.
+The first 64 bits of the header compose a magic number + identifier. This identifier tells the client software + that this is indeed a POIFS file system and that it should + be treated as such. This is a "sanity check" to + make sure this is a POIFS file system and not some other + format. The header also contains an array of block + numbers. These block numbers refer to blocks in the + file. When these blocks are read together they form the + Block Allocation Table. The header also contains a + pointer to the first element in the property table, + also known as the root element, and a pointer to the + small Block Allocation Table (SBAT).
+The block allocation table or BAT, along with + the property table, specify which blocks in the file + system belong to which files. After the header block, the + file system is divided into identically sized blocks of + data, numbered from 0 to however many blocks there are in + the file system. For each file in the file system, its + entry in the property table includes the index of the first + block in the array of blocks. Each block's index into the + array of blocks is also its index into the BAT, and the + integer value stored at that index in the BAT gives the + index of the next block in the array (and thus the index of + the next BAT value). A special value is stored in the BAT + to indicate "end of file".
+The property table is essentially the directory + storage for the file system. It consists of the name of the + file or directory, its start block in both the file + system and BAT, and its actual size. The first + property in the property table is the root + element. It has two purposes: to be a directory entry + (the root of the directory tree, to be specific), and to + hold the start block for the small block data.
+Small block data is a special file that contains the data + for small files (less than 4K bytes). It subdivides its + blocks into smaller blocks and there is a special small + block allocation table that, like the main BAT for larger + files, is used to map a small file to its small blocks.
+The POIFS file system begins with a header
+ block. The first 64 bits of the header form a long
+ file type id or magic number identifier of
+ 0xE11AB1A1E011CFD0L. This is basically a
+ sanity check. If this isn't the first thing in the header
+ (and consequently the file system) then this is not a
+ POIFS file system and should be read with some other
+ library.
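The check can be sketched with java.nio.ByteBuffer. Because the identifier is stored in little-endian order, the first eight bytes of a POIFS file are D0 CF 11 E0 A1 B1 1A E1. The class and method names below are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class MagicCheck {
    static final long POIFS_MAGIC = 0xE11AB1A1E011CFD0L;

    /** Returns true if the first 8 bytes hold the POIFS file type id. */
    static boolean hasPoifsSignature(byte[] header) {
        if (header.length < 8) {
            return false;
        }
        long id = ByteBuffer.wrap(header, 0, 8)
                .order(ByteOrder.LITTLE_ENDIAN)
                .getLong();
        return id == POIFS_MAGIC;
    }
}
```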
It's important to know the most important parts of the + header. These are discussed in the rest of this + section.
+At offset 0x2C is an int specifying the number + of elements in the BAT array. The array at + 0x4C is an array of ints. This array contains the + indices of every block in the Block Allocation + Table.
+Very large POIFS archives may have more blocks than can + be addressed by the BAT blocks enumerated in the header + block. How large? Well, the BAT array in the header can + contain up to 109 BAT block indices; each BAT block + references up to 128 blocks, and each block is 512 + bytes, so we're talking about 109 * 128 * 512 = + 6.8MB. That's a pretty respectable document! But, you + could have much more data than that, and in today's + world of cheap gigabyte drives, why not? So, the BAT + may be extended in that event. The integer value at + offset 0x44 of the header is the index of the + first extended BAT (XBAT) block. At offset + 0x48 of the header, there is an int value that + specifies how many XBAT blocks there are. The XBAT + blocks begin at the specified index into the array of + blocks making up the POIFS file system, and are chained + for the specified count of XBAT blocks.
+Each XBAT block contains the indices of up to 127 BAT + blocks, so the document size can be expanded by another + ~8MB for each XBAT block. The BAT blocks indexed by an + XBAT block are appended to the end of the list of BAT + blocks enumerated in the header block. Thus the BAT + blocks enumerated in the header block are BAT blocks 0 + through 108, the BAT blocks enumerated in the first + XBAT block are BAT blocks 109 through 235, the BAT + blocks enumerated in the second XBAT block are BAT + blocks 236 through 362, and so on.
+While a normal BAT block holds 128 entries, each XBAT + only references 127 BAT blocks. The last, 128th entry + in an XBAT is the offset to the next XBAT block in the + chain (or -1 if this is the last XBAT).
+Through the use of XBAT blocks, the limit on the + overall document size is that imposed by the 4-byte + block indices; if the indices are unsigned ints, the + maximum file size is 2 terabytes, 1 terabyte if the + indices are treated as signed ints. Either way, I have + yet to see a disk drive large enough to accommodate + such a file on the shelves at the local office supply + stores.
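The figures quoted in this section can be reproduced with a few lines of arithmetic. The sketch below uses only the constants stated above (512-byte blocks, 128 entries per BAT block, 109 BAT indices in the header, 127 BAT references per XBAT block); the class and method names are hypothetical.

```java
final class BatCapacity {
    static final long BLOCK_SIZE = 512;
    static final long ENTRIES_PER_BAT_BLOCK = 128;
    static final long HEADER_BAT_SLOTS = 109;
    static final long BAT_REFS_PER_XBAT_BLOCK = 127;

    /** Bytes addressable using only the BAT blocks listed in the header. */
    static long bytesWithoutXbat() {
        return HEADER_BAT_SLOTS * ENTRIES_PER_BAT_BLOCK * BLOCK_SIZE;
    }

    /** Additional bytes addressable per XBAT block. */
    static long bytesPerXbatBlock() {
        return BAT_REFS_PER_XBAT_BLOCK * ENTRIES_PER_BAT_BLOCK * BLOCK_SIZE;
    }

    /** Index of the first BAT block listed in the given XBAT block (0-based). */
    static long firstBatIndexInXbat(int xbatNumber) {
        return HEADER_BAT_SLOTS + xbatNumber * BAT_REFS_PER_XBAT_BLOCK;
    }

    /** Upper bound on file size imposed by 4-byte block indices. */
    static long maxBytes(boolean unsignedIndices) {
        long blocks = unsignedIndices ? (1L << 32) : (1L << 31);
        return blocks * BLOCK_SIZE;
    }
}
```

bytesWithoutXbat() yields 7,143,424 bytes (about 6.8 MB), bytesPerXbatBlock() yields 8,323,072 bytes (about 7.9 MB), and maxBytes gives 2 TB for unsigned indices and 1 TB for signed ones.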
+If a file contained in a POIFS archive is smaller than + 4096 bytes, it is stored in small blocks. Small blocks + are 64 bytes in length and are contained within big + blocks, up to 8 to a big block. As the main BAT is used + to navigate the array of big blocks, so the small + block allocation table is used to navigate the + array of small blocks. The SBAT's start block index is + found at offset 0x3C of the header block, and + remaining blocks constituting the SBAT are found by + walking the main BAT as if it were an ordinary file in + the POIFS file system (this process is described + below).
+An integer at address 0x30 specifies the start + index of the property table. This integer is specified + as a "block index". The Property Table + is stored, as is almost everything in a POIFS file + system, in big blocks and walked via the BAT. The + Property Table is described below.
+The property table is essentially nothing more than the + directory system. Properties are 128 byte records + contained within the 512 byte blocks. The first property + is always the Root Entry. The following applies to + individual properties within a property table:
+The Root Entry in the Property Table + contains the information necessary to read and write + small files, which are files less than 4096 bytes + long. The start block field of the Root Entry is the + start index of the Small Block Array, which is + read like any other file in the POIFS file system. Since + the SBAT cannot be used without the Small Block Array, + the Root Entry MUST be read or written using the Block + Allocation Table. The blocks making up the Small + Block Array are divided into 64-byte small blocks, up to + the size indicated in the Root Entry (which should always + be a multiple of 64).
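Locating a small block inside the Small Block Array is a matter of integer division, as the sketch below shows. It assumes the sizes given above (512-byte big blocks, 64-byte small blocks, a 4096-byte threshold); the names are hypothetical. Note that bigBlockOrdinal returns a position within the chain of big blocks that make up the Small Block Array, not a block index in the file; the chain itself still has to be walked via the BAT.

```java
final class SmallBlocks {
    static final int BIG_BLOCK_SIZE = 512;
    static final int SMALL_BLOCK_SIZE = 64;
    static final int SMALL_FILE_LIMIT = 4096;
    static final int SMALL_PER_BIG = BIG_BLOCK_SIZE / SMALL_BLOCK_SIZE; // 8

    /** Files shorter than 4096 bytes are stored in small blocks. */
    static boolean usesSmallBlocks(int fileSize) {
        return fileSize < SMALL_FILE_LIMIT;
    }

    /** Which big block of the Small Block Array's chain holds this small block. */
    static int bigBlockOrdinal(int smallBlockIndex) {
        return smallBlockIndex / SMALL_PER_BIG;
    }

    /** Byte offset of the small block within that big block. */
    static int offsetInBigBlock(int smallBlockIndex) {
        return (smallBlockIndex % SMALL_PER_BIG) * SMALL_BLOCK_SIZE;
    }
}
```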
+The individual properties form a directory tree, with the + Root Entry as the directory tree's root, as shown + in the accompanying drawing. Note the numbers in + parentheses in each node; they represent the node's index + in the array of properties. The NEXT_PROP, + PREVIOUS_PROP, and CHILD_PROP fields hold + these indices, and are used to navigate the tree.
+
Each directory entry (i.e., a property whose type is + directory or root entry) uses its + CHILD_PROP field to point to one of its + subordinate (child) properties. It doesn't seem to matter + which of its children it points to. Thus in the previous + drawing, the Root Entry's CHILD_PROP field may contain 1, + 4, or the index of one of its other children. Similarly, + the directory node (index 1) may have, in its CHILD_PROP + field, 2, 3, or the index of one of its other + children.
+The children of a given directory property point to each + other in a similar fashion by using their + NEXT_PROP and PREVIOUS_PROP fields.
+Unused NEXT_PROP, PREVIOUS_PROP, and + CHILD_PROP fields contain the marker value of + -1. For example, all file properties have a value of -1 for their + CHILD_PROP fields.
+The BAT blocks are pointed at by the bat array + contained in the header and supplemented, if necessary, + by the XBAT blocks. These blocks form a large + table of integers. These integers are block numbers. The + Block Allocation Table holds chains of integers. + These chains are terminated with -2. The elements in + these chains refer to blocks in the files. The starting + block of a file is NOT specified in the BAT. It is + specified by the property for a given file. The + elements in this BAT are both the block number (within + the file minus the header) and the number of the + next BAT element in the chain. This can be thought of as + a linked list of blocks. The BAT array contains the links + from one block to the next, including the end of chain + marker.
+Here's an example: Let's assume that the BAT begins as + follows:
+BAT[ 0 ] = 2
BAT[ 1 ] = 5
BAT[ 2 ] = 3
BAT[ 3 ] = 4
BAT[ 4 ] = 6
BAT[ 5 ] = -2
BAT[ 6 ] = 7
BAT[ 7 ] = -2
...
Now, if we have a file whose Property Table entry says it + begins with index 0, we walk the BAT array and see that + the file consists of blocks 0 (because the start block is + 0), 2 (because BAT[ 0 ] is 2), 3 (BAT[ 2 ] is 3), 4 (BAT[ + 3 ] is 4), 6 (BAT[ 4 ] is 6), and 7 (BAT[ 6 ] is 7). It + ends at block 7 because BAT[ 7 ] is -2, which is the end + of chain marker.
+Similarly, a file beginning at index 1 consists of + blocks 1 and 5.
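The walk described in this example can be sketched as a loop that follows the BAT from the start block until it reaches the end-of-chain marker. The class name and the corruption check are additions for illustration; the BAT contents in the usage note are the ones from the example above.

```java
import java.util.ArrayList;
import java.util.List;

final class BatWalker {
    static final int END_OF_CHAIN = -2;

    /** Returns the block indices of the file that starts at startBlock. */
    static List<Integer> blocksOf(int[] bat, int startBlock) {
        List<Integer> blocks = new ArrayList<>();
        int current = startBlock;
        while (current != END_OF_CHAIN) {
            // Guard against out-of-range indices and cycles in a damaged table.
            if (current < 0 || current >= bat.length || blocks.size() > bat.length) {
                throw new IllegalStateException("corrupt BAT chain at " + current);
            }
            blocks.add(current);
            current = bat[current]; // the BAT entry is the index of the next block
        }
        return blocks;
    }
}
```

With the table {2, 5, 3, 4, 6, -2, 7, -2}, a start block of 0 yields blocks 0, 2, 3, 4, 6, 7 and a start block of 1 yields blocks 1, 5.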
+Other special numbers in a BAT array are:
+The following outlines the basic file system structures.
+| Field | +Description | +Offset | +Length | +Default value or const | +
| FILETYPE | +Magic number identifying this as a POIFS file + system. | +0x0000 | +Long | +0xE11AB1A1E011CFD0 | +
| UK1 | +Unknown constant | +0x0008 | +Integer | +0 | +
| UK2 | +Unknown Constant | +0x000C | +Integer | +0 | +
| UK3 | +Unknown Constant | +0x0014 | +Integer | +0 | +
| UK4 | +Unknown Constant (revision?) | +0x0018 | +Short | +0x003B | +
| UK5 | +Unknown Constant (version?) | +0x001A | +Short | +0x0003 | +
| UK6 | +Unknown Constant | +0x001C | +Short | +-2 | +
| LOG_2_BIG_BLOCK_SIZE | +Log, base 2, of the big block size | +0x001E | +Short | +9 (2 ^ 9 = 512 bytes) | +
| LOG_2_SMALL_BLOCK_SIZE | +Log, base 2, of the small block size | +0x0020 | +Integer | +6 (2 ^ 6 = 64 bytes) | +
| UK7 | +Unknown Constant | +0x0024 | +Integer | +0 | +
| UK8 | +Unknown Constant | +0x0028 | +Integer | +0 | +
| BAT_COUNT | +Number of elements in the BAT array | +0x002C | +Integer | +required | +
| PROPERTIES_START | +Block index of the first block of the property + table | +0x0030 | +Integer | +required | +
| UK9 | +Unknown Constant | +0x0034 | +Integer | +0 | +
| UK10 | +Unknown Constant | +0x0038 | +Integer | +0x00001000 | +
| SBAT_START | +Block index of first big block containing the small + block allocation table (SBAT) | +0x003C | +Integer | +-2 | +
| SBAT_Block_Count | +Number of big blocks holding the SBAT | +0x0040 | +Integer | +1 | +
| XBAT_START | +Block index of the first block in the Extended Block + Allocation Table (XBAT) | +0x0044 | +Integer | +-2 | +
| XBAT_COUNT | +Number of elements in the Extended Block Allocation + Table (to be added to the BAT) | +0x0048 | +Integer | +0 | +
| BAT_ARRAY | +Array of block indices constituting the Block + Allocation Table (BAT) | +0x004C, 0x0050, 0x0054 ... 0x01FC | +Integer[] | +-1 for unused elements, at least first element must + be filled. | +
| N/A | +Header block data not otherwise described in this + table | +N/A | +N/A | +-1 | +
| + Field + | ++ Description + | ++ Offset + | ++ Length + | ++ Default value or const + | +
| BAT_ELEMENT | +Any given element in the BAT block | +0x0000, 0x0004, 0x0008, ... 0x01FC | +Integer | +
+ -1 = unused + -2 = end of chain + -3 = special (e.g., BAT block) + All other values point to the next element in the + chain and the next index of a block composing the + file. + |
+
| Field | +Description | +Offset | +Length | +Default value or const | +
| Properties[] | +This block contains the properties. | +0x0000, 0x0080, 0x0100, 0x0180 | +128 bytes | +All unused space is set to -1. | +
| Field | +Description | +Offset | +Length | +Default value or const | +
| NAME | +A unicode null-terminated uncompressed 16bit string + (lose the high bytes) containing the name of the + property. | +0x00, 0x02, 0x04, ... 0x3E | +Short[] | +0x0000 for unused elements, field required, 32 + elements (0x40 bytes) max | +
| NAME_SIZE | +Number of characters in the NAME field | +0x40 | +Short | +Required | +
| PROPERTY_TYPE | +Property type (directory, file, or root) | +0x42 | +Byte | +1 (directory), 2 (file), or 5 (root entry) | +
| NODE_COLOR | +Node color | +0x43 | +Byte | +0 (red) or 1 (black) | +
| PREVIOUS_PROP | +Previous property index | +0x44 | +Integer | +-1 | +
| NEXT_PROP | +Next property index | +0x48 | +Integer | +-1 | +
| CHILD_PROP | +First child property index | +0x4c | +Integer | +-1 | +
| SECONDS_1 | +Seconds component of the created timestamp? | +0x64 | +Integer | +0 | +
| DAYS_1 | +Days component of the created timestamp? | +0x68 | +Integer | +0 | +
| SECONDS_2 | +Seconds component of the modified timestamp? | +0x6C | +Integer | +0 | +
| DAYS_2 | +Days component of the modified timestamp? | +0x70 | +Integer | +0 | +
| START_BLOCK | +Starting block of the file, used as the first block + in the file and the pointer to the next block from + the BAT | +0x74 | +Integer | +Required | +
| SIZE | +Actual size of the file this property points + to. (used to truncate the blocks to the real + size). | +0x78 | +Integer | +0 | +
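The offsets in the property table above can be applied directly to a 128-byte record. The sketch below reads the fields with java.nio.ByteBuffer in little-endian order. It is an illustration only: the class is hypothetical, it reads the name up to the null terminator rather than relying on NAME_SIZE, and it ignores the timestamp fields.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class PropertyRecord {
    final String name;
    final int type;        // 1 (directory), 2 (file), or 5 (root entry)
    final int previous;
    final int next;
    final int child;
    final int startBlock;
    final int size;

    /** Parses the 128-byte property that begins at offset within block. */
    PropertyRecord(byte[] block, int offset) {
        ByteBuffer buf = ByteBuffer.wrap(block, offset, 128)
                .slice()
                .order(ByteOrder.LITTLE_ENDIAN);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 32; i++) {          // NAME: at most 32 16-bit elements
            char c = buf.getChar(i * 2);
            if (c == 0) {
                break;                          // null terminator
            }
            sb.append(c);
        }
        name = sb.toString();
        type = buf.get(0x42) & 0xFF;            // PROPERTY_TYPE
        previous = buf.getInt(0x44);            // PREVIOUS_PROP
        next = buf.getInt(0x48);                // NEXT_PROP
        child = buf.getInt(0x4C);               // CHILD_PROP
        startBlock = buf.getInt(0x74);          // START_BLOCK
        size = buf.getInt(0x78);                // SIZE
    }
}
```

Since a 512-byte property block holds four records, the properties of one block sit at offsets 0x0000, 0x0080, 0x0100, and 0x0180.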
This document describes how to use the POIFS APIs to read, write, and modify files that employ a + POIFS-compatible data structure to organize their content. +
+This document is intended for Java developers who need to use the POIFS APIs to read, write, or + modify files that employ a POIFS-compatible data structure to organize their content. It is not + necessary for developers to understand the POIFS data structures, and an explanation of those data + structures is beyond the scope of this document. It is expected that the members of the target + audience will understand the rudiments of a hierarchical file system, and familiarity with the event + pattern employed by Java APIs such as AWT would be helpful. +
+This document attempts to be consistent in its terminology, which is defined here:
+The POIFS API provides ways to read, modify and write files and streams that employ a POIFS-compatible + data structure to organize their content. The following use cases are covered: +
+This section covers reading a file system. There are two ways to read a file system; these techniques are + sketched out in the following table, and then explained in greater depth in the sections following the + table. +
+In this technique for reading, certain key structures are loaded into memory, and the entire + directory tree can be walked by the application, reading specific documents at leisure. +
+If you create a POIFSFileSystem instance from a File, the memory footprint is very small. However, if + you createa a POIFSFileSystem instance from an input stream, then the whole contents must be + buffered into memory to allow random access. As such, you should budget on memory use of up to 20% + of the file size when using a File, or up to 120% of the file size when using an InputStream. +
+ +Before an application can read a file from the file system, the file system needs to be opened
+ and core parts processed. This is done using the
+ org.apache.poi.poifs.filesystem.POIFSFileSystem
+ class. Once the file system has been loaded into memory, the application may need the root
+ directory. The following code fragment will accomplish this preparation stage:
+
Assuming no exception was thrown, the file system can then be read.
+Once the file system has been loaded into memory and the root directory has been obtained, the
+ root directory can be read. The following code fragment shows how to read the entries in an
+ org.apache.poi.poifs.filesystem.DirectoryEntry
+ instance:
+
There are a couple of ways to read a document, depending on whether the document resides in the
+ root directory or in another directory. Either way, you will obtain an
+ org.apache.poi.poifs.filesystem.DocumentInputStream
+ instance.
+
The DocumentInputStream class is a simple implementation of InputStream that makes a few + guarantees worth noting: +
+available()
+ always returns the number of bytes in the document from your current position in the
+ document.
+ markSupported()
+ returns true.
+ mark(int limit)
+ ignores the limit parameter; basically the method marks the current position in the
+ document.
+ reset()
+ takes you back to the position when mark() was last called, or to the
+ beginning of the document if mark() has not been called.
+ skip(long n)
+ will take you to your current position + n (but not past the end of the document).
+ The behavior of available means you can read in a document in a single read call
+ like this:
+
The combination of mark, reset, and skip provide the
+ basic mechanisms needed for random access of the document contents.
+
If the document resides in the root directory, you can obtain a DocumentInputStream
+ like this:
+
A more generic technique for reading a document is to obtain an
+ org.apache.poi.poifs.filesystem.DirectoryEntry
+ instance for the directory containing the desired document (recall that you can use
+ getRoot()
+ to obtain the root directory from its file system). From that DirectoryEntry, you can
+ then obtain a DocumentInputStream like this:
+
The event-driven API for reading documents is a little more complicated and requires that your + application know, in advance, which files it wants to read. The benefit of using this API is that + each document is in memory just long enough for your application to read it, and documents that you + never read at all are not in memory at all. When you're finished reading the documents you wanted, + the file system has no data structures associated with it at all and can be discarded. +
+The preparation phase involves creating an instance of
+ org.apache.poi.poifs.eventfilesystem.POIFSReader
    and then registering one or more
+ org.apache.poi.poifs.eventfilesystem.POIFSReaderListener
+ instances with the POIFSReader.
+
+ org.apache.poi.poifs.eventfilesystem.POIFSReaderListener
+ is an interface used to register for documents. When a matching document is read by the
+ org.apache.poi.poifs.eventfilesystem.POIFSReader, the POIFSReaderListener instance
+ receives an org.apache.poi.poifs.eventfilesystem.POIFSReaderEvent instance, which
+ contains an open DocumentInputStream and information about the document.
+
A POIFSReaderListener instance can register for individual documents, or it can
+ register for all documents; once it has registered for all documents, subsequent (and previous!)
+ registration requests for individual documents are ignored. There is no way to unregister
+ a POIFSReaderListener.
+
Thus, it is possible to register a single POIFSReaderListener for multiple documents
+ - one, some, or all documents. It is guaranteed that a single POIFSReaderListener will
+ receive exactly one notification per registered document. There is no guarantee as to the order
+ in which it will receive notification of its documents, as future implementations of
+ POIFSReader
+ are free to change the algorithm for walking the file system's directory structure.
+
It is also permitted to register more than one POIFSReaderListener for the same
+ document. There is no guarantee of ordering for notification of POIFSReaderListener instances
+ that have registered for the same document when POIFSReader processes that
+ document.
+
It is guaranteed that all notifications occur in the same thread. A future enhancement may be
+ made to provide multi-threaded notifications, but such an enhancement would very probably be
+ made in a new reader class, a ThreadedPOIFSReader perhaps.
+
The following describes the three ways to register a POIFSReaderListener for a
+ document or set of documents:
+
The org.apache.poi.poifs.filesystem.POIFSDocumentPath class is used to describe a
+ directory in a POIFS file system. Since there are no reserved characters in the name of a file
+ in a POIFS file system, a more traditional string-based solution for describing a directory,
+ with special characters delimiting the components of the directory name, is not feasible. The
+ constructors for the class are used as follows:
+
| + Constructor example + | ++ Directory described + | +
| new POIFSDocumentPath() | +The root directory. | +
| new POIFSDocumentPath(null) | +The root directory. | +
| new POIFSDocumentPath(new String[ 0 ]) | +The root directory. | +
| new POIFSDocumentPath(new String[ ] { "foo", "bar"} ) | +in Unix terminology, "/foo/bar". | +
| new POIFSDocumentPath(new POIFSDocumentPath(new String[] { "foo" }), new String[ ] { + "fu", "bar"} ) + | +in Unix terminology, "/foo/fu/bar". | +
Processing org.apache.poi.poifs.eventfilesystem.POIFSReaderEvent events is
+ relatively easy. After all of the POIFSReaderListener instances have been
+ registered with POIFSReader, the POIFSReader.read(InputStream stream) method
+ is called.
+
Assuming that there are no problems with the data, as the POIFSReader processes the
+ documents in the specified InputStream's data, it calls registered
+ POIFSReaderListener
+ instances' processPOIFSReaderEvent method with a POIFSReaderEvent
+ instance.
+
The POIFSReaderEvent instance contains information to identify the document (a
+ POIFSDocumentPath
+ object to identify the directory that the document is in, and the document name), and an
+ open DocumentInputStream instance from which to read the document.
+
Writing a file system is very much like reading a file system in that there are multiple ways to do so. + You can load an existing file system into memory and modify it (removing files, renaming files) and/or + add new files to it, and write it, or you can start with a new, empty file system: +
+There are two restrictions on the names of files in a file system that must be considered when + creating files: +
+A document can be created by acquiring a DirectoryEntry and calling one of the two
+ createDocument
+ methods:
+
Unlike reading, you don't have to choose between the in-memory and event-driven writing models; both + can co-exist in the same file system. +
+Writing is initiated when the POIFSFileSystem instance's writeFilesystem() method
+ is called with an OutputStream to write to.
+
The event-driven model is quite similar to the event-driven model for reading, in that the file
+ system calls your org.apache.poi.poifs.filesystem.POIFSWriterListener when it's time to
+ write your document, just as the POIFSReader calls your POIFSReaderListener
+ when it's time to read your document. Internally, when writeFilesystem() is
+ called, the final POIFS data structures are created and are written to the specified
+ OutputStream. When the file system needs to write a document out that was created with
+ the event-driven model, it calls the POIFSWriterListener back, calling its
+ processPOIFSWriterEvent()
+ method, passing an org.apache.poi.poifs.filesystem.POIFSWriterEvent instance.
+ This object contains the POIFSDocumentPath and name of the document, its size, and an
+ open org.apache.poi.poifs.filesystem.DocumentOutputStream to which to write. A
+ DocumentOutputStream
+ is a wrapper over the OutputStream that was provided to the
+ POIFSFileSystem
+ to write to, and has the responsibility of making sure that the document your application
+ writes fits within the size you specified for it.
+
If you are using a POIFSFileSystem loaded from a
+ File
+ with readOnly set to false, it is also possible to do an in-place write. Simply call
+ writeFilesystem()
+ to have the (limited) in-memory structures synced with the disk, then close() to
+ finish.
+
Creating a directory is similar to creating a document, except that there's only one way to do so: +
+As with reading documents, it is possible to create a new document or directory in the root directory + by using convenience methods of POIFSFileSystem. +
+| + DirectoryEntry Method Signature + | ++ POIFSFileSystem Method Signature + | +
| createDocument(String name, InputStream stream) | +createDocument(InputStream stream, String name) | +
| createDocument(String name, int size, POIFSWriterListener writer) | +createDocument(String name, int size, POIFSWriterListener writer) | +
| createDirectory(String name) | +createDirectory(String name) | +
It is possible to modify an existing POIFS file system, whether it's one your application has loaded into + memory, or one which you are creating on the fly. +
+Removing a document is simple: you get the Entry corresponding to the document and call
+ its delete() method. This is a boolean method, but should always return
+ true, indicating that the operation succeeded.
+
Removing a directory is also simple: you get the Entry corresponding to the directory
+ and call its delete() method. This is a boolean method, but, unlike deleting a
+ document, may not always return true, indicating that the operation succeeded. Here are
+ the reasons why the operation may fail:
+
isEmpty() on its
+ DirectoryEntry; is the return value false?)
+ There are two ways available to change the contents of an existing file within a POIFS file system.
+ One is using a DocumentOutputStream, the other is with
+ POIFSDocument.replaceContents
+
If you have available to you an InputStream to read the new File contents from, then the
+ easiest way is via
+ POIFSDocument.replaceContents. You would do something like:
+
Alternately, if you either have a byte array, or you need to write as you go along, then the
+ OutputStream interface provided by
+ DocumentOutputStream
+ will likely be a better bet. Your code would want to look somewhat like:
+
For an example of an in-place change to one stream within a file, you can see the example + + org/apache/poi/hpsf/examples/ModifyDocumentSummaryInformation.java + +
+Regardless of whether the file is a directory or a document, it can be renamed, with one exception -
+ the root directory has a special name that is expected by the components of a major software
+ vendor's office suite, and the POIFS API will not let that name be changed. Renaming is done by
+ acquiring the file's corresponding Entry instance and calling its renameTo method,
+ passing in the new name.
+
Like delete, renameTo returns true if the operation succeeded,
+ otherwise false. Reasons for failure include these:
+
POIFS is a pure Java implementation of the OLE 2 Compound + Document format.
+By definition, all APIs developed by the POI project are + based somehow on the POIFS API.
+A common confusion is on just what POIFS buys you or what OLE + 2 Compound Document format is exactly. POIFS does not buy you + DOC, or XLS, but is necessary to generate or read DOC or XLS + files. You see, all file formats based on the OLE 2 Compound + Document Format have a common structure. The OLE 2 Compound + Document Format is essentially a convoluted archive + format. Think of POIFS as a "zip" library. Once you can get + the data in a zip file you still need to interpret the + data. As a general rule, while all of our formats use + POIFS, most of them attempt to abstract you from it. There + are some circumstances where this is not possible, but as a + general rule this is true.
+If you're an end user type just looking to generate XLS + files, then you'd be looking for HSSF not POIFS; however, if + you have legacy code that uses MFC property sets, POIFS is + for you! Regardless, you may or may not need to know how to + use POIFS but ultimately if you use technologies that come + from the POI project, you're using POIFS underneath. Perhaps + we should have a branding campaign "POIFS Inside!". ;-)
+ +| Primary Actor: | +POIFS client | +
| Scope: | +POIFS | +
| Level: | +Summary | +
| Stakeholders and Interests: | +
+ POIFS client- wants to read content of file
+ system + POIFS - understands POIFS file system + |
+
| Precondition: | +None | +
| Minimal Guarantee: | +None | +
| Main Success Guarantee: | +
+ 1. POIFS client requests POIFS to read a POIFS file
+ system, providing an
+ InputStream
+ containing POIFS file system in question.+ 2. POIFS reads from the + InputStream in
+ 512 byte blocks.+ 3. POIFS verifies that the first block begins with + the well known signature + ( + 0xE11AB1A1E011CFD0)+ 4. POIFS reads the Block Allocation Table from the + first block and, if necessary, from the XBAT + blocks. + 5. POIFS obtains the start block of the Property + Table and reads the Property Table (use case 9, + read file) + 6. POIFS reads the individual entries in the Property + Table + 7. POIFS obtains the start block of the Small Block + Allocation Table and reads the Small Block + Allocation Table (use case 9, read file) + 8. POIFS obtains the start block of the Small Block + store from the first entry in the Property Table + and reads the Small Block Array (use case 9, read + file) + |
+
| Extensions: | +
+ 2a. If the last block read is not a 512 byte
+ block, the
+ InputStream is not that of
+ a POIFS file system, and POIFS throws an
+ appropriate exception.
+ + 3a. If the signature is incorrect, the + InputStream is not that of a POIFS
+ file system, and POIFS throws an appropriate
+ exception.+ |
+
| Primary Actor: | +POIFS client | +
|---|---|
| Scope: | +POIFS | +
| Level: | +Summary | +
| Stakeholders and Interests: | +
+ POIFS client- wants to write file system out. + POIFS - knows how to write file system out. + |
+
| Precondition: | +
+ File system has been read (use case 1, read
+ existing file system) and subsequently modified
+ (use case 4, replace file in file system; use case
+ 5, delete file from file system; or use case 6,
+ write new file to file system; in any
+ combination)
+ or + File system has been created (use case 3, create + new file system) + |
+
| Minimal Guarantee: | +None | +
| Main Success Guarantee: | +
+ 1. POIFS client provides an
+ OutputStream
+ to write the file system to.
+ + 2. POIFS gets the sizes of the Property Table and + each file in the file system. + 3. If any files in the file system requires storage + in a Small Block Array, POIFS creates a Small + Block Array of sufficient size to hold all of the + small files. + 4. POIFS calculates the number of big blocks needed + to hold all of the large files, the Property + Table, and, if necessary, the Small Block Array + and the Small Block Allocation Table. + 5. POIFS creates a set of big blocks sufficient to + store the Block Allocation Table + 6. POIFS creates and writes the header block + 7. POIFS writes out the XBAT blocks, if needed. + 8. POIFS writes out the Small Block Array, if + needed + 9. POIFS writes out the Small Block Allocation Table, + if needed + 10. POIFS writes out the Property Table + 11. POIFS writes out the large files, if needed + 12. POIFS closes the OutputStream.
+ |
+
| Extensions: | +
+ 6a. Exceptions writing to the
+ OutputStream will be propagated back
+ to the POIFS client.
+ + 7a. Exceptions writing to the + OutputStream will be propagated back
+ to the POIFS client.
+ + 8a. Exceptions writing to the + OutputStream will be propagated back
+ to the POIFS client.
+ + 9a. Exceptions writing to the + OutputStream will be propagated back
+ to the POIFS client.
+ + 10a. Exceptions writing to the + OutputStream will be propagated back
+ to the POIFS client.
+ + 11a. Exceptions writing to the + OutputStream will be propagated back
+ to the POIFS client.
+ + 12a. Exceptions closing the + OutputStream will be propagated back
+ to the POIFS client.
+ + |
+
| Primary Actor: | +POIFS client | +
|---|---|
| Scope: | +POIFS | +
| Level: | +Summary | +
| Stakeholders and Interests: | +
+ POIFS client- wants to create a new file
+ system + POIFS - knows how to create a new file system + |
+
| Precondition: | +None | +
| Minimal Guarantee: | +None | +
| Main Success Guarantee: | ++ POIFS creates an empty Property Table. + | +
| Extensions: | +None | +
| Primary Actor: | +POIFS client | +
| Scope: | +POIFS | +
| Level: | +Summary | +
| Stakeholders and Interests: | +
+ 1. POIFS client- wants to replace an existing file in
+ the file system + 2. POIFS - knows how to manage the file system + |
+
| Precondition: | +
+ Either
+ + The file system has been read (use case 1, read + existing file system) and a file has been + extracted from the file system (use case 7, read + existing file from file system) + or + The file system has been created (use case 3, + create new file system) and a file has been + written to the file system (use case 6, write new + file to file system) + |
+
| Minimal Guarantee: | +None | +
| Main Success Guarantee: | +
+ 1. POIFS discards storage of the existing file. + 2. POIFS updates the existing file's entry in the + Property Table + 3. POIFS stores the new file's data + |
+
| Extensions: | ++ 1a. POIFS throws an exception if the file does not + exist. + | +
| Primary Actor: | +POIFS client | +
| Scope: | +POIFS | +
| Level: | +Summary | +
| Stakeholders and Interests: | +
+ * POIFS client- wants to remove a file from a file
+ system + * POIFS - knows how to manage the file system + |
+
| Precondition: | +
+ Either + The file system has been read (use case 1, read + existing file system) and a file has been + extracted from the file system (use case 7, read + existing file from file system) + + or + + The file system has been created (use case 3, + create new file system) and a file has been + written to the file system (use case 6, write new + file to file system) + |
+
| Minimal Guarantee: | +None | +
| Main Success Guarantee: | +
+ 1. POIFS discards the specified file's storage. + 2. POIFS discards the file's Property Table + entry. + |
+
| Extensions: | ++ 1a. POIFS throws an exception if the file does not + exist. + | +
| Primary Actor: | +POIFS client | +
| Scope: | +POIFS | +
| Level: | +Summary | +
| Stakeholders and Interests: | +
+ * POIFS client- wants to add a new file to the file
+ system + * POIFS - knows how to manage the file system + |
+
| Precondition: | +The specified file does not yet exist in the file + system | +
| Minimal Guarantee: | +None | +
| Main Success Guarantee: | +
+ 1. The POIFS client provides a file name + 2. POIFS creates a new Property Table entry for the + new file + 3. POIFS provides the POIFS client with an + OutputStream to write to.+ 4. The POIFS client writes data to the provided + OutputStream.+ 5. The POIFS client closes the provided + OutputStream+ 6. POIFS updates the Property Table entry with the + new file's size + |
+
| Extensions: | +
+ 1a. POIFS throws an exception if a file with the
+ specified name already exists in the file
+ system. + 1b. POIFS throws an exception if the file name is + too long. The limit on file name length is 31 + characters. + |
+
| Primary Actor: | +POIFS client | +
| Scope: | +POIFS | +
| Level: | +Summary | +
| Stakeholders and Interests: | +
+ * POIFS client- wants to read a file from the file
+ system + * POIFS - knows how to manage the file system + |
+
| Precondition: | +
+ * The file system has been read (use case 1, read
+ existing file system) or has been created and
+ written to (use case 3, create new file system;
+ use case 6, write new file to file system). + * The specified file exists in the file system. + |
+
| Minimal Guarantee: | +None | +
| Main Success Guarantee: | +
+ * The POIFS client provides the name of a file to be read + * POIFS provides an InputStream to read from. + * The POIFS client reads from the InputStream.+ * The POIFS client closes the InputStream.
+ |
+
| Extensions: | +1a. POIFS throws an exception if no file with the + specified name exists. | +
| Primary Actor: | +POIFS client | +
| Scope: | +POIFS | +
| Level: | +Summary | +
| Stakeholders and Interests: | +
+ * POIFS client- wants to know what files exist in
+ the file system + * POIFS - knows how to manage the file system + |
+
| Precondition: | +The file system has been read (use case 1, read + existing file system) or created (use case 3, create + new file system) | +
| Minimal Guarantee: | +None | +
| Main Success Guarantee: | +
+ 1. The POIFS client requests the file system
+ directory.
+ 2. POIFS returns an Iterator. The
+ Iterator will not include the root
+ entry in the Property Table, and may be an
+ Iterator over an empty
+ Collection.
+ |
+
| Extensions: | +None | +
| Primary Actor: | +POIFS | +
| Scope: | +POIFS | +
| Level: | +Summary | +
| Stakeholders and Interests: | ++ POIFS - POIFS needs to read a file, or something + resembling a file (i.e., the Property Table, the + Small Block Array, or the Small Block Allocation + Table) + | +
| Precondition: | +None | +
| Minimal Guarantee: | +None | +
| Main Success Guarantee: | +
+ 1. POIFS begins with a start block, a file size, and
+ a flag indicating whether to use the Big Block
+ Allocation Table or the Small Block Allocation
+ Table + 2. POIFS returns an InputStream.+ 3. Reads from the InputStream are
+ performed by walking the specified Block
+ Allocation Table and reading the blocks
+ indicated.+ 4. POIFS closes the InputStream when
+ finished reading the file, or its client wants to
+ close the InputStream.
+ |
+
| Extensions: | +3a. An exception will be thrown if the specified Block + Allocation Table is corrupt, as evidenced by an index + pointing to a non-existent block, or by a chain + extending past the known size of the file. | +
| Primary Actor: | +POIFS client | +
| Scope: | +POIFS | +
| Level: | +Summary | +
| Stakeholders and Interests: | +
+ * POIFS client- wants to rename an existing file in
+ the file system. + * POIFS - knows how to manage the file system. + |
+
| Precondition: | +
      * The file system has been read (use case 1, read
      existing file system) or has been created and
      written to (use case 3, create new file system;
      use case 6, write new file to file system).<br>
      * The specified file exists in the file system.<br>
      * The new name for the file does not duplicate
      another file in the file system.<br>
    </td>
</tr>
<tr>
+
| Minimal Guarantee: | +None | +
| Main Success Guarantee: | ++ 1. POIFS updates the Property Table entry for the + specified file with its new name. + | +
| Extensions: | +
+ * 1a. If the old file name is not in the file
+ system, POIFS throws an exception. + * 1b. If the new file name already exists in the + file system, POIFS throws an exception. + * 1c. If the new file name is too long (the limit is + 31 characters), POIFS throws an exception. + |
+
+ The following code demonstrates how to iterate over shapes for each slide. +
+ java.awt.headless=true
+ (either via -Djava.awt.headless=true startup parameter or via System.setProperty("java.awt.headless", "true")).
+ + When you add a shape, you usually specify the dimensions of the shape and the position + of the upper left corner of the bounding box for the shape relative to the upper left + corner of the slide. Distances in the drawing layer are measured in points (72 points = 1 inch). +
++ Currently, HSLF API supports the following types of pictures: +
++ Below is the equivalent code in PowerPoint VBA: +
++ HSLF provides a way to export slides into images. You can capture slides into java.awt.Graphics2D object (or any other) + and serialize it into a PNG or JPEG format. Please note, although HSLF attempts to render slides as close to PowerPoint as possible, + the output may look differently from PowerPoint due to the following reasons: +
++ Current Limitations: +
+HSLF is the POI Project's pure Java implementation of the Powerpoint '97(-2007) file format.
+HSLF provides a way to read, create or modify PowerPoint presentations. In particular, it provides: +
+The quick guide documentation provides + information on using this API. Comments and fixes gratefully accepted on the POI + dev mailing lists.
++ XSLF is the POI Project's pure Java implementation of the PowerPoint 2007 OOXML (.xlsx) file format. + Whilst HSLF and XSLF provide similar features, there is not a common interface across the two of them at this time. +
++ Please note that XSLF is still in early development and is a subject to incompatible changes in future. +
++ A quick guide is available in the XSLF Cookbook +
++ PowerPoint documents are made up of a tree of records. A record may + contain either other records (in which case it is a Container), + or data (in which case it's an Atom). A record can't hold both. +
++ PowerPoint documents don't have one overall container record. Instead, + there are a number of different container records to be found at + the top level. +
++ Any numbers or strings stored in the records are always stored in + Little Endian format (least important bytes first). This is the case + no matter what platform the file was written on - be that a + Little Endian or a Big Endian system. +
++ PowerPoint may have Escher (DDF) records embedded in it. These + are always held as the children of a PPDrawing record (record + type 1036). Escher records have the same format as PowerPoint + records. +
++ All records, be they containers or atoms, have the same standard + 8 byte header. It is: +
++ If the first byte of the header, BINARY_AND with 0x0f, is 0x0f, + then the record is a container. Otherwise, it's an atom. The rest + of the first two bytes are used to store the "options" for the + record. Most commonly, this is used to indicate the version of + the record, but the exact usage is record specific. +
++ The record type is a little endian number, which tells you what + kind of record you're dealing with. Each different kind of record + has its own value that gets stored here. PowerPoint records have + a type that's normally less than 6000 (decimal). Escher records + normally have a type between 0xF000 and 0xF1FF. +
++ The record length is another little endian number. For an atom, + it's the size of the data part of the record, i.e. the length + of the record less its 8 byte record header. For a + container, it's the size of all the records that are children of + this record. That means that the size of a container record is the + length, plus 8 bytes for its record header. +
+aka Records that care about the byte level position of other records
++ A small number of records contain byte level position offsets to other + records. If you change the position of any records in the file, then + there's a good chance that you will need to update some of these + special records. +
++ First up, CurrentUserAtom. This is actually stored in a different + OLE2 (POIFS) stream to the main PowerPoint document. It contains + a few bits of information on who lasted edited the file. Most + importantly, at byte 8 of its contents, it stores (as a 32 bit + little endian number) the offset in the main stream to the most + recent UserEditAtom. +
++ The UserEditAtom contains two byte level offsets (again as 32 bit + little endian numbers). At byte 12 is the offset to the + PersistPtrIncrementalBlock associated with this UserEditAtom + (each UserEditAtom has one and only one PersistPtrIncrementalBlock). + At byte 8, there's the offset to the previous UserEditAtom. If this + is 0, then you're at the first one. +
++ Every time you do a non full save in PowerPoint, it tacks on another + UserEditAtom and another PersistPtrIncrementalBlock. The + CurrentUserAtom is updated to point to this new UserEditAtom, and the + new UserEditAtom points back to the previous UserEditAtom. You then + end up with a chain, starting from the CurrentUserAtom, linking + back through all the UserEditAtoms, until you reach the first one + from a full save. +
++ The PersistPtrIncrementalBlock contains byte offsets to all the + Slides, Notes, Documents and MasterSlides in the file. The first + PersistPtrIncrementalBlock will point to all the ones that + were present the first time the file was saved. Subsequent + PersistPtrIncrementalBlocks will contain pointers to all the ones + that were changed in that edit. To find the offset to a given + sheet in the latest version, then start with the most recent + PersistPtrIncrementalBlock. If this knows about the sheet, use the + offset it has. If it doesn't, then work back through older + PersistPtrIncrementalBlocks until you find one which does, and + use that. +
++ Each PersistPtrIncrementalBlock can contain a number of entries + blocks. Each block holds information on a sequence of sheets. + Each block starts with a 32 bit little endian integer. Once read + into memory, the lower 20 bits contain the starting number for the + sequence of sheets to be described. The higher 12 bits contain + the count of the number of sheets described. Following that is + one 32 bit little endian integer for each sheet in the sequence, + the value being the offset to that sheet. If there is any data + left after parsing a block, then it corresponds to the next block. +
++ There are quite a number of records that affect the styling + of text, and a smaller number that are responsible for the + styling of paragraphs. +
++ By default, a given set of text will inherit paragraph and text + stylings from the appropriate master sheet. If anything differs + from the master sheet, then appropriate styling records will + follow the text record. +
++ (We don't currently know enough about master sheet styling + to write about it) +
++ Normally, powerpoint will have one text record (TextBytesAtom + or TextCharsAtom) for every paragraph, with a preceding + TextHeaderAtom to describe what sort of paragraph it is. + If any of the stylings differ from the master's, then a + StyleTextPropAtom will follow the text record. This contains + the paragraph style information, and the styling information + for each section of the text which has a different style. + (More on StyleTextPropAtom later) +
++ For every font used, a FontEntityAtom must exist for that font. + The FontEntityAtoms live inside a FontCollection record, and + there's one of those inside Environment record inside the + Document record. (More on Fonts to be discovered) +
++ If the text or paragraph stylings for a given text record + differ from those of the appropriate master, then there will + be one of these records. +
++ This record is made up of two lists of lists. Firstly, + there's a list of paragraph stylings - each made up of the + number of characters it applies to, followed by the matching + styling elements. Following that is the equivalent for + character stylings. +
++ Each styling list (in either list) starts with the number + of characters it applies to, stored in a 2 byte little + endian number. If it is a paragraph styling, it will be + followed by a 2 byte number (of unknown use). After this is + a four byte number, which is a mask indicating which stylings + will follow. You then have an entry for each of the stylings + indicated in the mask. Finally, you move onto the next set + of stylings. +
++ Each styling has a specific mask flag to indicate its + presence. (The list may be found towards the top of + org.apache.poi.hslf.record.StyleTextPropAtom.java, and is + too long to sensibly include here). The styling entries + occur in the order of their mask values (so one with mask + 1 will come first, followed by the next highest mask value). + Depending on the styling, each entry is made up of either a 2 byte + or a 4 byte numeric value. The meaning of the value will + depend on the styling (eg for font.size, it is the font + size in points). +
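+ The reading order described above (character count, optional unknown 2 byte number, mask, then entries in ascending mask order) can be sketched as follows. The styling table in this example is invented purely for illustration; the real names, mask values and widths are the ones listed in StyleTextPropAtom, and the class name StyleRunSketch is hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical reader for one styling list. The three stylings below are
// made-up placeholders, NOT the real table from StyleTextPropAtom.
final class StyleRunSketch {
    static final String[] NAMES  = {"example.flags", "example.size", "example.colour"};
    static final int[]    MASKS  = {0x0001, 0x0002, 0x0004}; // ascending mask order
    static final int[]    WIDTHS = {2, 2, 4};                // value width in bytes

    static Map<String, Integer> readRun(ByteBuffer buf, boolean paragraph) {
        buf.order(ByteOrder.LITTLE_ENDIAN);
        Map<String, Integer> run = new LinkedHashMap<>();
        run.put("charCount", buf.getShort() & 0xFFFF); // characters covered
        if (paragraph) {
            buf.getShort(); // 2 byte number of unknown use, skipped
        }
        int mask = buf.getInt(); // which stylings follow
        for (int i = 0; i < MASKS.length; i++) {
            if ((mask & MASKS[i]) != 0) {
                int value = (WIDTHS[i] == 2) ? (buf.getShort() & 0xFFFF) : buf.getInt();
                run.put(NAMES[i], value);
            }
        }
        return run;
    }
}
```

+ Because the table is walked in ascending mask order, the entries are consumed in the same order in which they are stored.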
++ Some stylings are actually mask stylings. For these, the + value will be a 4 byte number. This is then processed as + a mask, to indicate a number of different sub-stylings. + The styling for bold/italic/underline is one such example. +
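+ Treating such a value as a mask is ordinary bit testing. In the sketch below the bit positions for bold, italic and underline are assumptions chosen for the example and should be checked against the HSLF source before being relied on; the class name is hypothetical.

```java
// Hypothetical helper; the bit assignments are assumptions for illustration.
final class CharFlagsSketch {
    static final int BOLD      = 1 << 0; // assumed bit position
    static final int ITALIC    = 1 << 1; // assumed bit position
    static final int UNDERLINE = 1 << 2; // assumed bit position

    /** True if the given sub-styling is switched on in the mask value. */
    static boolean isSet(int value, int flag) {
        return (value & flag) != 0;
    }

    /** Returns the mask value with one sub-styling switched on or off. */
    static int with(int value, int flag, boolean on) {
        return on ? (value | flag) : (value & ~flag);
    }

    public static void main(String[] args) {
        int value = with(with(0, BOLD, true), UNDERLINE, true);
        System.out.println("bold=" + isSet(value, BOLD)
                + " italic=" + isSet(value, ITALIC)
                + " underline=" + isSet(value, UNDERLINE));
    }
}
```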
++ PowerPoint stores information about the fonts used in FontEntityAtoms, + which live inside Document.Environment.FontCollection. For every different + font used, a FontEntityAtom must exist for that font. There is always at + least one FontEntityAtom in Document.Environment.FontCollection, + which describes the default font. +
++ The instance field of the record header contains the zero based index of the + font. Font index entries in StyleTextPropAtoms will refer to their required + font via this index. +
++ The length of FontEntityAtoms is always 68 bytes. The first 64 bytes of + it hold the typeface name of the font to be used. This is stored as + a null-terminated string, and encoded as little endian unicode. (The + length of the string must not exceed 32 characters including the null + termination, so the typeface name cannot exceed 31 characters). +
+ ++ After the typeface name there are 4 bytes of bitmask flags. The details of these + can be found in the Windows API, under the LOGFONT structure. + The 65th byte is the output precision, which defines how closely the system chosen + font must match the requested font, in terms of height, width, pitch etc. + The 66th byte is the clipping precision, which defines how to clip characters + that occur partly outside the clipping region. + The 67th byte is the output quality, which defines how closely the system + must match the logical font's attributes to those of the physical font used. + The 68th (and final) byte is the pitch and family, which is used by the + system when matching fonts. +
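+ As a rough illustration of the 68 byte layout described above, the following sketch decodes a FontEntityAtom body using only the JDK. The class name and field names are hypothetical, and the four trailing bytes are interpreted exactly as this page describes them.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical decoder for the 68 byte body of a FontEntityAtom.
final class FontEntitySketch {
    final String typeface;
    final int outputPrecision; // 65th byte
    final int clipPrecision;   // 66th byte
    final int quality;         // 67th byte
    final int pitchAndFamily;  // 68th byte

    FontEntitySketch(byte[] body) {
        if (body.length != 68) {
            throw new IllegalArgumentException("expected 68 bytes, got " + body.length);
        }
        // The name is null-terminated little endian unicode: find the first
        // 16 bit zero within the 64 byte name area.
        int end = 0;
        while (end < 64 && (body[end] != 0 || body[end + 1] != 0)) {
            end += 2;
        }
        typeface = new String(body, 0, end, StandardCharsets.UTF_16LE);
        outputPrecision = body[64] & 0xFF;
        clipPrecision   = body[65] & 0xFF;
        quality         = body[66] & 0xFF;
        pitchAndFamily  = body[67] & 0xFF;
    }
}
```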
++ For rendering slideshows (HSLF/XSLF), WMF, EMF and EMF+ pictures, POI provides a utility class + + PPTX2PNG: +
+ ++ Download the current nightly + and for SVG/PDF the additional dependencies.
+Execute the java command (Unix paths need to be replaced for Windows; use "-charset" for non-western WMF/EMFs):
++ If you want to use the renderer on the module path (JPMS) there are currently a few more steps necessary: +
+-Dsun.java2d.renderer=sun.java2d.marlin.MarlinRenderingEngine or for older jdk builds,
+ preload the marlin jar.
+ For file system access, you need to first save your slideshow/WMF/EMF/EMF+ to disk and then call
+ PPTX2PNG.main()
+ with the corresponding parameters.
+
For stdin access, you need to redirect System.in beforehand:
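+ Redirecting System.in can be sketched with the JDK alone. The helper below is hypothetical (it is not part of POI); the renderer call itself is left as a comment because its arguments depend on your setup.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;

// Hypothetical helper: temporarily replaces System.in with in-memory data.
final class StdinRedirectSketch {

    /** Swaps System.in for the given bytes, runs the action, then restores it. */
    static void withStdin(byte[] data, Runnable action) {
        InputStream original = System.in;
        try {
            System.setIn(new ByteArrayInputStream(data));
            // The action is where the renderer (for example PPTX2PNG.main
            // with your own arguments) would be invoked.
            action.run();
        } finally {
            System.setIn(original); // always put the real stdin back
        }
    }
}
```

+ Restoring the original stream in a finally block keeps later console input working even if rendering fails.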
+
For basic text extraction, make use of
+ org.apache.poi.sl.extractor.SlideShowExtractor.
+ It accepts a slideshow which can be created from a file or stream via org.apache.poi.sl.usermodel.SlideShowFactory.
+ The getText() method can be used to get the text from the slides.
+
To get specific bits of text, first create an org.apache.poi.hslf.usermodel.HSLFSlideShow
+(from an org.apache.poi.hslf.usermodel.HSLFSlideShowImpl, which accepts a file or an input
+stream). Use getSlides() and getNotes() to get the slides and notes.
+These can be queried to get their page ID (though they should be returned
+in the right order).
You can then call getTextParagraphs() on these, to get
+their blocks of text. (A list of HSLFTextParagraph normally holds all the text in a
+given area of the page, eg in the title bar, or in a box).
+From the HSLFTextParagraph, you can extract the text, and check
+what type of text it is (eg Body, Title). You can also call
+getTextRuns(), which will return the
+HSLFTextRuns that make up the TextParagraph. A
+HSLFTextRun is a text fragment, having the same character formatting.
+The paragraph formatting is defined in the parent HSLFTextParagraph.
+
If speed is the most important thing for you, you don't care
+ about getting duplicate blocks of text, you don't care about
+ getting text from master sheets, and you don't care about getting
+ old text, then
+ org.apache.poi.hslf.extractor.QuickButCruddyTextExtractor
+ might be of use.
QuickButCruddyTextExtractor doesn't use the normal record + parsing code, instead it uses a tree structure blind search + method to get all text holding records. You will get all the text, + including lots of text you normally wouldn't ever want. However, + you will get it back very very fast!
+There are two ways of getting the text back.
+ getTextAsString() will return a single string with all
+ the text in it. getTextAsVector() will return a
+ vector of strings, one for each text record found in the file.
+
It is possible to change the text via
+ HSLFTextParagraph.setText(List<HSLFTextParagraph>,String) or
+ HSLFTextRun.setText(String). It is possible to add additional TextRuns
+ with HSLFTextParagraph.appendText(List<HSLFTextParagraph>,String,boolean)
+ or HSLFTextParagraph.addTextRun(HSLFTextRun)
When calling HSLFTextParagraph.setText(List<HSLFTextParagraph>,String), all
+ the text will end up with the same formatting. When calling
+ HSLFTextRun.setText(String), the text will retain
+ the old formatting of that HSLFTextRun.
+
You may add new slides by calling
+ HSLFSlideShow.createSlide(), which will add a new slide
+ to the end of the SlideShow. It is possible to re-order slides with HSLFSlideShow.reorderSlide(...).
+
org.apache.poi.hslf.usermodel.HSLFSlideShowImpl
+ Handles reading in and writing out files. Calls
+ org.apache.poi.hslf.record.Record to build a tree
+ of all the records in the file, which it allows access to.
+ org.apache.poi.hslf.record.Record
+ Base class of all records. Also provides the main record generation
+ code, which will build up a tree of records for a file.
+ org.apache.poi.hslf.usermodel.HSLFSlideShow
+ Builds up model entries from the records, and presents a user facing
+ view of the file
+ org.apache.poi.hslf.usermodel.HSLFSlide
+ A user facing view of a Slide in a slideshow. Allows you to get at the
+ Text of the slide, and at any drawing objects on it.
+ org.apache.poi.hslf.usermodel.HSLFTextParagraph
+ A list of HSLFTextParagraphs holds all the text in a given area of the Slide, and will
+ contain one or more HSLFTextRuns.
+ org.apache.poi.hslf.usermodel.HSLFTextRun
+ Holds a run of text, all having the same character stylings. It is possible to modify text, and/or text stylings.
+ org.apache.poi.sl.extractor.SlideShowExtractor
+ Uses the model code to allow extraction of text from files
+ org.apache.poi.hslf.extractor.QuickButCruddyTextExtractor
+ Uses the record code to extract all the text from files very fast,
+ but including deleted text (and other bits of Crud).
+ + This page offers a short introduction into the XSLF API. More examples can be found in the + XSLF Examples + in the POI SVN repository. +
++ The following code creates a new .pptx slide show and adds a blank slide to it: +
++ The following code demonstrates how to iterate over shapes for each slide. +
+How it works:
++ The XSLFSlide object implements a draw(Graphics2D graphics) method that recursively paints all shapes + in the slide into the supplied graphics canvas: +
++ where graphics is an object implementing java.awt.Graphics2D. In PPTX2PNG the graphics canvas is derived from + java.awt.image.BufferedImage, i.e. the destination is an image in memory, but in the general case you can pass + any compliant implementation of java.awt.Graphics2D. + Find more information in the designated render page, e.g. on how to render SVG images. +
++ This document is intended as a work in progress for describing + our current understanding of how the chart records are + written to produce a valid chart. +
++ The following records detail the records written for a + 'simple' bar chart. +
++ The next section breaks those records down into an easier + to read format: +
++ Just a quick note on some of the unknown records: +
++ It is currently suspected that many of those records could be + left out when generating a bar chart from scratch. The way + we will be proceeding with this is to write code that generates + most of these records and then start removing them to see + how this affects the chart in Excel. +
+Wherever possible, we have tried to ensure that you can use your + existing POI code with POI 3.5 without requiring any changes. However, + Java doesn't always make that easy, and unfortunately there are a + few changes that may be required for some users.
+Annoyingly, java will not let you access a static inner class via + a child of the parent one. So, all references to + org.apache.poi.hssf.usermodel.HSSFFormulaEvaluator.CellValue + will need to be changed to + org.apache.poi.ss.usermodel.FormulaEvaluator.CellValue +
+Annoyingly, java will not let you access a static inner class via + a child of the parent one. So, all references to + org.apache.poi.hssf.usermodel.HSSFRow.MissingCellPolicy + will need to be changed to + org.apache.poi.ss.usermodel.Row.MissingCellPolicy +
+Previously, record level errors within DDF would throw an + exception from the hssf class hierarchy. Now, record level errors + within DDF will throw a more general RecordFormatException, + org.apache.poi.util.RecordFormatException
+In addition, org.apache.poi.hssf.record.RecordFormatException + has been changed to inherit from the new + org.apache.poi.util.RecordFormatException, so you may + wish to change catches of the hssf version to the new util version. +
+If you have existing HSSF usermodel code that works just + fine, and you don't want to use the new OOXML XSSF support, + then you probably don't need to. Your existing HSSF only code + will continue to work just fine.
+However, if you want to be able to work with both HSSF for + your .xls files, and also XSSF for .xlsx files, then you will + need to make some slight tweaks to your code.
+The new SS usermodel (org.apache.poi.ss.usermodel) is very + heavily based on the old HSSF usermodel + (org.apache.poi.hssf.usermodel). The main difference is that + the package name and class names have been tweaked to remove + HSSF from them. Otherwise, the new SS Usermodel interfaces + should provide the same functionality.
+Calling the empty HSSFWorkbook constructor remains the way to + create a new, empty Workbook object. To open an existing + Workbook, you should now call WorkbookFactory.create(inp).
+For all other cases when you would have called a + Usermodel constructor, such as 'new HSSFRichTextString()' or + 'new HSSFDataFormat', you should instead use a CreationHelper. + There's a method on the Workbook to get a CreationHelper, and + the CreationHelper will then handle constructing new objects + for you.
+For all other code, generally change a reference from + org.apache.poi.hssf.usermodel.HSSFFoo to a reference to + org.apache.poi.ss.usermodel.Foo. Method signatures should + otherwise remain the same, and it should all then work for + both XSSF and HSSF.
+
+
+
+ This section is intended for diagrams (UML/etc) that help + explain HSSF. +
++ Have more? Add a new "bug" to the bug database with [DOCUMENTATION] + prefacing the description and a link to the file on an http server + somewhere. If you don't have your own webserver, then you can email it + to (acoliver at apache dot org) provided it's < 5MB. Diagrams should be + in some format that can be read at least on Linux and Windows. Diagrams + that can be edited are preferable, but let's face it, there aren't too + many good affordable UML tools yet! And no they don't HAVE to be UML... + just useful. +
++ This document is for developers wishing to contribute to the + FormulaEvaluator API functionality. +
+
+ When evaluating workbooks you may encounter an org.apache.poi.ss.formula.eval.NotImplementedException
+ which indicates that a function is not (yet) supported by POI. Is there a workaround?
+ Yes, the POI framework makes it easy to add implementations of new functions. Prior to POI-3.8
+ you had to check out the source code from svn and make a custom build with your function implementation.
+ Since POI-3.8 you can register new functions at run-time.
+
+ Currently, contribution is desired for implementing the standard MS
+ Excel functions. Placeholder classes for these have been created,
+ contributors only need to insert implementation for the
+ individual evaluate() methods that do the actual evaluation.
+
+ Briefly, a formula string (along with the sheet and workbook that
+ form the context in which the formula is evaluated) is first parsed
+ into Reverse Polish Notation (RPN) tokens using the FormulaParser class.
+ (If you don't know what RPN tokens are, now is a good time to
+ read
+ Anthony Stone's description of RPN.)
+
+ RPN tokens are mapped to Eval classes. (The class hierarchy for the Evals
+ is best understood if you view it in a class diagram
+ viewer.) Depending on the type of RPN token (also called Ptgs
+ henceforth since that is what the FormulaParser calls the classes), a
+ specific type of Eval wrapper is constructed to wrap the RPN token and
+ is pushed on the stack, unless the Ptg is an OperationPtg. If it is an
+ OperationPtg, an OperationEval instance is created for the specific
+ type of OperationPtg. And depending on how many operands it takes,
+ that many Evals are popped off the stack and passed in an array to
+ the OperationEval instance's evaluate method which returns an Eval
+ of subtype ValueEval. Thus an operation in the formula is evaluated.
+
Eval is of subinterface ValueEval or OperationEval.
+ Operands are always ValueEvals, and operations are always OperationEvals.
+ OperationEval.evaluate(Eval[]) returns an Eval which is supposed
+ to be an instance of one of the implementations of
+ ValueEval. The ValueEval resulting from evaluate() is pushed on the
+ stack and the next RPN token is evaluated. This continues until
+ eventually there are no more RPN tokens, at which point, if the formula
+ string was correctly parsed, there should be just one Eval on the
+ stack — which contains the result of evaluating the formula.
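+ The stack discipline described above can be illustrated with a tiny stand-alone evaluator. This is not POI's evaluator: plain strings stand in for Ptgs, doubles stand in for ValueEvals, only the four binary arithmetic operators are handled, and the class name is hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical RPN evaluator; strings play the role of Ptgs and doubles the
// role of ValueEvals.
final class RpnSketch {

    static double evaluate(String... tokens) {
        Deque<Double> stack = new ArrayDeque<>();
        for (String token : tokens) {
            switch (token) {
                case "+": case "-": case "*": case "/": {
                    // An operation pops as many operands as it takes...
                    double right = stack.pop();
                    double left = stack.pop();
                    // ...and its result is pushed back as a value.
                    stack.push(apply(token, left, right));
                    break;
                }
                default:
                    // Operands are wrapped as values and pushed directly.
                    stack.push(Double.parseDouble(token));
            }
        }
        if (stack.size() != 1) {
            throw new IllegalStateException("malformed RPN, stack size " + stack.size());
        }
        return stack.pop(); // the single remaining value is the result
    }

    private static double apply(String op, double left, double right) {
        switch (op) {
            case "+": return left + right;
            case "-": return left - right;
            case "*": return left * right;
            default:  return left / right;
        }
    }

    public static void main(String[] args) {
        // (1 + 2) * 3 in RPN
        System.out.println(evaluate("1", "2", "+", "3", "*"));
    }
}
```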
+
+ Two special Ptgs, AreaPtg and RefPtg,
+ are handled a little differently, but the code should be self
+ explanatory for that. Very briefly, the cells included in AreaPtg and
+ RefPtg are examined and their values are populated in individual
+ ValueEval objects which are set into the implementations of
+ AreaEval and RefEval.
+
+ OperationEvals for the standard operators have been implemented and tested.
+
+ As of release 5.2.0, POI implements 202 built-in functions, + see Appendix A for the list of supported functions with an implementation. + You can programmatically list supported / unsupported functions using the following helper methods: +
++ If you need a function that POI doesn't currently support, you have two options. + You can create the function yourself, and have your program add it to POI at + run-time. Doing this will help you get the function you need as soon as possible. + The other option is to create the function yourself, and build it into the POI library, + possibly contributing the code to the POI project. Doing this will help you get the + function you need, but you'll have to build POI from source yourself. And if you + contribute the code, you'll help others who need the function in the future, because + it will already be supported in the next release of POI. The two options require + almost identical code, but the process of deploying the function is different. + If your function is a User Defined Function, you'll always take the run-time option, + as POI doesn't distribute UDFs. +
+
+ In the sections ahead, we'll implement the Excel SQRTPI() function, first
+ at run-time, and then we'll show how to change it to a library-based implementation.
+
+ All Excel formula function classes implement either the
+ org.apache.poi.ss.formula.functions.Function or the
+ org.apache.poi.ss.formula.functions.FreeRefFunction interface.
+ Function is a common interface for the functions defined in the Binary Excel File Format (BIFF8): these are "classic" Excel functions like SUM, COUNT, LOOKUP, etc.
+ FreeRefFunction is a common interface for the functions from the Excel Analysis ToolPak, for User Defined Functions that you create,
+ and for Excel built-in functions that have been defined since BIFF8 was defined.
+ In the future these two interfaces are expected to be unified into one, but for now you have to start your implementation from two slightly different roots.
+
+ You are about to implement a function and don't know which interface to start from: Function or FreeRefFunction.
+ You should use Function if the function is part of the Excel BIFF8
+ definition, and FreeRefFunction for a function that is part of the Excel Analysis ToolPak, was added to Excel after BIFF8, or that you are creating yourself.
+
+ You can check the list of Analysis ToolPak functions defined in org.apache.poi.ss.formula.atp.AnalysisToolPak.createFunctionsMap()
+ to see if the function is part of the Analysis ToolPak.
+ The list of BIFF8 functions is defined as a text file, in the
+ poi/src/main/resources/org/apache/poi/ss/formula/function/functionMetadata.txt file.
+
+ You can also use the following code to check which base class your function should implement, if it is not a User Defined function (UDFs must implement FreeRefFunction):
+
+ Here is the fun part: let's walk through the implementation of the Excel function SQRTPI(),
+ which POI does not currently support.
+
+ AnalysisToolPak.isATPFunction("SQRTPI") returns true, so this is an Analysis ToolPak function.
+ Thus the base interface must be FreeRefFunction. The same would be true if we were implementing
+ a UDF.
+
+ Because we're taking the run-time deployment option, we'll create this new function in a source
+ file in our own program. Our function will return an Eval that is either
+ its proper result, or an ErrorEval that describes the error. All that work
+ is done in the function's evaluate() method:
+
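+ The arithmetic at the heart of such an evaluate() method can be sketched without any POI types. This is not the POI implementation: the class is hypothetical, and an empty OptionalDouble stands in for the ErrorEval that a real function would return.

```java
import java.util.OptionalDouble;

// Hypothetical stand-alone sketch of the SQRTPI arithmetic: sqrt(number * pi).
final class SqrtPiSketch {

    /** Empty result = "an error value would be returned here". */
    static OptionalDouble sqrtPi(double number) {
        if (Double.isNaN(number) || number < 0) {
            return OptionalDouble.empty(); // no real square root exists
        }
        double result = Math.sqrt(number * Math.PI);
        if (Double.isInfinite(result)) {
            return OptionalDouble.empty(); // overflowed the double range
        }
        return OptionalDouble.of(result);
    }

    public static void main(String[] args) {
        System.out.println(sqrtPi(1.0));
        System.out.println(sqrtPi(-1.0));
    }
}
```

+ In a real FreeRefFunction the argument would first have to be resolved from the ValueEval array to a double, and each empty result above would become an appropriate ErrorEval.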
+ If our function had been one of the BIFF8 Excel built-ins, it would have been based on
+ the Function interface instead.
+ There are sub-interfaces of Function that make life easier when implementing numeric functions
+ or functions
+ with a small, fixed number of arguments:
+
+ org.apache.poi.ss.formula.functions.NumericFunction
+ org.apache.poi.ss.formula.functions.Fixed0ArgFunction
+ org.apache.poi.ss.formula.functions.Fixed1ArgFunction
+ org.apache.poi.ss.formula.functions.Fixed2ArgFunction
+ org.apache.poi.ss.formula.functions.Fixed3ArgFunction
+ org.apache.poi.ss.formula.functions.Fixed4ArgFunction
+ Since SQRTPI() takes exactly one argument, we would start our implementation from
+ Fixed1ArgFunction. The differences for a BIFF8 Fixed1ArgFunction
+ are pretty small:
+
+ Now when the implementation is ready we need to register it with the formula evaluator. + This is the same no matter which kind of function we're creating. We simply add the + following line to the program that is using POI: +
+
+ Voila! The formula evaluator now recognizes SQRTPI()!
+
+ If we choose instead to implement our function as part of the POI
+ library, the code is nearly identical. All POI functions
+ are part of one of two Java packages: org.apache.poi.ss.formula.functions
+ for BIFF8 Excel built-in functions, and org.apache.poi.ss.formula.atp
+ for Analysis ToolPak functions. The function still needs to implement the
+ appropriate base class, just as before. To implement our SQRTPI()
+ function in the POI library, we need to move the source code to
+ poi/src/main/java/org/apache/poi/ss/formula/atp/SqrtPi.java in
+ the POI source code, change the package statement, and add a
+ singleton instance:
+
+ If our function had been one of the BIFF8 Excel built-ins, we would instead have moved
+ the source code to
+ poi/src/main/java/org/apache/poi/ss/formula/functions/SqrtPi.java in
+ the POI source code, and changed the package statement to:
+
+ POI library functions are registered differently from run-time-deployed functions.
+ Again, the techniques differ for the two types of library functions (remembering
+ that POI never releases the third type, UDFs).
+ For our Analysis ToolPak function, we have to update the list of functions in
+ org.apache.poi.ss.formula.atp.AnalysisToolPak.createFunctionsMap():
+
+ If our function had been one of the BIFF8 Excel built-ins,
+ the registration instead would require updating an entry in the formula-function table,
+ poi/src/main/resources/org/apache/poi/ss/formula/function/functionMetadata.txt:
+
+ and also updating the list of function implementations in
+ org.apache.poi.ss.formula.eval.FunctionEval.produceFunctions():
+
++ Excel uses the IEEE Standard for Double Precision Floating Point numbers, + except in two cases where it does not adhere to IEEE 754: +
++ Be aware of these two cases when saving results of your scientific calculations in Excel: + “where are my Infinities and NaNs? They are gone!” +
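+ In practice this means a numeric result should be checked before it is handed back as a cell value. The sketch below is an assumption-laden illustration only: the class is hypothetical, and reporting "#NUM!" for every infinite or NaN result is a simplification chosen for the example.

```java
// Hypothetical guard: infinite and NaN doubles are reported as an error text
// instead of being stored as numbers. Using "#NUM!" for all of them is an
// assumption made for this sketch.
final class ExcelNumberGuard {

    static String toCellText(double result) {
        if (Double.isNaN(result) || Double.isInfinite(result)) {
            return "#NUM!";
        }
        return Double.toString(result);
    }

    public static void main(String[] args) {
        System.out.println(toCellText(Math.sqrt(-1.0))); // NaN in Java
        System.out.println(toCellText(1.0 / 0.0));       // Infinity in Java
        System.out.println(toCellText(2.5));
    }
}
```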
+
+ Automated testing of the implemented Function is easy.
+ The source code for this is in the file: org.apache.poi.hssf.record.formula.GenericFormulaTestCase.java.
+ This class has a reference to the test xls file (not a test xls, the test xls :) )
+ which may need to be changed for your environment. Once you do that, in the test xls,
+ locate the entry for the function that you have implemented and enter different tests
+ in a cell in the FORMULA row. Then copy the "value of" the formula that you entered in the
+ cell just below it (this is easily done in excel as:
+ [copy the formula cell] > [go to cell below] > Edit > Paste Special > Values > "ok").
+ You can enter multiple such formulas and paste their values in the cell below and the
+ test framework will automatically test if the formula evaluation matches the expected
+ value (Again, hard to put in words, so if you will, please take time to quickly look
+ at the code and the currently entered tests in the patch attachment "FormulaEvalTestData.xls"
+ file).
+
+ Functions supported by POI (as of v5.2.0 release) +
+The POI formula evaluation code enables you to calculate the result of + formulas in Excel sheets that were read in, or created, with POI. This document explains + how to use the API to evaluate your formulas. +
+The Excel file format (both .xls and .xlsx) stores a "cached" result for + every formula along with the formula itself. This means that when the file + is opened, it can be quickly displayed, without needing to spend a long + time calculating all of the formula results. It also means that when reading + a file through Apache POI, the result is quickly available to you too! +
+After making changes with Apache POI to either Formula Cells themselves, + or those that they depend on, you should normally perform a Formula + Evaluation to have these "cached" results updated. This is normally done + after all changes have been performed, but before you write the file out. + If you don't do this, there's a good chance that when you open the file in + Excel, until you go to the cell and hit enter or F9, you will either see + the old value or '#VALUE!' for the cell. (Sometimes Excel will notice + itself, and trigger a recalculation on load, but unless you know you are + using volatile functions it's generally best to trigger a recalculation + through POI) +
+The code currently provides implementations for all the arithmetic operators. + It also provides implementations for around 200 built-in + functions in Excel (202 as of release 5.2.0). The framework however makes it easy to add + implementations of new functions. See the Formula + evaluation development guide and javadocs + for details.
+Both HSSFWorkbook and XSSFWorkbook are supported, so you can + evaluate formulas on both .xls and .xlsx files.
+User-defined functions are supported, + but must be rewritten in Java and registered with the macro-enabled workbook in order to be evaluated. +
+The following code demonstrates how to use the FormulaEvaluator + in the context of other POI excel reading code. +
+There are several ways in which you can use the FormulaEvaluator API.
+ +This evaluates a given cell, and returns the new value, + without affecting the cell
+Thus using the retrieved value (of type + FormulaEvaluator.CellValue - a nested class) returned + by FormulaEvaluator is similar to using a Cell object + containing the value of the formula evaluation. CellValue is + a simple value object and does not maintain reference + to the original cell. +
+evaluateFormulaCell(Cell cell) + will check to see if the supplied cell is a formula cell. + If it isn't, then no changes will be made to it. If it is, + then the formula is evaluated. The value for the formula + is saved alongside it, to be displayed in excel. The + formula remains in the cell, just with a new value
+The return value of the method is the type of the + formula result, such as CellType.BOOLEAN
+evaluateInCell(Cell cell) will check to + see if the supplied cell is a formula cell. If it isn't, + then no changes will be made to it. If it is, then the + formula is evaluated, and the new value saved into the cell, + in place of the old formula.
+Alternately, if you know which of HSSF or XSSF you're working + with, then you can call the static + evaluateAllFormulaCells method on the appropriate + HSSFFormulaEvaluator or XSSFFormulaEvaluator class.
++ In certain cases you may want to force Excel to re-calculate formulas when the workbook is opened. + Consider the following example: +
++ Open Excel and create a new workbook. On the first sheet set A1=1, B1=1, C1=A1+B1. + Excel automatically calculates formulas and the value in C1 is 2. So far so good. +
++ Now modify the workbook with POI: +
++ Now open workbook2.xls in Excel and the value in C1 is still 2 while you expected 3. Wrong? No! + The point is that Excel caches previously calculated results and you need to trigger recalculation to update them. + It is not an issue when you are creating new workbooks from scratch, but important to remember when you are modifying + existing workbooks with formulas. This can be done in two ways: +
++ 1. Re-evaluate formulas with POI's FormulaEvaluator: +
++ 2. Delegate re-calculation to Excel. The application will perform a full recalculation when the workbook is opened: +
+It is possible for a formula in an Excel spreadsheet to + refer to a Named Range or Cell in a different workbook. + These cross-workbook references are normally called External + References. These are formulas which look something like:
+If you don't have access to these other workbooks, then you + should call + setIgnoreMissingWorkbooks(true) + to tell the Formula Evaluator to skip evaluating any external + references it can't look up.
+In order for POI to be able to evaluate external references, it + needs access to the workbooks in question. As these don't necessarily + have the same names on your system as in the workbook, you need to + give POI a map of external references to open workbooks, through + the + setupReferencedWorkbooks(java.util.Map<java.lang.String,FormulaEvaluator> workbooks) + method. You should normally do something like:
+POI is not perfect and you may stumble across formula evaluation problems (Java exceptions + or just different results) in your special use case. To support an easy detailed analysis, a special + logging of the full evaluation is provided.
+POI 5.1.0 and above uses Log4J 2.x as a logging framework. Try to set up a logging + configuration that lets you see the info and other log messages.
+Example use:
+The special Logger called "POI.FormulaEval" is used (useful if you use the CommonsLogger and a detailed logging configuration). + The log levels used are WARN and INFO (for detailed parameter info and results) - the levels are set this high to allow this + special logging without being disturbed by the bunch of DEBUG log entries from other classes.
+For versions before 3.13 final, no formula evaluation is possible with + SXSSF.
+If you are using POI 3.13 final or newer, formula evaluation is possible with SXSSF, + but with some caveats.
+The biggest restriction is that, since evaluating a cell needs that cell in memory + and any others it depends on, only pure-function formulas and formulas referencing + nearby cells can be evaluated with SXSSF. If a formula references a cell that hasn't + yet been written, or one which has already been flushed to disk, then it won't be + possible to evaluate it.
+Because of this, a call to wb.getCreationHelper().createFormulaEvaluator().evaluateAll(); + will very rarely work on SXSSF, as it's very rare that all the cells will be available + and in memory at any time! Instead, it is suggested to evaluate formula cells just + after writing them, or shortly after when cells they depend on are added. Just make + sure that all cells needing or needed for evaluation are inside the window.
+Apache POI comes with a number of examples that demonstrate how you + can use the POI API to create documents from "real life". + The examples below are based on common XSSF-HSSF interfaces so that you + can generate either *.xls or *.xlsx output just by setting a + command-line argument: +
+All sample source is available in SVN
+In addition, there are a handful of + HSSF only and + XSSF only examples as well. +
+ ++ The following examples are available: +
+ +The BusinessPlan + application creates a sample business plan with three phases, weekly iterations and time highlighting. Demonstrates advanced cell formatting + (number and date formats, alignments, fills, borders) and various settings for organizing data in a sheet (freezed panes, grouped rows). +
+The Calendar + demo creates a multi sheet calendar. Each month is on a separate sheet. +
+The LoanCalculator + demo creates a simple loan calculator. Demonstrates advanced usage of cell formulas and named ranges. +
+The Timesheet + demo creates a weekly timesheet with automatic calculation of total hours. Demonstrates advanced usage of cell formulas. +
+The ConditionalFormats + demo is a collection of short examples showing what you can do with Excel conditional formatting in POI: +
+The CalculateMortgage + example demonstrates a simple user-defined function to calculate + principal and interest.
+The CheckFunctionsSupported + example shows how to test what functions and formulas aren't + supported from a given file.
+The SettingExternalFunction + example demonstrates how to use externally provided (third-party) + formula add-ins.
+The UserDefinedFunctionExample + example demonstrates how to invoke a User Defined Function for a + given Workbook instance using POI's UDFFinder implementation.
+The AddDimensionedImage + example demonstrates how to add an image to a worksheet and set that + image's size to a specific number of millimetres irrespective of the + width of the columns or height of the rows.
+The AligningCells + example demonstrates how various alignment options work.
+The CellStyleDetails + example demonstrates how to read excel styles for cells.
+The LinkedDropDownLists + example demonstrates one technique that may be used to create linked + or dependent drop down lists.
+The SSPerformanceTest + example provides a way to create simple example files of varying + sizes, and to calculate how long they take. Useful for benchmarking + your system, and to also test if slow performance is due to Apache + POI itself or to your own code.
+The ToHtml + example shows how to display a spreadsheet in HTML using the classes for spreadsheet display. +
+The ToCSV + example demonstrates one way to convert an Excel spreadsheet into a CSV file. +
+All the HSSF-only examples can be found in + SVN
+All the XSSF-only examples can be found in + SVN
+ExcelAnt is a set of Ant tasks that make it possible to verify or test + a workbook without having to write Java code. Of course, the tasks themselves + are written in Java, but to use this framework you only need to know a little + bit about Ant.
+This document covers the basic usage and set up of ExcelAnt.
+This document will assume basic familiarity with Ant and Ant build files.
+To start with ExcelAnt, you'll need to have the POI 3.8 or higher jar files. If you test only .xls +workbooks then you need to have the following jars in your path:
+If you evaluate .xlsx workbooks then you need to add these:
+For example, if you have these jars in a lib/ dir in your project, your build.xml + might look like this:
+Next, you'll need to define the Ant tasks. There are several ways to use ExcelAnt:
+ ++ Where excelant.path refers to the classpath with POI jars. + Using this approach the provided extensions will live in the default namespace. Note that the default task/typenames (evaluate, test) may be too generic and should either be explicitly overridden or used with a namespace. +
+The simplest example of using ExcelAnt is to validate that POI is giving you back + the value you expect it to. Does this mean that POI is inaccurate? Hardly. There are cases + where POI is unable to evaluate cells for a variety of reasons. If you need to write code + to integrate a worksheet into an app, you may want to know that it's going to work before + you actually try to write that code. ExcelAnt helps with that.
+ +Consider the mortgage-calculation.xls + file found in the Examples (link broken / file is missing). This sheet is shown below:
+ +This sheet calculates the principal and interest payment for a mortgage based + on the amount of the loan, term and rate. To write a simple ExcelAnt test you + need to tell ExcelAnt about the file like this:
+This code sets up ExcelAnt to access the file defined in the ant property + xls.file. Then it creates a 'test' named 'checkValue'. Finally it tries + to evaluate cell B4 on the sheet named 'MortgageCalculator'. There are some assumptions + here that are worth explaining. For starters, ExcelAnt is focused on testing + numerically oriented sheets. The <evaluate> task is actually evaluating the + cell as a formula using a FormulaEvaluator instance from POI. Therefore it will fail + if you point it to a cell that doesn't contain a formula or try to test a plain old number.
+ +Having said all that, here is what the output looks like:
+ +So now we know that at a minimum POI can use our sheet to calculate the existing value. + This is an important point: in many cases sheets have dependencies, i.e., cells they reference. + As is often the case, these cells may have dependencies, which may have dependencies, etc. + The point is that sometimes a dependent cell may get adjusted by a macro or a function + and it may be that POI doesn't have the capabilities to do the same thing. This test + verifies that we can rely on POI to retrieve the default value, based on the stored values + of the sheet. Now we want to know if we can manipulate those dependencies and verify + the output.
+ +To verify that we can manipulate cell values, we need a way in ExcelAnt to set a value. + This is provided by the following task types:
+For the purposes of this example we'll use the <setDouble> task. Let's + start with a $240,000, 30 year loan at 11% (let's pretend it's like 1984). Here + is how we will set that up:
+ +Don't forget that we're verifying the behavior so you need to put all this + into the sheet. That is how I got the result of $2,285 and change. So save your + changes and run it; you should get the following:
+ +This is great, it's working! However, suppose you want to see a little more detail. The + ExcelAnt tasks leverage the Ant logging so you can add the -verbose and -debug flags to + the Ant command line to get more detail. Try adding -verbose. Here is what + you should see:
+ +We see a little more detail. Notice that we see that there is a setting for global precision. + Up until now we've been setting the precision on each evaluate that we call. This + is obviously useful but it gets cumbersome. It would be better if there were a way + that we could specify a global precision - and there is. There is a <precision> + tag that you can specify as a child of the <excelant> tag. Let's go back to + our original task we set up earlier and modify it:
+ +In this example we have set the global precision to 1.0e-3. This means that + in the absence of something more stringent, all tests in the task will use + the global precision. We can still override this by specifying the + precision attribute on any individual <evaluate> task. Let's first run + this task with the global precision and the -verbose flag:
+ +As the output clearly shows, the test itself has no precision but there is + the global precision. Additionally, it tells us we're going to use that + more stringent global value. Now suppose that for this test we want + to use a more stringent precision, say 1.0e-4. We can do that by adding + the precision attribute back to the <evaluate> task:
+ +Now when you re-run this test with the verbose flag you will see that + your test ran and passed with the higher precision:
+POI has an excellent feature (besides ExcelAnt) called User Defined Functions, + that allows you to write Java code that will be used in place of custom VB + code or macros in a spreadsheet. If you have read the documentation and written + your own FreeRefFunction implementations, ExcelAnt can make use of this code. + For each <excelant> task you define you can nest a <udf> tag + which allows you to specify the function alias and the class name.
+ +Consider the previous example of the mortgage calculator. What if, instead + of being a formula in a cell, it was a function defined in a VB macro? As luck + would have it, we already have an example of this in the examples from the + User Defined Functions example, so let's use that. In the example spreadsheet + there is a tab for MortgageCalculatorFunction, which we will use. If you look in + cell B4, you see that rather than a messy cell based formula, there is only the function + call. Let's not get bogged down in the function/Java implementation, as these + are covered in the User Defined Function documentation. Let's just add + a new target and test to our existing build file:
+So if you look at this carefully it looks the same as the previous examples. We + still use the global precision, we're still setting values, and we still want + to evaluate a cell. The only real differences are the sheet name and the + addition of the function.
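+As a cross-check on the numbers used in the walkthrough above, the following standalone sketch recomputes the payment for the $240,000, 30 year, 11% loan and shows what a precision setting means for a comparison. It uses only the Java standard library. The class name MortgageCheck, the standard annuity formula, and the absolute-difference comparison rule are all assumptions made for illustration; they are not taken from the spreadsheet or from ExcelAnt's source.

```java
final class MortgageCheck {

    /** Standard annuity formula: P * r / (1 - (1 + r)^-n), with r the monthly rate. */
    static double monthlyPayment(double principal, double annualRatePercent, int years) {
        double r = annualRatePercent / 100.0 / 12.0; // monthly interest rate
        int n = years * 12;                          // number of monthly payments
        return principal * r / (1.0 - Math.pow(1.0 + r, -n));
    }

    /** Assumed comparison rule: pass when |expected - actual| does not exceed the precision. */
    static boolean withinPrecision(double expected, double actual, double precision) {
        return Math.abs(expected - actual) <= precision;
    }

    public static void main(String[] args) {
        double payment = monthlyPayment(240_000, 11.0, 30);
        System.out.printf("monthly payment: %.4f%n", payment);

        // The same pair of values can pass a loose precision and fail a tighter one.
        System.out.println(withinPrecision(1.0, 1.0005, 1.0e-3)); // difference 0.0005 fits in 0.001
        System.out.println(withinPrecision(1.0, 1.0005, 1.0e-4)); // difference 0.0005 exceeds 0.0001
    }
}
```

+The computed payment comes out a little over $2,285, in line with the figure quoted earlier. The two comparisons at the end show why tightening the precision from 1.0e-3 to 1.0e-4 makes a test stricter.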
++ This document describes the current state of formula support in POI. + The information in this document currently applies to the 3.13 version of POI. + Since this area is a work in progress, this document will be updated with new + features as and when they are added. +
+ ++ In org.apache.poi.ss.usermodel.Cell + setCellFormula("formulaString") is used to add a + formula to a sheet, and getCellFormula() is used to retrieve + the string representation of a formula. +
++ We aim to support the complete Excel grammar for formulas. Thus, the string that + you pass in to the setCellFormula call should be what you expect to + type into Excel. Also, note that you should NOT add a "=" to the front of the string. +
++ Please note that localized versions of Excel allow you to enter localized + function names. However, internally Excel stores the English names and thus POI + only supports these and not the localized ones. Also note that only commas may be + used to separate arguments, as per the Excel English style; alternate delimiters + used in other localizations are not supported. +
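+A minimal sketch of the "no leading =" rule, using only the Java standard library. The helper FormulaText.normalize is hypothetical and not part of POI. It only strips a leading "=" from text a user typed the Excel way, so the result could be handed to setCellFormula. It makes no attempt to translate localized function names or argument separators.

```java
final class FormulaText {

    /** Returns the formula text without a leading "=" and without surrounding whitespace. */
    static String normalize(String typed) {
        String s = typed.trim();
        return s.startsWith("=") ? s.substring(1).trim() : s;
    }

    public static void main(String[] args) {
        System.out.println(normalize("=SUM(A1:A3)"));  // leading "=" removed
        System.out.println(normalize("AVERAGE(B1,B2)")); // already in the expected form
    }
}
```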
+To get the list of formula functions that POI supports, you need to + call some code!
+The methods you need are available on + org.apache.poi.ss.formula.eval.FunctionEval. + To find which functions your copy of Apache POI supports, use + getSupportedFunctionNames() + to get a list of the implemented function names. For the list of functions that + POI knows the name of, but doesn't currently implement, use + getNotSupportedFunctionNames() +
++ Formulas in Excel are stored as sequences of tokens in Reverse Polish Notation order. The + open office XLS spec is the best + documentation you will find for the format. +
+ ++ The tokens used by Excel are modeled as individual *Ptg classes in the + org.apache.poi.ss.formula.ptg package. +
++ The task of parsing a formula string into an array of RPN ordered tokens is done by the + org.apache.poi.ss.formula.FormulaParser class. This class implements a hand + written recursive descent parser. +
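+To illustrate what a hand-written recursive descent parser that emits RPN-ordered tokens looks like, here is a deliberately tiny sketch. It is not POI's FormulaParser and it produces plain strings rather than Ptg objects. The class name RpnSketch and its toy grammar (numbers, cell references, + - * / and parentheses) are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;

final class RpnSketch {
    private final String src;
    private int pos = 0;
    private final List<String> out = new ArrayList<>();

    private RpnSketch(String src) {
        this.src = src.replace(" ", "");
    }

    /** Parses a small arithmetic formula and returns its tokens in RPN order. */
    static List<String> toRpn(String formula) {
        RpnSketch p = new RpnSketch(formula);
        p.expr();
        if (p.pos != p.src.length()) {
            throw new IllegalArgumentException("Unexpected input at position " + p.pos);
        }
        return p.out;
    }

    // expr := term (('+' | '-') term)*
    private void expr() {
        term();
        while (pos < src.length() && (peek() == '+' || peek() == '-')) {
            char op = src.charAt(pos++);
            term();
            out.add(String.valueOf(op)); // operator is emitted after both operands
        }
    }

    // term := factor (('*' | '/') factor)*
    private void term() {
        factor();
        while (pos < src.length() && (peek() == '*' || peek() == '/')) {
            char op = src.charAt(pos++);
            factor();
            out.add(String.valueOf(op));
        }
    }

    // factor := '(' expr ')' | operand, where an operand is a number or a reference such as A1
    private void factor() {
        if (pos < src.length() && peek() == '(') {
            pos++; // consume '('
            expr();
            if (pos >= src.length() || peek() != ')') {
                throw new IllegalArgumentException("Missing ')' at position " + pos);
            }
            pos++; // consume ')'
            return;
        }
        int start = pos;
        while (pos < src.length() && (Character.isLetterOrDigit(peek()) || peek() == '.')) {
            pos++;
        }
        if (start == pos) {
            throw new IllegalArgumentException("Operand expected at position " + pos);
        }
        out.add(src.substring(start, pos));
    }

    private char peek() {
        return src.charAt(pos);
    }

    public static void main(String[] args) {
        System.out.println(toRpn("A1+B1*2"));
        System.out.println(toRpn("(A1+B1)*2"));
    }
}
```

+Each grammar rule is one method, and operator precedence falls out of which method calls which. Operands are appended as soon as they are read and operators only after their right-hand operand, which is what yields the RPN ordering.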
++ Formula tokens in Excel are stored in one of three possible operand classes : + Reference, Value and Array. Based on the location of a token, its class can change + in complicated and undocumented ways. While we have support for most cases, we + are not sure if we have covered all bases (since there is no documentation for this area.) + We would therefore like you to report any + occurrence of #VALUE! in a cell upon opening a POI generated workbook in excel. (Check that + typing the formula into Excel directly gives a valid result.) +
+Check out the javadocs for details. +
++ You might find the + 'Excel 97 Developer's Kit' (out of print, Microsoft Press, no + restrictive covenants, available on Amazon.com) helpful for + understanding the file format. +
++ Also useful is the open office XLS spec. We + are collaborating with the maintainer of the spec so if you think you can add something to their + document just send through your changes. +
++ Low level records can be time consuming to create. We created a record + generator to help generate some of the simpler tasks. +
++ We use XML + descriptors to generate the Java code (which sure beats the heck out of + the PERL scripts originally used ;-) for low level records. The + generator is kinda alpha-ish right now and could use some enhancement, + so you may find that to be about 1/2 of the work. Notice this is in + org.apache.poi.hssf.record.definitions. +
+One thing to note: If you are making a large code contribution we need to ensure + any participants in this process have never + signed a "Non Disclosure Agreement" with Microsoft, and have not + received any information covered by such an agreement. If they have + they'll not be able to participate in the POI project. For large contributions we + may ask you to sign an agreement.
+Ask in the dev mailing list for advice.
+Make sure you read the contributing section + as it contains more general information about contributing to POI in general.
+This release of the how-to outlines functionality for the + current svn trunk. + Those looking for information on previous releases should + look in the documentation distributed with that release.
++ HSSF allows numeric, string, date or formula cell values to be written to + or read from an XLS file. Also + in this release is row and column sizing, cell styling (bold, + italics, borders, etc.), and support for both built-in and user + defined data formats. Also available is + an event-based API for reading XLS files. + It differs greatly from the read/write API + and is intended for intermediate developers who need a smaller + memory footprint. +
+There are a few different ways to access the HSSF API. These + have different characteristics, so you should read up on + all to select the best for you.
+ +The high level API (package: org.apache.poi.ss.usermodel) + is what most people should use. Usage is very simple. +
+Workbooks are created by creating an instance of + org.apache.poi.ss.usermodel.Workbook. Either create + a concrete class directly + (org.apache.poi.hssf.usermodel.HSSFWorkbook or + org.apache.poi.xssf.usermodel.XSSFWorkbook), or use + the handy factory class + org.apache.poi.ss.usermodel.WorkbookFactory. +
+Sheets are created by calling createSheet() from an existing + instance of Workbook, the created sheet is automatically added in + sequence to the workbook. Sheets do not in themselves have a sheet + name (the tab at the bottom); you set + the name associated with a sheet by calling + Workbook.setSheetName(sheetindex,"SheetName",encoding). + For HSSF, the name may be in 8bit format + (HSSFWorkbook.ENCODING_COMPRESSED_UNICODE) + or Unicode (HSSFWorkbook.ENCODING_UTF_16). Default + encoding for HSSF is 8bit per char. For XSSF, the name + is automatically handled as unicode. +
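+The choice between the two HSSF name encodings described above can be sketched with the standard library alone. A name fits the 8-bit form only if every character has a zero high byte. The helper name SheetNameEncoding.fitsInEightBits is hypothetical and not part of POI; it only illustrates the check, and the returned flag would still have to be mapped to the appropriate encoding constant by the caller.

```java
final class SheetNameEncoding {

    /** True if every character is below U+0100, so one byte per character is enough. */
    static boolean fitsInEightBits(String name) {
        for (int i = 0; i < name.length(); i++) {
            if (name.charAt(i) > 0xFF) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(fitsInEightBits("Sheet1"));       // plain ASCII
        System.out.println(fitsInEightBits("\u00DCbersicht")); // U+00DC still fits in one byte
        System.out.println(fitsInEightBits("\u65E5\u672C"));   // CJK characters need 16 bits
    }
}
```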
+Rows are created by calling createRow(rowNumber) from an existing + instance of Sheet. Only rows that have cell values should be + added to the sheet. To set the row's height, you just call + setRowHeight(height) on the row object. The height must be given in + twips, or 1/20th of a point. If you prefer, there is also a + setRowHeightInPoints method. +
+Cells are created by calling createCell(column, type) from an + existing Row. Only cells that have values should be added to the + row. Cells should have their cell type set to either + Cell.CELL_TYPE_NUMERIC or Cell.CELL_TYPE_STRING depending on + whether they contain a numeric or textual value. Cells must also have + a value set. Set the value by calling setCellValue with either a + String or double as a parameter. Individual cells do not have a + width; you must call setColumnWidth(colindex, width) (use units of + 1/256th of a character) on the Sheet object. (You can't do it on + an individual basis in the GUI either).
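+The two units mentioned above (twips for row heights, 1/256th of a character for column widths) are easy to mix up, so here is a small standard-library sketch of the arithmetic. The class SheetUnits and its method names are hypothetical helpers, not POI API, and the character-based width is only a unit conversion. How wide a "character" actually renders depends on the workbook's default font.

```java
final class SheetUnits {

    /** Row height: 1 point = 20 twips. */
    static int pointsToTwips(double points) {
        return (int) Math.round(points * 20);
    }

    /** Column width: 1 character = 256 width units. */
    static int charsToWidthUnits(double characters) {
        return (int) Math.round(characters * 256);
    }

    public static void main(String[] args) {
        System.out.println(pointsToTwips(12.75));    // 12.75 * 20 = 255 twips
        System.out.println(charsToWidthUnits(10.0)); // 10 * 256 = 2560 width units
    }
}
```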
+Cells are styled with CellStyle objects which in turn contain + a reference to a Font object. These are created via the + Workbook object by calling createCellStyle() and createFont(). + Once you create the object you must set its parameters (colors, + borders, etc). To set a font for a CellStyle call + setFont(fontobj). +
+Once you have generated your workbook, you can write it out by + calling write(outputStream) from your instance of Workbook, passing + it an OutputStream (for instance, a FileOutputStream or + ServletOutputStream). You must close the OutputStream yourself. HSSF + does not close it for you. +
+Here is some example code (excerpted and adapted from + org.apache.poi.hssf.dev.HSSF test class):
+Reading in a file is equally simple. To read in a file, create a +new instance of org.apache.poi.poifs.Filesystem, passing in an open InputStream, such as a FileInputStream +for your XLS, to the constructor. Construct a new instance of +org.apache.poi.hssf.usermodel.HSSFWorkbook passing the +Filesystem instance to the constructor. From there you have access to +all of the high level model objects through their accessor methods +(workbook.getSheet(sheetNum), sheet.getRow(rownum), etc). +
+Modifying the file you have read in is simple. You retrieve the +object via an accessor method, remove it via a parent object's remove +method (sheet.removeRow(hssfrow)) and create objects just as you +would if creating a new xls. When you are done modifying cells just +call workbook.write(outputstream) just as you did above.
+An example of this can be seen in +org.apache.poi.hssf.usermodel.examples.HSSFReadWrite.
+The event API is newer than the User API. It is intended for intermediate + developers who are willing to learn a little bit of the low level API + structures. It's relatively simple to use, but requires a basic + understanding of the parts of an Excel file (or willingness to + learn). The advantage provided is that you can read an XLS with a + relatively small memory footprint. +
+One important thing to note with the basic Event API is that it + triggers events only for things actually stored within the file. + With the XLS file format, it is quite common for things that + have yet to be edited to simply not exist in the file. This means + there may well be apparent "gaps" in the record stream, which + you either need to work around, or use the + Record Aware extension + to the Event API.
+To use this API you construct an instance of + org.apache.poi.hssf.eventmodel.HSSFRequest. Register a class you + create that supports the + org.apache.poi.hssf.eventmodel.HSSFListener interface using the + HSSFRequest.addListener(yourlistener, recordsid). The recordsid + should be a static reference number (such as BOFRecord.sid) contained + in the classes in org.apache.poi.hssf.record. The trick is you + have to know what these records are. Alternatively you can call + HSSFRequest.addListenerForAllRecords(mylistener). In order to learn + about these records you can either read all of the javadoc in the + org.apache.poi.hssf.record package or you can just hack up a + copy of org.apache.poi.hssf.dev.EFHSSF and adapt it to your + needs. TODO: better documentation on records.
+Once you've registered your listeners in the HSSFRequest object + you can construct an instance of + org.apache.poi.poifs.filesystem.FileSystem (see POIFS howto) and + pass it your XLS file inputstream. You can either pass this, along + with the request you constructed, to an instance of HSSFEventFactory + via the HSSFEventFactory.processWorkbookEvents(request, Filesystem) + method, or you can get an instance of DocumentInputStream from + Filesystem.createDocumentInputStream("Workbook") and pass + it to HSSFEventFactory.processEvents(request, inputStream). Once you + make this call, the listeners that you constructed receive calls to + their processRecord(Record) methods with each Record they are + registered to listen for until the file has been completely read. +
+A code excerpt from org.apache.poi.hssf.dev.EFHSSF (which is + in CVS or the source distribution) is reprinted below with excessive + comments:
++This is an extension to the normal +Event API. With this, your listener +will be called with extra, dummy records. These dummy records should +alert you to records which aren't present in the file (eg cells that have +yet to be edited), and allow you to handle these. +
++There are three dummy records that your HSSFListener will be called with: +
++To use the Record Aware Event API, you should create an +org.apache.poi.hssf.eventusermodel.MissingRecordAwareHSSFListener, and pass +it your HSSFListener. Then, register the MissingRecordAwareHSSFListener +to the event model, and start that as normal. +
+
+One example use for this API is to write a CSV outputter, which always
+outputs a minimum number of columns, even where the file doesn't contain
+some of the rows or cells. It can be found at
+/poi-examples/src/main/java/org/apache/poi/examples/hssf/eventusermodel/XLS2CSVmra.java,
+and may be called on the command line, or from within your own code.
+The latest version is always available from
+subversion.
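+The gap-filling idea behind such a CSV outputter can be sketched without POI at all. In the hypothetical helper below, a row is represented as a map from column index to text, and any column that has no entry is written as an empty field, which is the role the dummy records play in the Record Aware API. The class name SparseRowCsv and the map-based row model are assumptions for illustration, and the sketch does no CSV quoting or escaping.

```java
import java.util.Map;
import java.util.TreeMap;

final class SparseRowCsv {

    /** Renders one row as CSV, emitting empty fields for columns that were never stored. */
    static String toCsv(Map<Integer, String> cellsByColumn, int minColumns) {
        TreeMap<Integer, String> sorted = new TreeMap<>(cellsByColumn);
        int lastColumn = sorted.isEmpty() ? -1 : sorted.lastKey();
        int columns = Math.max(minColumns, lastColumn + 1);

        StringBuilder sb = new StringBuilder();
        for (int col = 0; col < columns; col++) {
            if (col > 0) {
                sb.append(',');
            }
            // A missing key stands in for a cell that does not exist in the file.
            sb.append(sorted.getOrDefault(col, ""));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<Integer, String> row = new TreeMap<>();
        row.put(0, "a");
        row.put(3, "d");
        System.out.println(toCsv(row, 5)); // columns 1, 2 and 4 are written as empty fields
    }
}
```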
+
+In POI versions before 3.0.3, this code lived in the scratchpad section. + If you're using one of these older versions of POI, you will either + need to include the scratchpad jar on your classpath, or build from a + subversion checkout. +
+If memory footprint is an issue, then for XSSF, you can get at + the underlying XML data, and process it yourself. This is intended + for intermediate developers who are willing to learn a little bit of + low level structure of .xlsx files, and who are happy processing + XML in Java. It's relatively simple to use, but requires a basic + understanding of the file structure. The advantage provided is that + you can read an XLSX file with a relatively small memory footprint. +
+One important thing to note with the basic Event API is that it + triggers events only for things actually stored within the file. + With the XLSX file format, it is quite common for things that + have yet to be edited to simply not exist in the file. This means + there may well be apparent "gaps" in the record stream, which + you need to work around.
+To use this API you construct an instance of + org.apache.poi.xssf.eventmodel.XSSFReader. This will optionally + provide a nice interface on the shared strings table, and the styles. + It provides methods to get the raw xml data from the rest of the + file, which you will then pass to SAX.
+This example shows how to get at a single known sheet, or at + all sheets in the file. It is based on the example in + svn + poi-examples/src/main/java/org/apache/poi/examples/xssf/eventusermodel/FromHowTo.java
++ For a fuller example, including support for fetching number formatting + information and applying it to numeric cells (eg to format dates or + percentages), please see + the XLSX2CSV example in svn +
+An example is also provided + showing how to combine the user API and the SAX API by doing a streaming parse + of larger worksheets and a traditional user-model parse of the rest of a workbook.
++ SXSSF (package: org.apache.poi.xssf.streaming) is an API-compatible streaming extension of XSSF to be used when + very large spreadsheets have to be produced, and heap space is limited. + SXSSF achieves its low memory footprint by limiting access to the rows that + are within a sliding window, while XSSF gives access to all rows in the + document. Older rows that are no longer in the window become inaccessible, + as they are written to the disk. +
++ You can specify the window size at workbook construction time via new SXSSFWorkbook(int windowSize) + or you can set it per-sheet via SXSSFSheet#setRandomAccessWindowSize(int windowSize) +
++ When a new row is created via createRow() and the total number + of unflushed records would exceed the specified window size, then the + row with the lowest index value is flushed and cannot be accessed + via getRow() anymore. +
++ The default window size is 100 and defined by SXSSFWorkbook.DEFAULT_WINDOW_SIZE. +
++ A windowSize of -1 indicates unlimited access. In this case all + records that have not been flushed by a call to flushRows() are available + for random access. +
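+The windowing rules above can be modelled in a few lines of plain Java. The sketch below is a simulation only: RowWindowSketch is a hypothetical class, rows are plain strings, and "flushing" just discards the row where a real implementation would write it to a temporary file. It mirrors the described behaviour of createRow(), getRow(), flushRows() and a window size of -1, but it is not SXSSF code.

```java
import java.util.TreeMap;

final class RowWindowSketch {
    private final int windowSize; // -1 means rows are never flushed automatically
    private final TreeMap<Integer, String> inMemory = new TreeMap<>();
    private int flushedCount = 0;

    RowWindowSketch(int windowSize) {
        this.windowSize = windowSize;
    }

    /** Adds a row; if the window is now over-full, the lowest-indexed rows are flushed. */
    void createRow(int rowNum, String data) {
        inMemory.put(rowNum, data);
        if (windowSize >= 0) {
            flushRows(windowSize);
        }
    }

    /** Flushes lowest-indexed rows until at most keepRows remain accessible. */
    void flushRows(int keepRows) {
        while (inMemory.size() > keepRows) {
            inMemory.pollFirstEntry(); // stand-in for writing the row to disk
            flushedCount++;
        }
    }

    /** Returns the row data, or null if the row was flushed or never created. */
    String getRow(int rowNum) {
        return inMemory.get(rowNum);
    }

    int flushedCount() {
        return flushedCount;
    }

    public static void main(String[] args) {
        RowWindowSketch sheet = new RowWindowSketch(100);
        for (int r = 0; r <= 100; r++) {
            sheet.createRow(r, "row " + r);
        }
        System.out.println(sheet.getRow(0)); // flushed when the 101st row was created
        System.out.println(sheet.getRow(1)); // still inside the window
    }
}
```

+With a window of 100, creating the 101st row pushes row 0 out of reach. With a window of -1 nothing is flushed until flushRows() is called explicitly.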
++ Note that SXSSF allocates temporary files that you must always clean up explicitly, by calling the dispose method. +
++ SXSSFWorkbook defaults to using inline strings instead of a shared strings + table. This is very efficient, since no document content needs to be kept in + memory, but is also known to produce documents that are incompatible with + some clients. With shared strings enabled all unique strings in the document + have to be kept in memory. Depending on your document content this could use + a lot more resources than with shared strings disabled. +
++ Please note that some features may still consume a large + amount of memory, depending on which ones you are using. For example, merged regions, + hyperlinks and comments are still only stored in memory and thus may require a lot of + memory if used extensively. +
++ Carefully review your memory budget and compatibility needs before deciding + whether to enable shared strings or not. +
+The example below writes a sheet with a window of 100 rows. When the row count reaches 101, + the row with rownum=0 is flushed to disk and removed from memory; when the row count reaches 102, the row with rownum=1 is flushed, and so on. +
+ + +The next example turns off auto-flushing (windowSize=-1) and the code manually controls how portions of data are written to disk
+SXSSF flushes sheet data to temporary files (one temp file per sheet) and the size of these temporary files +can grow to a very large value. For example, for 20 MB of CSV data the size of the temp XML becomes more than a gigabyte. +If the size of the temp files is an issue, you can tell SXSSF to use gzip compression: +
+The low level API is not much to look at. It consists of lots of +"Records" in the org.apache.poi.hssf.record.* package, +and a set of helper classes in org.apache.poi.hssf.model.*. The +record classes are consistent with the low level binary structures +inside a BIFF8 file (which is embedded in a POIFS file system). You +probably need the book: "Microsoft Excel 97 Developer's Kit" +from Microsoft Press in order to understand how these fit together +(out of print but easily obtainable from Amazon's used books). In +order to gain a good understanding of how to use the low level APIs +you should view the source in org.apache.poi.hssf.usermodel.* and +the classes in org.apache.poi.hssf.model.*. You should read the +documentation for the POIFS libraries as well.
+If you wish to generate an XLS file from some XML, it is possible to +write your own XML processing code, then use the User API to write out +the document.
+The other option is to use Cocoon. +In Cocoon, there is the HSSF Serializer, +which takes in XML (in the gnumeric format), and outputs an XLS file for you.
+The HSSF application is nothing more than a test for the high +level API (and indirectly the low level support). The main body of +its code is repeated above. To run it: +
+export HSSFDIR={wherever you put HSSF's jar files}
+export LOG4JDIR={wherever you put LOG4J's jar files}
+export CLASSPATH=$CLASSPATH:$HSSFDIR/hssf.jar:$HSSFDIR/poi-poifs.jar:$HSSFDIR/poi-util.jar:$LOG4JDIR/log4j.jar
+ java org.apache.poi.hssf.dev.HSSF ~/myxls.xls write
+This should generate a test sheet in your home directory called "myxls.xls".
java org.apache.poi.hssf.dev.HSSF ~/input.xls output.xls
+ HSSF has a number of tools useful for developers to debug/develop +stuff using HSSF (and more generally XLS files). We've already +discussed the app for testing HSSF read/write/modify capabilities; +now we'll talk a bit about BiffViewer. Early on in the development of +HSSF, it was decided that knowing what was in a record, what was +wrong with it, etc. was virtually impossible with the available +tools. So we developed BiffViewer. You can find it at +org.apache.poi.hssf.dev.BiffViewer. It performs two basic +functions and a derivative. +
+The first is "biffview". To do this you run it (assumes +you have everything setup in your classpath and that you know what +you're doing enough to be thinking about this) with an xls file as a +parameter. It will give you a listing of all understood records with +their data and a list of not-yet-understood records with no data +(because it doesn't know how to interpret them). This listing is +useful for several things. First, you can look at the values and SEE +what is wrong in quasi-English. Second, you can send the output to a +file and compare it. +
+The second function is "big freakin dump", just pass a +file and a second argument matching "bfd" exactly. This +will just make a big hexdump of the file. +
+Lastly, there is "mixed" mode which does the same as +regular biffview, only it includes hex dumps of certain records +intertwined. To use that just pass a file with a second argument +matching "on" exactly.
+In the next release cycle we'll also have something called a +FormulaViewer. The class is already there, but it's not very useful +yet. When it does something, we'll document it.
+ +Further effort on HSSF is going to focus on the following major areas:
+HSSF is the POI Project's pure Java implementation of the + Excel '97(-2007) file format. XSSF is the POI Project's pure + Java implementation of the Excel 2007 OOXML (.xlsx) file + format.
+HSSF and XSSF provide ways to create, + modify, read and write XLS and XLSX spreadsheets. They provide: +
+People converting from the pure HSSF usermodel who wish + to use the joint SS Usermodel for HSSF and XSSF support should + see the ss usermodel converting + guide. +
++ An alternate way of generating a spreadsheet is via the Cocoon serializer (yet you'll still be using HSSF indirectly). + With Cocoon you can serialize any XML datasource (which might be an ESQL page outputting in SQL for instance) by simply + applying the stylesheet and designating the serializer. +
++ If you're merely reading spreadsheet data, then use the + eventmodel api in either the org.apache.poi.hssf.eventusermodel + package, or the org.apache.poi.xssf.eventusermodel package, depending + on your file format. +
++ If you're modifying spreadsheet data then use the usermodel api. You + can also generate spreadsheets this way. +
++ Note that the usermodel system has a higher memory footprint than + the low level eventusermodel, but has the major advantage of being + much simpler to work with. Also please be aware that as the new + XSSF supported Excel 2007 OOXML (.xlsx) files are XML based, + the memory footprint for processing them is higher than for the + older HSSF supported (.xls) binary files. +
+ + + +Since 3.8-beta3, POI provides a low-memory footprint SXSSF API built on top of XSSF.
++SXSSF is an API-compatible streaming extension of XSSF to be used when +very large spreadsheets have to be produced, and heap space is limited. +SXSSF achieves its low memory footprint by limiting access to the rows that +are within a sliding window, while XSSF gives access to all rows in the +document. Older rows that are no longer in the window become inaccessible, +as they are written to the disk. +
++In auto-flush mode the size of the access window can be specified, to hold a certain number of rows in memory. +When that value is reached, the creation of an additional row causes the row with the lowest index to be +removed from the access window and written to disk. Or, the window size can be set to grow dynamically; +it can be trimmed periodically by an explicit call to flushRows(int keepRows) as needed. +
++Due to the streaming nature of the implementation, there are the following +limitations when compared to XSSF: +
+See more details at SXSSF How-To
+ +The table below synopsizes the comparative features of POI's Spreadsheet API:
+Spreadsheet API Feature Summary
+ +
+
+
+ The intent of this document is to outline some of the known limitations of the + POI HSSF and XSSF APIs. It is not intended to be a complete list of every bug + or missing feature of HSSF or XSSF; rather, its purpose is to provide a broad + feel for some of the functionality that is missing or broken. +
++ Want to use HSSF and XSSF to read and write spreadsheets in a hurry? This + guide is for you. If you're after more in-depth coverage of the HSSF and + XSSF user-APIs, please consult the HOWTO + guide as it contains actual descriptions of how to use this stuff. +
+When opening a workbook, either a .xls HSSFWorkbook, or a .xlsx + XSSFWorkbook, the Workbook can be loaded from either a File + or an InputStream. Using a File object allows for + lower memory consumption, while an InputStream requires more + memory as it has to buffer the whole file.
+If using WorkbookFactory, it's very easy to use one or + the other:
+If using HSSFWorkbook or XSSFWorkbook directly, + you should generally go through POIFSFileSystem or + OPCPackage, to have full control of the lifecycle (including + closing the file when done):
+Sometimes, you'd like to just iterate over all the sheets in + a workbook, all the rows in a sheet, or all the cells in a row. + This is possible with a simple for loop.
+These iterators are available by calling workbook.sheetIterator(), + sheet.rowIterator(), and row.cellIterator(), or + implicitly using a for-each loop. + Note that a rowIterator and cellIterator iterate over rows or + cells that have been created, skipping empty rows and cells.
+ +In some cases, when iterating, you need full control over how + missing or blank rows and cells are treated, and you need to ensure + you visit every cell and not just those defined in the file. (The + CellIterator will only return the cells defined in the file, which + is largely those with values or stylings, but it depends on Excel).
+In cases such as these, you should fetch the first and last column + information for a row, then call getCell(int, MissingCellPolicy) + to fetch the cell. Use a + MissingCellPolicy + to control how blank or null cells are handled.
+To get the contents of a cell, you first need to + know what kind of cell it is (asking a string cell + for its numeric contents will get you an + IllegalStateException, for example). So, you will + want to switch on the cell's type, and then call + the appropriate getter for that cell.
+In the code below, we loop over every cell + in one sheet, print out the cell's reference + (e.g. A3), and then the cell's contents.
+For most text extraction requirements, the standard + ExcelExtractor class should provide all you need.
+For very fancy text extraction, XLS to CSV etc, + take a look at + /poi-examples/src/main/java/org/apache/poi/examples/hssf/eventusermodel/XLS2CSVmra.java +
++ Note, the maximum number of unique fonts in a workbook is limited to 32767. You should re-use fonts in your applications instead of + creating a font for each cell. +Examples: +
+Wrong:
+Correct:
+HSSF:
+XSSF:
++ The convenience functions provide + utility features such as setting borders around merged + regions and changing style attributes without explicitly + creating new styles. +
++ The zoom is expressed as a fraction. For example to + express a zoom of 75% use 3 for the numerator and + 4 for the denominator. +
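As a small illustration of the fraction form, the sketch below reduces a zoom percentage to a numerator and denominator. It is plain Java with no POI calls; ZoomFractionSketch and percentToFraction are hypothetical names, not part of the API.

```java
/** Reduces a zoom percentage to a numerator/denominator pair (illustrative helper only). */
class ZoomFractionSketch {

    /** Greatest common divisor by Euclid's algorithm. */
    static int gcd(int a, int b) {
        while (b != 0) {
            int t = a % b;
            a = b;
            b = t;
        }
        return a;
    }

    /** Returns {numerator, denominator} for the given percentage, in lowest terms. */
    static int[] percentToFraction(int percent) {
        if (percent <= 0) {
            throw new IllegalArgumentException("zoom percentage must be positive");
        }
        int divisor = gcd(percent, 100);
        return new int[] { percent / divisor, 100 / divisor };
    }

    public static void main(String[] args) {
        int[] zoom = percentToFraction(75);
        System.out.println(zoom[0] + "/" + zoom[1]);
    }
}
```

For 75% this yields the numerator 3 and denominator 4 mentioned above.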
++ There are two types of panes you can create; freeze panes and split panes. +
++ A freeze pane is split by columns and rows. You create + a freeze pane using the following mechanism: +
++ sheet1.createFreezePane( 3, 2, 3, 2 ); +
++ The first two parameters are the columns and rows you + wish to split by. The second two parameters indicate + the cells that are visible in the bottom right quadrant. +
++ + Split panes appear differently. The split area is + divided into four separate work areas. The split + occurs at the pixel level and the user is able to + adjust the split by dragging it to a new position. +
++ + Split panes are created with the following call: +
++ sheet2.createSplitPane( 2000, 2000, 0, 0, Sheet.PANE_LOWER_LEFT ); +
++ + The first parameter is the x position of the split. + This is in 1/20th of a point. A point in this case + seems to equate to a pixel. The second parameter is + the y position of the split. Again in 1/20th of a point. +
++ The last parameter indicates which pane currently has + the focus. This will be one of Sheet.PANE_LOWER_LEFT, + PANE_LOWER_RIGHT, PANE_UPPER_RIGHT or PANE_UPPER_LEFT. +
++ It's possible to set up repeating rows and columns in + your printouts by using the setRepeatingRows() and + setRepeatingColumns() methods in the Sheet class. +
++ These methods expect a CellRangeAddress parameter + which specifies the range for the rows or columns to + repeat. + For setRepeatingRows(), it should specify a range of + rows to repeat, with the column part spanning all + columns. + For setRepeatingColumns(), it should specify a range of + columns to repeat, with the row part spanning all + rows. + If the parameter is null, the repeating rows or columns + will be removed. +
++ Example is for headers but applies directly to footers. +
++ Example is for headers but applies directly to footers. Note, the above example for + basic headers and footers applies to XSSF Workbooks as well as HSSF Workbooks. The HSSFHeader + stuff does not work for XSSF Workbooks. +
++ XSSF has the ability to handle First page headers and footers, as well as Even/Odd + headers and footers. All Header/Footer Property flags can be handled in XSSF as well. + The odd header and footer is the default header and footer. It is displayed on all + pages that do not display either a first page header or an even page header. That is, + if the Even header/footer does not exist, then the odd header/footer is displayed on + even pages. If the first page header/footer does not exist, then the odd header/footer + is displayed on the first page. If the even/odd property is not set, that is the same as + the even header/footer not existing. If the first page property does not exist, that is + the same as the first page header/footer not existing. +
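The fallback rules above can be summarised as a small decision function. This is a model of the rules as described in the paragraph, not POI code; HeaderFallbackSketch, headerFor and its parameters are hypothetical names, and a null header stands for "does not exist".

```java
/** Models which header is shown on a page, following the fallback rules described above. */
class HeaderFallbackSketch {

    static String headerFor(int pageNumber,
                            String oddHeader, String evenHeader, String firstHeader,
                            boolean differentFirst, boolean differentOddEven) {
        // A first-page header is used only if the property is set and the header exists.
        if (pageNumber == 1 && differentFirst && firstHeader != null) {
            return firstHeader;
        }
        // An even-page header is used only if the property is set and the header exists.
        if (pageNumber % 2 == 0 && differentOddEven && evenHeader != null) {
            return evenHeader;
        }
        // Otherwise the odd header acts as the default.
        return oddHeader;
    }

    public static void main(String[] args) {
        // No even header defined: page 2 falls back to the odd header.
        System.out.println(headerFor(2, "odd", null, "first", true, true));
        // Even/odd property not set: the even header is ignored.
        System.out.println(headerFor(2, "odd", "even", "first", true, false));
    }
}
```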
+
+ POI supports drawing shapes using the Microsoft Office
+ drawing tools. Shapes on a sheet are organized in a
+ hierarchy of groups and shapes. The top-most shape
+ is the patriarch. This is not visible on the sheet
+ at all. To start drawing you need to call createPatriarch
+ on the HSSFSheet class. This has the
+ effect of erasing any other shape information stored
+ in that sheet. By default POI will leave shape
+ records alone in the sheet unless you make a call to
+ this method.
+
+ To create a shape you have to go through the following + steps: +
++ Text boxes are created using a different call: +
++ It's possible to use different fonts to style parts of + the text in the textbox. Here's how: +
+
+ Just as can be done manually using Excel, it is possible
+ to group shapes together. This is done by calling
+ createGroup() and then creating the shapes
+ using those groups.
+
+ It's also possible to create groups within groups. +
++ Here's how to create a shape group: +
+
+ If you're observant you'll notice that the shapes
+ that are added to the group use a new type of anchor:
+ the HSSFChildAnchor. What happens is that
+ the created group has its own coordinate space for
+ shapes that are placed into it. POI defaults this to
+ (0,0,1023,255) but you are able to change it as desired.
+ Here's how:
+
+ If you create a group within a group it's also going + to have its own coordinate space. +
++ By default shapes can look a little plain. It's possible + to apply different styles to the shapes however. The + sorts of things that can currently be done are: +
++ Here's an example of how this is done: +
+
+ While the native POI shape drawing commands are the
+ recommended way to draw shapes in a sheet, it's sometimes
+ desirable to use a standard API for compatibility with
+ external libraries. With this in mind we created some
+ wrappers for Graphics and Graphics2D.
+
Graphics2D is a poor match to the capabilities
+ of the Microsoft Office drawing commands. The older
+ Graphics class offers a closer match but is
+ still a square peg in a round hole.
+
+ All Graphics commands are issued into an HSSFShapeGroup.
+ Here's how it's done:
+
+ The first thing we do is create the group and set its coordinates
+ to match what we plan to draw. Next we calculate a reasonable
+ fontSizeMultiplier then create the EscherGraphics object.
+ Since what we really want is a Graphics2D
+ object we create an EscherGraphics2d object and pass in
+ the graphics object we created. Finally we call a routine
+ that draws into the EscherGraphics2d object.
+
+ The vertical points per pixel deserves some more explanation. + One of the difficulties in converting Graphics calls + into escher drawing calls is that Excel does not have + the concept of absolute pixel positions. It measures + its cell widths in 'characters' and the cell heights in points. + Unfortunately it's not defined exactly what type of character it's + measuring. Presumably this is due to the fact that Excel will be + using different fonts on different platforms or even within the same + platform. +
++ Because of this constraint we've had to implement the concept of a + verticalPointsPerPixel. This is the amount the font should be scaled by when + you issue commands such as drawString(). To calculate this value + use the following formula: +
+
+ The height of the group is calculated fairly simply by calculating the
+ difference between the y coordinates of the bounding box of the shape. The
+ height of the group in points can be obtained using a convenience method called
+ HSSFClientAnchor.getAnchorHeightInPoints().
+
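The formula itself is not reproduced in the text above. As a sketch only, the description suggests dividing the group's height in points by its height in group coordinates (the difference between the y coordinates of its bounding box). That ratio is an assumption inferred from the surrounding text, and the class, method and parameter names below are hypothetical.

```java
/** Sketch of a verticalPointsPerPixel calculation; the ratio used here is an assumption. */
class VerticalPointsPerPixelSketch {

    /**
     * @param groupHeightInPoints height of the group's anchor in points
     * @param y1 top y coordinate of the group's bounding box
     * @param y2 bottom y coordinate of the group's bounding box
     */
    static float verticalPointsPerPixel(float groupHeightInPoints, int y1, int y2) {
        int heightOfGroup = Math.abs(y2 - y1);
        if (heightOfGroup == 0) {
            throw new IllegalArgumentException("group has zero height");
        }
        return groupHeightInPoints / heightOfGroup;
    }

    public static void main(String[] args) {
        // A group spanning y = 0..255 whose anchor is 51 points tall (made-up numbers).
        System.out.println(verticalPointsPerPixel(51f, 0, 255));
    }
}
```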
+ Many of the functions supported by the graphics classes + are not complete. Here are some of the functions that are known + to work: +
++ Functions that are not supported will return and log a message + using the POI logging infrastructure (disabled by default). +
++ Outlines are great for grouping sections of information + together and can be added easily to columns and rows + using the POI API. Here's how: +
++ To collapse (or expand) an outline use the following calls: +
++ The row/column you choose should contain an already + created group. It can be anywhere within the group. +
+
+ Images are part of the drawing support. To add an image just
+ call createPicture() on the drawing patriarch.
+ At the time of writing the following types are supported:
+
+ It should be noted that any existing drawings may be erased + once you add an image to a sheet. +
+Reading images from a workbook:
+
+ Named Range is a way to refer to a group of cells by a name. Named Cell is a
+ degenerate case of Named Range in that the 'group of cells' contains exactly one
+ cell. You can create as well as refer to cells in a workbook by their named range.
+ When working with Named Ranges, the classes org.apache.poi.ss.util.CellReference
+ and org.apache.poi.ss.util.AreaReference are used.
+
+ Note: Using relative references like 'A1:B1' can cause the cell that the name points to + to move unexpectedly when working with the workbook in Microsoft Excel. + Using absolute references like '$A$1:$B$1' usually avoids this; see also + this discussion. +
++ Creating Named Range / Named Cell +
++ Reading from Named Range / Named Cell +
++ Reading from non-contiguous Named Ranges +
++ Note, when a cell is deleted, Excel does not delete the + attached named range. As a result, a workbook can contain + named ranges that point to cells that no longer exist. + You should check the validity of a reference before + constructing an AreaReference. +
++ A comment is a rich text note that is attached to and + associated with a cell, separate from other cell content. + Comment content is stored separately from the cell, and is displayed in a drawing object (like a text box) + that is separate from, but associated with, a cell. +
++ Reading cell comments +
+To get all the comments on a sheet:
++ For SXSSFWorkbooks only, because the random access window is likely to exclude most of the rows + in the worksheet, which are needed for computing the best-fit width of a column, the columns must + be tracked for auto-sizing prior to flushing any rows. +
++ Note, that Sheet#autoSizeColumn() does not evaluate formula cells, + the width of formula cells is calculated based on the cached formula result. + If your workbook has many formulas then it is a good idea to evaluate them before auto-sizing. +
+ java.awt.headless=true .
+ You should also ensure that the fonts you use in your workbook are
+ available to Java.
+ + As of version 3.8, POI has slightly different syntax to work with data validations with .xls and .xlsx formats. +
+Check the value a user enters into a cell against one or more predefined value(s).
+The following code will limit the value the user can enter into cell A1 to one of three integer values, 10, 20 or 30.
+Drop Down Lists:
+This code will do the same but offer the user a drop down list to select a value from.
+Messages On Error:
+To create a message box that will be shown to the user if the value they enter is invalid.
+Replace 'Box Title' with the text you wish to display in the message box's title bar + and 'Message Text' with the text of your error message.
+Prompts:
+To create a prompt that the user will see when the cell containing the data validation receives focus
+The text encapsulated in the first parameter passed to the createPromptBox() method will appear emboldened + and as a title to the prompt whilst the second will be displayed as the text of the message. + The createExplicitListConstraint() method can be passed an array of String(s) containing integer, floating point, date or text values.
+ +Further Data Validations:
+To obtain a validation that would check the value entered was, for example, an integer between 10 and 100, + use the DVConstraint.createNumericConstraint(int, int, String, String) factory method.
+Look at the javadoc for the other validation and operator types; also note that not all validation + types are supported for this method. The values passed to the two String parameters can be formulas; the '=' symbol is used to denote a formula.
+It is not possible to create a drop down list if the createNumericConstraint() method is called, + the setSuppressDropDownArrow(false) method call will simply be ignored.
+Date and time constraints can be created by calling the createDateConstraint(int, String, String, String) + or the createTimeConstraint(int, String, String). Both are very similar to the above and are explained in the javadoc.
+Creating Data Validations From Spreadsheet Cells.
+The contents of specific cells can be used to provide the values for the data validation + and the DVConstraint.createFormulaListConstraint(String) method supports this. + To specify that the values come from a contiguous range of cells do either of the following:
+or
+and in both cases the user will be able to select from a drop down list containing the values from cells A1, A2 and A3.
+The data does not have to be on the same sheet as the data validation. To select the data from a different sheet, however, the sheet + must be given a name when created and that name should be used in the formula. So assuming the existence of a sheet named 'Data Sheet' this will work:
+as will this:
+whilst this will not:
+and nor will this:
+Data validations work similarly when you are creating an xml based, SpreadsheetML, +workbook file; but there are differences. Explicit casts are required, for example, +in a few places as much of the support for data validations in the xssf stream was +built into the unifying ss stream, of which more later. Other differences are +noted with comments in the code. +
+ +Check the value the user enters into a cell against one or more predefined value(s).
+Drop Down Lists:
+This code will do the same but offer the user a drop down list to select a value from.
+Note that the call to the setSuppressDropDownArrow() method can either be simply excluded or replaced with:
+Prompts and Error Messages:
++These both exactly mirror the hssf.usermodel so please refer to the 'Messages On Error:' and 'Prompts:' sections above. +
+ +Further Data Validations:
++To obtain a validation that would check the value entered was, for example, +an integer between 10 and 100, use the XSSFDataValidationHelper's createNumericConstraint(int, int, String, String) factory method. +
++The values passed to the final two String parameters can be formulas; the '=' symbol is used to denote a formula. +Thus, the following would create a validation that allows values only if they fall between the results of summing two cell ranges: +
++It is not possible to create a drop down list if the createNumericConstraint() method is called, +the setSuppressDropDownArrow(true) method call will simply be ignored. +
++Please check the javadoc for other constraint types as examples for those will not be included here. +There are, for example, methods defined on the XSSFDataValidationHelper class allowing you to create +the following types of constraint; date, time, decimal, integer, numeric, formula, text length and custom constraints. +
+Creating Data Validations From Spread Sheet Cells:
++One other type of constraint not mentioned above is the formula list constraint. +It allows you to create a validation that takes its value(s) from a range of cells. This code +
++would create a validation that took its values from cells in the range A1 to F1. +
++The usefulness of this technique can be extended if you use named ranges like this; +
+ ++OpenOffice Calc has slightly different rules with regard to the scope of names. +Excel supports both Workbook and Sheet scope for a name but Calc does not, it seems only to support Sheet scope for a name. +Thus it is often best to fully qualify the name for the region or area something like this; +
++This does open a further, interesting opportunity however and that is to place all of the data for the validation(s) into named ranges of cells on a hidden sheet within the workbook. These ranges can then be explicitly identified in the setRefersToFormula() method argument. +
++The classes within the ss.usermodel package allow developers to create code that can be used +to generate both binary (.xls) and SpreadsheetML (.xlsx) workbooks. +
++The techniques used to create data validations share much in common with the xssf.usermodel examples above. +As a result just one or two examples will be presented here. +
+Check the value the user enters into a cell against one or more predefined value(s).
+Drop Down Lists:
+ +This code will do the same but offer the user a drop down list to select a value from.
+ +Prompts and Error Messages:
++These both exactly mirror the hssf.usermodel so please refer to the 'Messages On Error:' and 'Prompts:' sections above. +
++As the differences between the ss.usermodel and xssf.usermodel examples are small - +restricted largely to the way the DataValidationHelper is obtained, the lack of any +need to explicitly cast data types and the small difference in behaviour between +the hssf and xssf interpretation of the setSuppressDropDownArrow() method, +no further examples will be included in this section. +
+Advanced Data Validations.
+Dependent Drop Down Lists.
++In some cases, it may be necessary to present to the user a sheet which contains more than one drop down list. +Further, the choice the user makes in one drop down list may affect the options that are presented to them in +the second or subsequent drop down lists. One technique that may be used to implement this behaviour will now be explained. +
++There are two keys to the technique; one is to use named areas or regions of cells to hold the data for the drop down lists, +the second is to use the INDIRECT() function to convert between the name and the actual addresses of the cells. +In the example section there is a complete working example - called LinkedDropDownLists.java - +that demonstrates how to create linked or dependent drop down lists. Only the more relevant points are explained here. +
++To create two drop down lists where the options shown in the second depend upon the selection made in the first, +begin by creating a named region of cells to hold all of the data for populating the first drop down list. +Next, create a data validation that will look to this named area for its data, something like this; +
++Note that the name of the area - in the example above it is 'CHOICES' - +is simply passed to the createFormulaListConstraint() method. This is sufficient +to cause Excel to populate the drop down list with data from that named region. +
++Next, for each of the options the user could select in the first drop down list, +create a matching named region of cells. The name of that region should match the +text the user could select in the first drop down list. Note, in the example, +all upper case letters are used in the names of the regions of cells. +
+ ++Now, very similar code can be used to create a second, linked, drop down list; +
+ ++The key here is in the following Excel function - INDIRECT(UPPER($A$1)) - which is used to populate the second, +linked, drop down list. Working from the inner-most pair of brackets, it instructs Excel to look +at the contents of cell A1, to convert what it reads there into upper case – as upper case letters are used +in the names of each region - and then convert this name into the addresses of those cells that contain +the data to populate another drop down list. +
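The lookup that INDIRECT(UPPER($A$1)) performs can be modelled in plain Java: upper-case the text selected in the first list, then use it as the name of the region holding the second list's options. The sketch below is only an analogy for what Excel does; the class name, the map standing in for named regions, and the region names FRUIT and VEGETABLE are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Analogy for INDIRECT(UPPER($A$1)): resolve a named region from the first list's selection. */
class DependentListSketch {

    static List<String> secondListOptions(Map<String, List<String>> namedRegions, String a1Value) {
        // UPPER(...): region names are all upper case, so normalise the selection first.
        String regionName = a1Value.toUpperCase(Locale.ROOT);
        // INDIRECT(...): turn the name into the cells (here, the values) it refers to.
        List<String> options = namedRegions.get(regionName);
        if (options == null) {
            return Collections.emptyList();
        }
        return options;
    }

    public static void main(String[] args) {
        Map<String, List<String>> namedRegions = new HashMap<>();
        namedRegions.put("FRUIT", Arrays.asList("Apple", "Pear"));        // hypothetical region
        namedRegions.put("VEGETABLE", Arrays.asList("Carrot", "Leek"));   // hypothetical region
        System.out.println(secondListOptions(namedRegions, "Fruit"));
    }
}
```

This is also why the names of the regions must match, in upper case, the text offered in the first drop down list.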
+It is possible to perform more detailed processing of an embedded Excel, Word or PowerPoint document, + or to work with any other type of embedded object.
+HSSF:
+XSSF:
+(Since POI-3.7)
+See more examples on Excel conditional formatting in + ConditionalFormats.java +
+ ++ Using Excel, it is possible to hide a row on a worksheet by selecting that row (or rows), + right clicking once on the right hand mouse button and selecting 'Hide' from the pop-up menu that appears. +
++ To emulate this using POI, simply call the setZeroHeight() method on an instance of either + XSSFRow or HSSFRow (the method is defined on the ss.usermodel.Row interface that both classes implement), like this: +
++ If the file were saved away to disc now, then the first row on the first sheet would not be visible. +
++ Using Excel, it is possible to unhide previously hidden rows by selecting the row above and the row below + the one that is hidden, then pressing and holding down the Ctrl and Shift keys and pressing + the number 9 before releasing them all. +
++ To emulate this behaviour using POI do something like this: +
++ If the file were saved away to disc now, any previously hidden rows on the first sheet of the workbook would now be visible. +
++ The example illustrates two features. Firstly, that it is possible to unhide a row simply by calling the setZeroHeight() + method and passing the boolean value 'false'. Secondly, it illustrates how to test whether a row is hidden or not. + Simply call the getZeroHeight() method and it will return 'true' if the row is hidden, 'false' otherwise. +
++ Sometimes it is easier or more efficient to create a spreadsheet with basic styles and then apply special styles to certain cells + such as drawing borders around a range of cells or setting fills for a region. CellUtil.setCellProperties lets you do that without creating + a bunch of unnecessary intermediate styles in your spreadsheet. +
++ Properties are created as a Map and applied to a cell in the following manner. +
++ NOTE: This does not replace the properties of the cell, it merges the properties you have put into the Map with the + cell's existing style properties. If a property already exists, it is replaced with the new property. If a property does not + exist, it is added. This method will not remove CellStyle properties. +
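The merge semantics in this note behave like overlaying one Map on another. The following sketch shows that behaviour with ordinary Java maps only; it does not call CellUtil, and the property keys used are hypothetical strings rather than POI constants.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrates "merge, don't replace" semantics with plain maps (no POI classes involved). */
class PropertyMergeSketch {

    static Map<String, Object> merge(Map<String, Object> existingStyle, Map<String, Object> requested) {
        // Start from the existing properties so that nothing is removed...
        Map<String, Object> merged = new HashMap<>(existingStyle);
        // ...then add new properties and overwrite the ones that are already present.
        merged.putAll(requested);
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Object> existing = new HashMap<>();
        existing.put("wrapText", true);          // hypothetical key
        existing.put("borderTop", "THIN");       // hypothetical key

        Map<String, Object> requested = new HashMap<>();
        requested.put("borderTop", "MEDIUM");    // replaces the existing value
        requested.put("borderBottom", "MEDIUM"); // added

        System.out.println(merge(existing, requested));
    }
}
```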
++ In Excel, you can apply a set of borders on an entire workbook region at the press of a button. The PropertyTemplate + object simulates this with methods and constants defined to allow drawing top, bottom, left, right, horizontal, + vertical, inside, outside, or all borders around a range of cells. Additional methods allow for applying colors + to the borders. +
++ It works like this: you create a PropertyTemplate object which is a container for the borders you wish to apply to a + sheet. Then you add borders and colors to the PropertyTemplate, and finally apply it to whichever sheets you need + that set of borders on. You can create multiple PropertyTemplate objects and apply them to a single sheet, or you can + apply the same PropertyTemplate object to multiple sheets. It is just like a preprinted form. +
++ Enums: +
++ NOTE: The last pt.drawBorders() call removes the borders from the range by using BorderStyle.NONE. Like + setCellStyleProperties, the applyBorders method merges the properties of a cell style, so existing borders + are changed only if they are replaced by something else, or removed only if they are replaced by + BorderStyle.NONE. To remove a color from a border, use IndexedColors.AUTOMATIC.getIndex(). +
+Additionally, to remove a border or color from the PropertyTemplate object, use BorderExtent.NONE.
++ This does not work with diagonal borders yet. +
++ Pivot Tables are a powerful feature of spreadsheet files. You can create a pivot table with the following piece of code. +
++ To apply a single set of text formatting (colour, style, font etc) + to a cell, you should create a + CellStyle + for the workbook, then apply to the cells. +
++ To apply different formatting to different parts of a cell, you + need to use + RichTextString, + which permits styling of parts of the text within the cell. +
++ There are some slight differences between HSSF and XSSF, especially + around font colours (the two formats store colours quite differently + internally), refer to the + HSSF Rich Text String + and + XSSF Rich Text String + javadocs for more details. +
++ The record generator was born from frustration with translating + the Excel records to Java classes. Doing this manually is a time + consuming process. It's also very easy to make mistakes. +
++ A utility was needed to take the definition of what a + record looked like and do all the boring and repetitive work. +
++ The record generator takes XML as input and produces the following + output: +
+
+ The record generator is invoked as an Ant target
+ (generate-records). It goes through looking for all files in
+ src/records/definitions ending with _record.xml.
+ It then creates two files; the Java record definition and the
+ Java test case template.
+
+ The records themselves have the following general layout: +
++ The following table details the allowable types and sizes for + the fields. +
| Type | Size | Java Type |
|---|---|---|
| int | 1 | byte |
| int | 2 | short |
| int | 4 | int |
| int | 8 | long |
| int | varword | array of shorts |
| bits | 1 | A byte comprising bits (defined by the bit element) |
| bits | 2 | A short comprising bits |
| bits | 4 | An int comprising bits |
| float | 8 | double |
| hbstring | java expression | String |
+ The Java records are regenerated each time the record generator is + run, however the test stubs are only created if the test stub does + not already exist. What this means is that you may change test + stubs but not the generated records. +
++ Occasionally the builtin types are not enough. More control + over the encoding and decoding of the streams is required. This + can be achieved using a custom type. +
++ A custom type lets you escape to java to define the way in which + the field encodes and decodes. To code a custom type you + declare your field like this: +
+
+ Where the class name specified after custom: is a
+ class implementing the interface CustomField.
+
+ You can then implement the encoding yourself. +
++ The record generation works by taking an XML file and styling it + using XSLT. Given that XSLT is a little limited in some ways it was + necessary to add a little Java code to the mix. +
++ See record.xsl, record_test.xsl, FieldIterator.java, + RecordUtil.java, RecordGenerator.java +
++ There is a corresponding "type" generator for HWPF. + See the HWPF documentation for details. +
++ The record generator does not handle all possible record types and + is not intended to perform this function. When dealing with a + non-standard record sometimes the cost-benefit of coding the + record by hand will be greater than attempting to modify the + generator. The main point of the record generator is to save + time, so keep that in mind. +
++ Currently the XSL file that generates the record calls out to + Java objects. The Java code for the record generation is + currently quite messy with minimal comments. +
+Primary Actor: HSSF client
+Scope: HSSF
+Level: Summary
+Stakeholders and Interests:
+Precondition: None
+Minimal Guarantee: None
+Main Success Guarantee:
+Extensions:
+2a. Exceptions +thrown by POIFS will be passed on to the HSSF client.
+Primary Actor: HSSF client
+Scope: HSSF
+Level: Summary
+Stakeholders and Interests:
+Precondition:
+Minimal Guarantee: None
+Main Success Guarantee:
+Extensions:
+3a. Exceptions +from POIFS are passed to the HSSF client.
+ +Primary Actor: HSSF client
+Scope: HSSF
++Level: Summary
+Stakeholders and Interests:
+Precondition:
+Minimal Guarantee: None
+Main Success Guarantee:
+Extensions: +None
+ +Primary Actor: HSSF
+Scope: HSSF
++Level: Summary
+Stakeholders and Interests:
+Precondition:
+Minimal +Guarantee: None
+Main Success Guarantee:
+Extensions:
+3a. Exceptions +thrown by POIFS will be passed on
+Primary Actor: HSSF
+Scope: HSSF
++Level: Summary
+Stakeholders and Interests:
+Precondition: +
+Minimal Guarantee: None
+Main Success Guarantee:
+Extensions:None
+This document describes the User Defined Functions within POI. + User defined functions allow you to take code that is written in VBA + and re-write in Java and use within POI. Consider the following example.
+Suppose you are given a spreadsheet that can calculate the principal and interest + payments for a mortgage. The user enters the principal loan amount, the interest rate + and the term of the loan. The Excel spreadsheet does the rest.
+
+
+
When you actually look at the workbook you discover that rather than having + the formula in a cell it has been written as a VBA function. You review the + function and determine that it could be written in Java:
+
+
+
If we write a small program to try to evaluate this cell, we'll fail. Consider this source code:
+If you run this code, you're likely to get the following error:
+ +How would we make it so POI can use this sheet?
+To 'convert' this code to Java and make it available to POI you need to implement + a FreeRefFunction instance. FreeRefFunction is an interface in the org.apache.poi.ss.formula.functions + package. This interface defines one method, evaluate(ValueEval[] args, OperationEvaluationContext ec), + which is how you will receive the argument values from POI.
+The evaluate() method as defined above is where you will convert the ValueEval instances to the + proper number types. The following code snippet shows you how to get your values:
+ +The first thing we do is check the number of arguments being passed since there is no sense + in attempting to go further if you are missing critical information.
+Next we declare our variables, in our case we need variables for:
+Next, we use the OperandResolver to convert the ValueEval instances to doubles, though not directly. + First we start by getting discrete values. Using the OperandResolver.getSingleValue() method + we retrieve each of the values passed in by the cell in the spreadsheet. Next, we use the + OperandResolver again to convert the ValueEval instances to doubles, in this case. This + class has other methods of coercion for getting Strings, ints and booleans. Now that we've + got our primitive values we can move on to calculating the value.
+As shown previously, we have the VBA source. We need to add code to our class to calculate + the payment. To do this you could simply add it to the method we've already created but I've + chosen to add it as its own method. Add the following method:
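The listing for this method is not reproduced here. As a minimal sketch, assuming the VBA function implements the standard fixed-rate annuity formula with monthly payments, such a method could look like the following; the class name, method name and parameter names are hypothetical, and the annual interest rate is assumed to be given as a fraction (0.06 for 6%).

```java
/** Sketch of a monthly mortgage payment calculation, assuming the standard annuity formula. */
class MortgagePaymentSketch {

    /**
     * @param principal  the amount borrowed
     * @param annualRate the yearly interest rate as a fraction, e.g. 0.06 for 6%
     * @param years      the term of the loan in years
     */
    static double calculateMortgagePayment(double principal, double annualRate, double years) {
        double monthlyRate = annualRate / 12;   // interest per payment period
        double payments = years * 12;           // total number of monthly payments
        // Java has no exponent operator, so (1 + i)^n becomes Math.pow(1 + i, n).
        double growth = Math.pow(1 + monthlyRate, payments);
        return principal * (monthlyRate * growth) / (growth - 1);
    }

    public static void main(String[] args) {
        System.out.printf("%.2f%n", calculateMortgagePayment(100000, 0.06, 30));
    }
}
```

Note that a rate of zero makes the expression 0/0, which evaluates to NaN in Java rather than throwing an exception, so the result still needs to be checked before it is returned.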
+The biggest change necessary is related to the exponents; Java doesn't have a notation for this + so we had to add calls to Math.pow(). Now we need to add this call to our previous method:
+Having done that, the last things we need to do are to check to make sure we didn't get a bad result and, + if not, we need to return the value. Add the following code to the class:
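A check of this kind can be sketched in plain Java as shown below. This is only an illustration of the idea: it throws a standard ArithmeticException as a stand-in, whereas a real FreeRefFunction would report the problem through POI's own error handling; the class and method names are hypothetical.

```java
/** Sketch of a "bad result" check; ArithmeticException is a stand-in for POI's error reporting. */
class ResultCheckSketch {

    /** Rejects results that are not usable numbers (NaN or infinite). */
    static void checkValue(double result) {
        if (Double.isNaN(result) || Double.isInfinite(result)) {
            throw new ArithmeticException("calculation did not produce a finite number: " + result);
        }
    }

    public static void main(String[] args) {
        checkValue(599.55);                 // passes silently
        try {
            checkValue(0.0 / 0.0);          // NaN, e.g. from a zero interest rate
        } catch (ArithmeticException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```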
+Then add a line of code to our evaluate method to call this new static method, complete our try/catch and return the value:
+So the whole class would be as follows:
+ +Great! Now we need to go back to our original program that failed to evaluate our cell and add code that will allow it to run our new Java code.
+ +Now we need to register our function in the Workbook, so that the Formula Evaluator can resolve the name "calculatePayment" +and map it to the actual implementation (CalculateMortgage). This is done using the UDFFinder object. +The UDFFinder manages FreeRefFunctions which are our analogy for the VBA code. We need to create a UDFFinder. There are + a few things we need to know in order to do this:
+UDFFinder is actually an interface, so we need to use an actual implementation of this interface. Therefore we use the org.apache.poi.ss.formula.udf.DefaultUDFFinder class. If you refer to the Javadocs you'll see that this class expects to get two arrays, one + containing the alias and the other containing an instance of the class that will represent that alias. In our case our alias will be calculatePayment + and our class instance will be of the CalculateMortgage type. This class needs to be available at compile and runtime. Be sure to keep these arrays + well organized because you'll run into problems if these arrays are of different sizes or the alias aren't in the same relative position in their respective + arrays. Add the following code:
+Now we have our UDFFinder instance and we've created the AggregatingUDFFinder instance. The last step is to pass this to our Workbook:
+ +So now the whole class will look like this:
+Now that our evaluator is aware of the UDFFinder which in turn is aware of our FreeRefFunction, we're ready to re-run our example:
+which prints the following output in the console:
+That is it! Now you can create Java code and register it, allowing your POI based appliction to run spreadsheets that previously were inaccessible.
+This example can be found in the poi-examples/src/main/java/org/apache/poi/examples/ss/formula folder in the source.
++ Any information in here that might be perceived as legal information is + informational only. We're not lawyers, so consult a legal professional + if needed. +
++ The POI project is OpenSource + and developed/distributed under the + Apache Software License v2. Unlike some other licenses, the Apache + license allows free open source development. Unlike some other Open Source + licenses, it does not require you to release your source or use any + particular license for your code which builds on top of it. (There are a + handful of restrictions, especially around attribution, notices and trademarks, + so it's worth a read of the license - it isn't scary!). If you wish to + contribute to Apache POI (which you're very welcome and encouraged to do so), + then you must agree to grant your contributions to us under the same license. +
+There are a lot of open issues in Bugzilla and TODOs in the code. Please see + the section below for more on these. Get in touch using our mailing lists if you want + to volunteer.
+The Apache Contributors Tech Guide gives a good overview how to start contributing patches.
+ +The Nutch project also have a very useful guide on becoming a + new developer in their project. While it is written for their project, + a large part of it will apply to POI too. You can read it at + http://wiki.apache.org/nutch/Becoming_A_Nutch_Developer. The + Apache Community Development + Project also provides guidance and mentoring for new contributors.
++ If you use GitHub, you can submit Pull Requests to https://github.com/apache/poi. It is probably + a good idea to create an issue in the Bug Database + first and reference it in the PR. +
++ For Subversion fans, you can add patch files to the Bugzilla issues at + Bug Database. + If there is already a bug-report, attach it there, otherwise create a new bug, + set the subject to [PATCH] followed by a brief description. + Explain you patch and any special instructions and submit/save it. + Next, go back to the bug, and create attachments for the patch files you + created. Be sure to describe not only the files purpose, but its format. + (Is that ZIP or a tgz or a bz2 or what?). +
++ Ideally, patches should be submitted early and often. This is for + two key reasons. Firstly, it's much easier to review smaller patches + than large ones. This means that smaller patches are much more likely + to be applied to SVN in a timely fashion. Secondly, by sending in your + patches earlier rather than later, it's much easier to get feedback + on your coding and direction. If you've missed an easier way to do something, + or are duplicating some (probably hidden) existing code, or taking things + in an unusual direction, it's best to get the feedback sooner rather than + later! As such, when submitting patches to POI, as with other Apache + Software Foundation projects, do please try to submit early and often, rather + than "throwing a large patch over the wall" at the end. +
++ A number of Apache projects provide far more comprehensive guides to producing + and submitting patches than we do, you may wish to review some of their + information if you're unsure. The + Apache Commons one + is fairly similar as a starting point. +
+You may create your patch file using either of the following approaches (the committers recommend the first):
+Use Ant to generate a patch file to POI:
+
+ This will create a file named patch.tar.gz that will contain a unified diff of files that have been modified
+ and also include files that have been added. Review the file for completeness and correctness. This approach
      is recommended because it standardizes the way in which patch files are constructed. It also eliminates the
      chance of forgetting to submit new files that form part of the patch.
+
+ To apply a previously generated patch.tar.gz file to a clean subversion checkout, use the following command.
+ It will unpack the tarball and add new files to the subversion working copy.
+
      Patches to existing files should be generated with svn diff filename, saving the output to a file.
      If you want to get the changes made to multiple files in a directory, just use svn diff.
      Then tar and gzip the patch file as well as any new files that you have added.
+
If you use a Unix shell, you may find the following sequence of commands useful for building the files to attach.</p>
++ If you are working on a Git clone of Apache POI (see the + Version Control page for + more on the read-only Git mirrors), it is possible to generate + a patch of your changes (including new binary files) using Git. +
++ For new developers, we'd normally suggest using Subversion and + one of the methods above, as they tend to be simpler. For people + who are already proficient with Git, then generating a patch + from Git can be an easy way to contribute! +
++ When generating a patch / patch set from Git, for many related and + small changes a squashed patch is probably best, as it makes the + (manual) review quicker. For larger changes, several distinct + patches are probably best. +
++ If you intend to do a noticeable amount of work enhancing Apache POI + on your own Git repo, we would suggest sending in patches early and + asking for advice. There's nothing worse than spending a week working + hard on your own on a change, only to discover you did something on + Day 1 that isn't acceptable to the project meaning your whole patch + needs re-doing... Git's offline workflow makes this easier, so try not + to fall into that trap! +
+@author tags.@Disabled from org.junit for in-progress work).svn diff.The long standing + Minimal + Coding Standards from 2002 still largely apply to the project.
+When making changes to an existing file, please try to follow the + same style that that file already uses. This will keep things + looking similar, and will prevent patches becoming largely about + whitespace. Whitespace fixing changes, if needed, should normally be + in their own commit, so that they don't crowd out coding changes + in review.
+Normally, tabs should not be used to indent code. Instead, spaces + should be used. If starting on a fresh file, please use 4 spaces to + indent your code. If working on an existing file, please use + whichever of 3 or 4 spaces that file already follows.
+Normally, braces should open on the same line as the decision + statement. Braces should normally close on their own line. Brackets + should normally have a space before them when they are the first.
+Lines normally shouldn't be too long. There's no hard and fast rule, + but if you line is getting above about 90 characters think about + splitting it, and you should rarely create something over about 100 + characters without a very good reason!
+The POI project will generally offer committership to contributors who send + in consistently good patches over a period of several months.
+The requirement for "good patches" generally means patches which can be applied + to SVN with little or no changes. These patches should include unit test, and + appropriate documentation. Whilst your first patch to POI may require quite a + bit of work before it can be committed by an existing committer, with any luck + your later patches will be applied with no / minor tweaks. Please do take note + of any changes required by your earlier patches, to learn for later ones! If + in doubt, ask on the dev mailing list.
+The requirement for patches over several months is to ensure that committers + remain with the project. It's very easy for a good developer to fire off half + a dozen good patches in the couple of weeks that they're working on a POI + powered project. However, if that developer then moves away, and stops + contributing to POI after that spurt, then they're not a good candidate for + committership. As such, we generally require people to stay around for a while, + submitting patches and helping on the mailing list before considering them + for committership.
+Where possible, patches should be submitted early and often. For more details + on this, please see the "Submitting Patches" section above.
+ +Where possible, the existing developers will try to help and mentor new + contributors. However, everyone involved in POI is a volunteer, and it may + happen that your first few patches come in at a time when all the committers + are very busy. Do please have patience, and remember to use the + dev mailing list so that other + contributors can assist you!
+For more information on getting started at Apache, mentoring, and local + Apache Committers near you who can offer advice, please see the + Apache Community Development + Project website.
++ In early 2008, Microsoft made a fairly complete set of documentation + on the binary file formats freely and publicly available. These were + released under the + Open Specification Promise, which does allow us to use them for + building open source software under the + Apache Software License. +
++ You can download the documentation on Excel, Word, PowerPoint and + Escher (drawing) from + http://msdn.microsoft.com/en-us/library/cc313118.aspx. + Documentation on a few of the supporting technologies used in these + file formats can be downloaded from + http://msdn.microsoft.com/en-us/library/jj633110.aspx. +
++ For the VSDX format (implemented in Apache POI as XDGF), an + introduction + is available from Microsoft, and full details are available + here + and + here. +
++ Previously, Microsoft published a book on the Excel 97 file format. + It can still be of plenty of use, and is handy dead tree form. Pick up + a copy of "Excel 97 Developer's Kit" from your favourite second hand + book store. +
++ The newer Office Open XML (ooxml) file formats are documented as part + of the ECMA / ISO standardisation effort for the formats. This + documentation is quite large, but you can normally find the bit you + need without too much effort! This can be downloaded from + https://ecma-international.org/publications-and-standards/standards/ecma-376/, + and is also under the + OSP. +
++ Additionally for the newer Office Open XML (ooxml) file formats, you can + find some good introductary documentation (often clearer for getting + started with) at officeopenxml.com, + which is an independent site documenting the file formats. +
++ It is also worth checking the documentation and code of the other + open source implementations of the file formats. +
++ In short, stay away, stay far far away. Implementing these file formats + in POI is done strictly by using public information. Most of this Public + Information currently comes from the documentation that Microsoft + makes freely available (see above). The rest of the public information + includes sources from other open source projects, books that state the + purpose intended is for allowing implementation of the file format and + do not require any non-disclosure agreement and just hard work. + We are intent on keeping it legal, by contributing patches you agree to + do the same. +
++ If you've ever received information regarding the OLE 2 Compound Document + Format under any type of exclusionary agreement from Microsoft, or + received such information from a person bound by such an agreement, you + cannot participate in this project. Sorry. Well, unless you can persuade + Microsoft to release you from the terms of the NDA on the grounds that + most of the information is now publicly available. However, if you have + been party to a Microsoft NDA, you will need to get clearance from Microsoft + before contributing. +
++ Those submitting patches that show insight into the file format may be + asked to state explicitly that they have only ever read the publicly + available file format information, and not any received under an NDA + or similar, and have only made us of the public documentation. +
+The change log for the current release can be found in the home section.
+The change log for the current release can be found in the home section.
+Refer to the explanation on Wikipedia + for some folklore about how the name "POI" came into existence. +
+The POI project was dreamed up back around April 2001, when + Andrew Oliver landed a short term contract to do Java-based + reporting to Excel. He'd done this project a few times before + and knew right where to look for the tools he needed. + Ironically, the API he used to use had skyrocketed from around + $300 ($US) to around $10K ($US). He figured it would take two + people around six months to write an Excel port so he + recommended the client fork out the $10K. +
+ +Around June 2001, Andrew started thinking how great it would + be to have an open source Java tool to do this and, while he + had some spare time, he started on the project and learned + about OLE 2 Compound Document Format. After hitting some real + stumpers he realized he'd need help. He posted a message to + his local Java User's Group (JUG) and asked if anyone else + would be interested. He lucked out and the most talented Java + programmer he'd ever met, Marc Johnson, joined the project. He + ran rings around Andrew at porting OLE 2 CDF and rewrote his + skeletal code into a more sophisticated library. It took Marc + a few iterations to get something they were happy with. +
+<p>While Marc worked on that, Andrew ported XLS to Java, based on Marc's library. Several users wrote in asking to read XLS (not just write as had originally been planned) and one user had special requests for a different use for POIFS. Before long, the project scope had tripled. POI 1.0 was released a month later than planned, but with far more features. Marc wrote the serializer framework and HSSF Serializer in record time and Andrew banged out more documentation and worked on making people aware of the project.</p>
+<p>Shortly before the release, POI was fortunate to come into contact with Nicola -Ken- Barrozzi who gave them samples for the HSSF Serializer and helped uncover its unfortunate bugs (which were promptly fixed). More recently, Ken ported most of the POI project documentation to XML from the crappy HTML docs Andrew wrote with Star Office.</p>
+ +Around the same time as the release, Glen Stampoultzis + joined the project. Glen was ticked off at Andrew's flippant attitude + towards adding graphing to HSSF. Glen got so ticked off he decided to + grab a hammer and do it himself. Glen has already become an integral + part of the POI development community; his contributions to HSSF have + already started making waves. +
+ +Somewhere in there we decided to finally submit the project + to The Apache + Cocoon Project, only to discover the project had + outgrown fitting nicely into just Cocoon long ago. + Furthermore, Andrew started eyeing other projects he'd like to + see POI functionality added to. So it was decided to donate + the Serializers and Generators to Cocoon, other POI + integration components to other projects, and the POI APIs + would become part of Jakarta. It was a bumpy road but it + looks like everything turned out since you're reading this! +
+<p>In early 2007, we graduated from <a href="https://jakarta.apache.org">Jakarta</a>, and became our own Top Level Project (TLP) within Apache.</p>
++ POI 4.0 and later require JDK version 1.8 or later. JDK version 11 is required to compile module support. +
++ POI 3.11 and later 3.x versions require JDK version 1.6 or later. +
++ POI 3.5 to 3.10 required the JDK version 1.5 or later. + Versions prior to 3.5 required JDK 1.4+. +
++ The POI build system requires + Apache Forrest + to build the documentation. +
++ Specifically, the build has been tested to work with Forrest 0.9. When building with Forrest, + it is recommended to use Java 8. +
++ Remember to set the FORREST_HOME environment variable. +
++ The main Apache POI build was traditionally done with Apache Ant. + In 2021, we moved to using Gradle. + After checking out the POI code, you will find gradlew and + gradlew.bat. These command files are used for running Gradle on Linux/Mac and Windows respectively. + Gradlew checks if you the right version of Gradle installed and will install it if you don't. +
++ Note that our source releases no longer contain gradlew or gradlew.bat. You can install the Gradle tool + yourself and use it to build POI. +
++ The main targets of interest to our users are: +
+| Gradle Target | +Description | +
|---|---|
| clean | +Erase all build work products (ie. everything in the + build directory | +
| test | +Run all unit tests from main, ooxml and scratchpad | +
| jar | +Produce jar files | +
| jenkins | ++ Runs the tests which Jenkins, our Continuous Integration system, does. This includes the unit tests and various code quality checks. + Also, packages up the jars and build distributions. + | +
+ To run the tests from just one test class, use a command like: +
++ ./gradlew poi-ooxml:test --tests *TestXSSFBugs +
++ gradlew poi-ooxml:test --tests *TestXSSFBugs +
++ The example command runs tests in the poi-ooxml sub-project that match the name '*TestXSSFBugs'. + The '*' wildcard is useful to avoid typing the full Java package name. +
++ Apache POI no longer includes a pre-defined Eclipse project file. When importing the POI project, + your IDE should recognise that there is Gradle support and offer to do the build using that. +
++ First make sure that Java is set up properly and that you can execute the 'javac' executable in your shell. +
++ Next, open Eclipse and create either a local SVN repository, or a copy of the Git repository, + and import the project into Eclipse. +
++ Note: when executing junit tests from within Eclipse, you might need to set the system + property "POI.testdata.path" to the actual location of the 'test-data' directory to make + the test framework find the required test-files. A simple value of 'test-data' usually works. +
++ Import the Gradle project into your IDE. Execute a build to get all the dependencies and generated code + in place. +
++ Note: when executing junit tests from within IntelliJ, you might need to set the system + property "POI.testdata.path" to the actual location of the 'test-data' directory to make + the test framework find the required test-files. A simple value of 'test-data' usually works. +
+Linux: + help.ubuntu.com, + unix.stackexchange.com +
+Windows: + en.wikipedia.org +
+The POI nightly builds are run on the Jenkins
+ continuous integration server.
+ These builds should not be used in production: they are mostly intended for use by
+ developers to help with resolving bugs and evaluating new features or users who want to try out the
+ latest version.
+
This is a collection of notes to assist with long-term planning and + development. +
+ +There is much discussion of issues and research topics (RT) threads on
+ the dev mailing list (and elsewhere). However, details
+ get lost in the sheer volume. This is the place to document the summary of
+ discussions on some key topics. Some new and complex capabilities will take
+ lots of design and specification before they can be implemented.
+
Another use for this collection of notes is as a place to quickly store + a snippet from an email discussion or even a link to a discussion thread. + The concepts can then be fleshed-out over time. +
+ +Anyone can participate in this process. Please get involved in discussion
+ on dev and contribute patches for these summary planning
+ documents via the normal contribution
+ process.
+
These planning documents are intended to be concise notes only. They are + also ever-evolving, because as issues are addressed these notes will be + revised. +
++ (21-Jan-02) While this document is just full of useful project + introductory information and I do suggest those interested in getting + involved in the project read it, it is woefully out of date. +
++ We deliberately allowed this document to run out of date because it + is a good reflection of what the original vision was for POI 1.0. + You'll note that some of the terminology is not used in quite the same + way any longer. I've made some minor corrections where reading this + confused me. An example: in some places this document may refer to + POI API instead of POIFS API. When this vision was written we had + an incomplete understanding of the project. +
++ Lastly, the scope of the project expanded dramatically near the end + of the 1.0 cycle. Our vision at the time was to focus merely on the + Excel port (having no idea how the project would grow or be received) + and provide the OLE 2 Compound Document port for others to port later + formats. We now plan to spearhead these ports under the umbrella of + the POI project. So, you've been warned. Read on, but just realize + that we had a fuzzy view of things to come, and hindsight is 20-20. +
++ If I recall major holes were: a complete understanding of the format + of OLE 2 Compound Document format, Excel file format, and exactly how + Cocoon 2 Serializers worked. (that just about covers the whole range + huh?) +
++ The purpose of this document is to + collect, analyze and define high-level requirements, user needs and + features of the HSSF Serializer for Cocoon 2 and related libraries. + The HSSF Serializer is a java class supporting the Serializer + interface from the Cocoon 2 project and outputting in a compatible + format of that used by the spreadsheet program Microsoft Excel '97. + The HSSF Serializer will be responsible for converting XML + spreadsheet-like documents into Excel-compatible XLS spreadsheets. +
++ Many web apps today hit a brick wall + when it comes to the user request that they be able to easily + manipulate their reports and data extracts in the popular Microsoft + Excel spreadsheet format. This often causes inferior technologies to be + chosen for the project simply because they easily support this + format. This project seeks to extend existing XML, Java and Apache + Cocoon 2 project technologies by: +
+<p>There are a number of enthusiastic users of XML, UNIX and Java technology. Furthermore, the Microsoft solution for outputting Office Document formats often involves actually manipulating the software as an OLE Server. This method provides extremely low performance, extremely high overhead and is only capable of handling one document at a time.</p>
++ The users of this software shall be + developers in a Java environment on any Operating System or power + users who are capable of XML document generation/deployment. +
++ The OLE 2 Compound Document format is + undocumented for all practical purposes and cryptic for all + impractical purposes. Developer needs in this area include + documentation and an easy to use library for reading and writing in + this format without requiring the developer to have intimate + knowledge of the format. +
++ There is currently no good way to write + to Microsoft Excel documents from Java or from a non-Microsoft + Windows based platform for that matter. Developers need an easy to + use library that supports a reasonable feature set and allows + separation of data from formatting/stylistic concerns. +
++ There is currently no good way to + transform XML data to Microsoft Excel. Apache's Cocoon 2 project + supplies a complete framework for XML, but nothing for outputting in + Excel's XLS format. Developers and power users alike need a simple + method to output XML documents to Excel through server-side + processing. +
+ + ++ The produced code shall be licensed by + the Apache License as used by the Cocoon 2 project and maintained on + a project page until such time as the Cocoon 2 developers accept it + as a donation (at which time the copyright will be turned over to + them). +
++ For developers on a Java and/or XML + environment this project will provide all the tools necessary for + outputting XML data in the Microsoft Excel format. This project seeks + to make the use of Microsoft Windows based servers unnecessary for + file format considerations and to fully document the OLE 2 Compound + Document format. The project aims not only to provide the tools for + serializing XML to Excel's file format and the tools for writing to + that file format from Java, but also to provide the tools for later + projects to convert other OLE 2 Compound Document formats to pure + Java APIs. +
++ HSSF Serializer for Apache Cocoon 2 +
+| + Benefit + | ++ Supporting Features + | +
| + Standard XML tag language for sheet data + | ++ Serializer will transform documents utilizing a defined tag + language + | +
| + Utilize XML to output in Excel + | ++ Serializer will output in Excel + | +
| + Java API to output in Excel on any platform + | ++ The project will develop an API that outputs in Excel using + pure Java. + | +
| + Make it easy for developers to port other OLE 2 Compound + Document-based formats to Java. + | ++ The POIFS library will contain both a high-level abstraction + along with low-level constructs. The project will fully document + the OLE 2 Compound Document Format. + | +
+ The POIFS API will include: +
++ The HSSF API will include: +
++ The POI Filesystem API includes: +
++ The HSSF API includes: +
++ The HSSF Serializer subproject: +
++ All Java code will be 100% pure Java. +
++ The minimum system requirements for POIFS are: +
++ The minimum system requirements for HSSF are: +
++ The minimum system requirements for the HSSF Serializer are: +
++ All components must perform well enough + to be practical for use in a webserver environment (especially + Cocoon2/Tomcat/Apache combo) +
++ The software will run primarily in + developer environments. We should make some allowances for + not-highly-technical users to write XML documents for the HSSF + Serializer. All other components will assume intermediate Java 2 + knowledge. No XML knowledge will be required except for using the + HSSF Serializer. As much documentation as is practical shall be + required for all components as XML is relatively new, and the + concepts introduced for writing spreadsheets and to POI filesystems + will be brand new to Java and many Java developers. +
++ The filesystem as read and written by + POI shall be fully documented and explained so that the average Java + developer can understand it. +
++ The POI API will be fully documented + through Javadoc. A walkthrough of using the high level POI API shall + be provided. No documentation outside of the Javadoc shall be + provided for the low-level POI APIs. +
++ The HSSF File Format as implemented by + the HSSF API will be fully documented. No documentation will be + provided for features that are not supported by HSSF API that are + supported by the Excel 97 File Format. Care will be taken not to + infringe on any "legal stuff". +
++ The HSSF API will be documented by + javadoc. A walkthrough of using the high level HSSF API shall be + provided. No documentation outside of the Javadoc shall be provided + for the low level HSSF APIs. +
++ The HSSF Serializer will be documented + by javadoc. +
++ The XML tag language along with + function and usage shall be fully documented. Examples will be + provided as well. +
++ filesystem shall refer only to the POI formatted archive. +
++ file shall refer to the embedded data stream within a + POI filesystem. This will be the actual embedded document. +
++ This is the POI 2.0 cycle vision document. Although the vision + has not changed and this document is certainly not out of date and + the vision has not changed, the structure of the project has + changed a bit. We're not going to change the vision document to + reflect this (however proper that may be) because it would only + involve deletion. There is no purpose in providing less + information provided we give clarification. +
++ This document was created before the POI components for + Apache Cocoon + were accepted into the Cocoon project itself. It was also + written before POI was accepted into Jakarta. So while the + vision hasn't changed some of the components are actually now + part of other projects. We'll still be working on them on the + same timeline roughly (minus the overhead of coordination with + other groups), but they are no longer technically part of the + POI project itself. +
++ The purpose of this document is to + collect, analyze and define high-level requirements, user needs, + and features of the second release of the POI project software. + The POI project currently consists of the following components: + the HSSF Serializer, the HSSF library and the POIFS library. +
+By the completion of this release cycle the POI project will also + include the HSSF Generator and the HWPF library. +
++ The first release of the POI project + was an astounding success. This release seeks to build on that + success by: +
++ There are a number of enthusiastic + users of XML, UNIX and Java technology. Furthermore, the Microsoft + solution for outputting Office Document formats often involves + actually manipulating the software as an OLE Server. This method + provides extremely low performance, extremely high overhead and is + only capable of handing one document at a time. +
++ The users of this software shall be + developers in a Java environment on any operating system, or power + users who are capable of XML document generation/deployment. +
++ The HSSF library currently requires a + full object representation to be created before reading values. This + results in very high memory utilization. We need to reduce this + substantially for reading. It would be preferable to do this for + writing, but it may not be possible due to the constraints imposed by + the file format itself. Memory utilization during read is our top + user complaint. +
++ The POIFS library currently requires a + full object representation to be created before reading values. This + results in very high memory utilization. We need to reduce this + substantially for reading. +
++ The HSSF library currently ignores + formula cells and identifies them as "UnknownRecord" at the + lower level of the API. We must provide a way to read and write + formulas. This is now the top requested feature. +
++ The HSSF library currently does not support + charts. This is a key requirement of some users who wish to use HSSF + in a reporting engine. +
++ The HSSF Serializer currently does not + provide serialization for cell styling. User's will want stylish + spreadsheets to result from their XML. +
++ There is currently no way to generate + the XML from an XLS that is consistent with the format used by the + HSSF Serializer. +
++ There should be a way to read and write + the DOC file format using pure Java. +
+ ++ The produced code shall be licensed by + the Apache License as used by the Cocoon 2 project (APL 1.1) and + maintained at http://poi.sourceforge.net + and http://sourcefoge.net/projects/poi. + It is our hope to integrate at some point with the various Apache + projects (xml.apache.org and jakarta.apache.org), at which point we'd + turn the copyright over to them. +
++ For developers on a Java and/or XML + environment this project will provide all the tools necessary for + outputting XML data in the Microsoft Excel format. This project seeks + to make the use of Microsoft Windows based servers unnecessary for + file format considerations and to fully document the OLE 2 Compound + Document format. The project aims not only to provide the tools for + serializing XML to Excel and Word file formats and the tools for + writing to those file formats from Java, but also to provide the + tools for later projects to convert other OLE 2 Compound Document + formats to pure Java APIs. +
++ HSSF Serializer for Apache Cocoon 2 +
+| + Benefit + | ++ Supporting Features + | +
|---|---|
| + Ability to serialize styles from XML spreadsheets. + | ++ HSSFSerializer will support styles. + | +
| + Ability to read and write formulas in XLS files. + | ++ HSSF will support reading/writing formulas. + | +
| + Ability to output in MS Word on any platform using Java. + | ++ The project will develop an API that outputs in Word format + using pure Java. + | +
| + Enhance performance for reading and writing XLS files. + | ++ HSSF will undergo a number of performance enhancements. HSSF + will include a new event-based API for reading XLS files. POIFS + will support a new event-based API for reading OLE2 CDF files. + | +
| + Ability to generate XML from XLS files + | ++ The project will develop an HSSF Generator. + | +
| + The ability to generate charts + | ++ HSSF will provide low level support for chart records as well + as high level API support for generating charts. The ability + to read chart information will not initially be provided. + | +
+ Enhancements to the POIFS API will + include: +
++ Enhancements to the HSSF API will + include: +
++ The HSSF Generator will include: +
++ The HWPF API will include: +
++ All Java code will be 100% pure Java. +
++ The minimum system requirements for the POIFS API are: +
++ The minimum system requirements for the HSSF API are: +
++ The minimum system requirements for the HWPF API are: +
++ The minimum system requirements for the HSSF Serializer are: +
++ All components must perform well enough + to be practical for use in a webserver environment (especially + the "killer trio": Cocoon2/Tomcat/Apache combo) +
++ The software will run primarily in + developer environments. We should make some allowances for + not-highly-technical users to write XML documents for the HSSF + Serializer. All other components will assume intermediate Java 2 + knowledge. No XML knowledge will be required except for using the + HSSF Serializer. As much documentation as is practical shall be + required for all components as XML is relatively new, and the + concepts introduced for writing spreadsheets and to POI filesystems + will be brand new to Java and many Java developers. +
++ The filesystem as read and written by + POI shall be fully documented and explained so that the average Java + developer can understand it. +
++ The POI API will be fully documented + through Javadoc. A walkthrough of using the high level POI API shall + be provided. No documentation outside of the Javadoc shall be + provided for the low-level POI APIs. +
++ The HSSF File Format as implemented by + the HSSF API will be fully documented. No documentation will be + provided for features that are not supported by HSSF API that are + supported by the Excel 97 File Format. Care will be taken not to + infringe on any "legal stuff". Additionally, we are + collaborating with the fine folks at OpenOffice.org on + *free* documentation of the format. +
++ The HSSF API will be documented by + javadoc. A walkthrough of using the high level HSSF API shall be + provided. No documentation outside of the Javadoc shall be provided + for the low level HSSF APIs. +
++ The HWPF API will be documented by + javadoc. A walkthrough of using the high level HWPF API shall be + provided. No documentation outside of the Javadoc shall be provided + for the low level HWPF APIs. +
++ The HSSF Serializer will be documented + by javadoc. +
++ The HSSF Generator will be documented + by javadoc. +
++ The XML tag language along with + function and usage shall be fully documented. Examples will be + provided as well. +
++ filesystem shall refer only to the POI formatted archive. +
++ file shall refer to the embedded data stream within a + POI filesystem. This will be the actual embedded document. +
++ See How to contribute to POI. +
+ +
+ These are not necessarily deemed to be high enough quality to be included in the
+ core distribution, but they have been tested under
+ several key environments, they are provided under the same license
+ as POI, and they are included in the POI distribution under the
+ contrib/ directory.
+
+ None as yet! - although you can expect that some of the links + listed below will eventually migrate to the "contributed components" level, and + then maybe even into the main distribution. +
+Submissions of modifications + to POI which are awaiting review. Anyone can + comment on them on the dev mailing list - code reviewers are needed! + Use these at your own risk - although POI has no guarantee + either, these patches have not been reviewed, let alone accepted. +
+The other extensions listed here are not endorsed by the POI + project either - they are provided as a convenience only. They may or may not work, + they may or may not be open source, etc. +
+ +To have a link added to this table, see How to contribute + to POI.
+ +| Name and Link | +Type | +Description | +Status | +Licensing | +Contact | +
|---|
Currently we don't have any sites listed that use POI, but we're + sure they're out there. Help us change this. If you've written a site + that utilises POI let us know.
+ +Publicly available products/projects using POI include:
+POI depends on publicly available documents describing various + file formats. The list below contains links to some of them.
++ Here are the current logo submissions. Thanks to the artists! +
+
+ Contact Person: Fancy at: fancy at my-feiqi.com +
+
++ Every project in Apache has resolutions that they vote on. + Decisions are made, etc. But what happens once those decisions + are made? They are archived in the mailing list archive, never to + be read again (once they are no longer in the top 10 or so posts). So they + get discussed again and again. +
++ Rather than have that big waste of time, we have this section to + record important POI decisions. Once a decision is passed it + need only be linked to this page (either by creating a page for + it or by simply linking it to the archive messages). Wherever + possible a brief note on how many votes were cast for and against, and maybe + some background, should be posted. +
++ This section is intended mainly to keep time-wasting + discussions from taking away from what's important...developing + POI! :-D +
++ As the POI project has grown the "styles" used have become more + varied. Some see this as a bad thing, but in reality it + can be a good thing. Each can learn from the different + styles by working with different code. That being said + there are some universal "good quality" guidelines that + must be adopted on a project of any proportions. +
++ Marc Johnson authored the following resolution: +
++ On Tue, 2002-01-08 at 22:23, Marc Johnson wrote: + Standards are wonderful; everyone should have a set. + Here's what I propose for coding standards for POI WRT comments (should I + feel the need, I'll post more of these little gems): +
++ As opposed to the formerly used POI License (which was + based on the Apache Public License), now that POI is + part of Apache, use the standard Apache Software + License 2.0 header. As per standard Apache Software + Foundation policy, the full (long) version of the + header should be used. +
++ Tip: No laughing or joking allowed in conversations regarding coding + standards. + Any mail on coding standards will be treated very seriously, + and sent here with a RTFM. +
++ The motion was passed unanimously with no negative or + neutral votes. +
++ Andy didn't feel like going through his mail and sucking + out the comments.. If there is anything you feel should + be added here do it yourself ;-). +
++ Most users of the source code probably don't need to have day to + day access to the source code as it changes. Therefore most users will want + to make use of our source release + packages, which contain the complete source tree for each binary + release, suitable for browsing or debugging. These source releases + are available from our + download page. +
++ The Apache POI source code is also available as source artifacts + in the Maven Central repository, + which may be helpful for those users who make use of POI via Maven + and wish to inspect the source (eg when debugging in an IDE). +
++ For general information on connecting to the ASF Subversion + repositories, see the + version control page. +
+ +Apache POI uses Subversion as its version control system, + but also has a read-only git mirror +
+ +NOTE: When checking out a subproject using + subversion, either perform a sparse checkout or check out + the trunk or a single branch or tag to avoid filling up + your hard-disk and wasting bandwidth. +
+ +If you are not a Committer, but you want to submit patches + or even request commit privileges, please see our + Contribution Guidelines for more + information.
++ The master source repository for Apache POI is the Subversion + one listed above. To support those users and developers who prefer + to use the Git tooling, read-only access to the POI source tree is + also available via Git. The Git mirrors normally track SVN to + within a few minutes. +
++ The official read-only Git repository for Apache POI is available + from git.apache.org/ . + The Git Clone URL is: git://git.apache.org/poi.git + and Https Clone URL: https://git.apache.org/poi.git . + Please see the Git at + Apache page for more details on the service. +
++ In addition to the git.apache.org + repository, changes are also mirrored in near-realtime to GitHub. + The GitHub repository is available at + https://github.com/apache/poi . + Please note that the GitHub repository is read-only, but pull requests sent + to it will result in an email being sent to the mailing list. A Git-formatted + patch added to Bugzilla is generally preferred though, as it can be tracked + along with all the other contributions. Please see the + contribution guidelines for more + information on getting involved in the project.
++ Git provides a nice functionality "git-svn" which allows you to read the history + of a Subversion repository and convert it into a full Git repository. This + will keep information from the SVN revisions so that the Git repository can + be updated with newer revisions from Subversion, and it also allows you to push + commits from Git "upstream" into the Subversion repository. See the + + official documentation for more details. +
++ The git-svn functionality is provided as a set of sub-commands to + "git svn". To start retrieving information from SVN and create the + initial Git repository run the following command: + +
+
+ Running without --revision from:HEAD will run for a long time and will retrieve the full version history of
+ the Subversion repository. If you need more repository history, change the from revision to an
+ earlier release or omit the --revision
+ specifier altogether.
+
+ When this finishes you have a Git repository whose "master" branch
+ mirrors the SVN "trunk".
+
+ From here you can use the full power of Git, i.e. quick branching,
+ rebasing, merging, ...
+
+ See below for some common usage hints.
+
+ In order to fetch the latest SVN revisions, you need to "rebase" onto + the SVN trunk: +
++ This will fetch the latest changes from Subversion and will rebase + the master-branch onto them. +
+
+ The following command will push all changes on master back to
+ Subversion:
+
+ Note that usually all commits on master will be sent to Subversion
+ in one go, so it's similar to a "push" to another Git repository.
+
+ The dcommit may fail if there are newer revisions in Subversion, you
+ will need to run a git svn rebase first in this case.
+
+ Although you can use the full power of Git, there are a few + things that work well and some things that will get you into + trouble: +
++ You should not develop on master, rather use some branching + concept where you do work on sub-branches and only merge/cherry-pick the + changes that are ready for being sent upstream. + It seems to work better to constantly rebase changes onto the + master branch as this will keep the history clean compared to + the SVN repository and will avoid sending useless "Merge" commits to + Subversion. +
++ You can keep some changes that are only useful locally by using + two branches that are rebased onto each other. E.g. + something like the following has proven to work well: +
++ When things are ready in the workbranch do a +
+
+ to get all the finished commits onto master as preparation for pushing them upstream.
+
+ Then you can git svn dcommit to send the changes upstream
+ and a git svn rebase to get master updated with the newly
+ created SVN revisions.
+
+ Finally do the following to update both branches onto the new SVN head
+
+ Sounds like too much work? Put these steps into a small script and all
+ this will become a simple poiupdate to get all branches
+ rebased onto HEAD from Subversion.
+
+ Code quality reports for Apache POI are available on the + Apache Sonar instance. +
++ Sonar provides lots of useful numbers and statistics. In particular, + watching the project over time shows how some of the indicators evolve + and lets you see which areas need some polishing. +
++ The Apache POI Project operates on a meritocracy: the more you do, the more + responsibility you will obtain. This page lists all of the people who have + gone the extra mile and are Committers. If you would like to get involved, + the first step is to join the mailing lists. +
+ ++ We ask that you please do not send us emails privately asking for support. + We are non-paid volunteers who help out with the project and we do not + necessarily have the time or energy to help people on an individual basis. + The mailing lists have many individuals + who will help answer detailed requests for help. The benefit of + using mailing lists over private communication is that they are a shared + resource where others can also learn from common questions. +
++ POI Developers count on feedback from the mailing lists. Many developers do take + an active role on the lists. +
+ + + + + + + + ++ So you took the time to report a bug, provided information that should make + it possible to reproduce the problem and fix it. Surely the fix is easy and + should take a seasoned developer a few minutes at max to fix! + + So why is there no progress on your bug report? Is there nobody + taking care when your problem is clearly stopping nearly everybody + from using POI? + + We know that the absence of responses on bug-reports can be frustrating, + sometimes bugs lie dormant for a long time for no apparent reason. + + Please always remember: nobody is paid to work on POI, the team is + a bunch of volunteers who look at things in their free time. + + Because of that developers might choose to work on things based on a + different priority than yours! Especially the quality and maturity of + bug reports will affect if somebody decides to look at it. + + So the best way to help a bug report see progress is to provide more information + if available or supply patches together with unit-tests. + + If you can, look at Contribution Guidelines + for more information about providing patches. +
++ This page provides instructions on how to download and verify the Apache POI release artifacts. There + are different versions available depending on how stable your code should be. +
+ ++ Apache POI releases are available under the + Apache License, Version 2.0. + See the NOTICE file contained in each release artifact for applicable copyright attribution notices. +
++ To ensure that you have downloaded the true release you should + verify the integrity + of the files using the signatures and checksums available from this page. +
+The Apache POI team is pleased to announce the release of 5.4.1. + Featured are a handful of new areas of functionality and numerous bug fixes.
+A summary of changes is available in the + Release Notes. + A full list of changes is available in the change log. + People interested should also follow the dev list + to track progress.
++ The POI source release is listed below. + Pre-built versions of all POI components + are available in the central Maven repository under Group ID "org.apache.poi" and Version + "5.4.1". +
++ POI 5.2.3 was the last version where we produced a set of poi-bin*.zip and poi-bin*.tgz files. + We will continue to publish jars to Maven Central. If you are not using a build tool like + Apache Maven or Gradle, you can still find these jars by traversing the directories at + https://repo1.maven.org/maven2/org/apache/poi/. +
++ If you want to download a legacy poi-bin archive, see the + archives of all prior releases. +
++ It is essential that you verify the integrity of the downloaded files using the PGP and SHA2 signatures. + Please read + Verifying Apache HTTP Server Releases + for more information on why you should verify our releases. This page provides detailed instructions + which you can use for POI artifacts. +
++ The PGP signatures can be verified using PGP or GPG. First download the + KEYS + file as well as the .asc signature files for the relevant release packages. Make sure you get these + files from the main distribution directory, rather than from a mirror. + Then verify the signatures. +
+Batch check of all distribution files:
+Sample verification of poi-bin-3.5-FINAL-20090928.tgz
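For the checksum half of the verification, a minimal sketch in plain Java may help when no command-line tool is at hand. It computes a SHA-2 digest of a file and compares it with the hex string taken from the published checksum file. The class and method names (`ChecksumSketch`, `digestHex`, `fileDigestHex`, `matches`) are made up for this illustration, only standard JDK classes are used, and PGP signature verification is not covered.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of Apache POI.
final class ChecksumSketch {

    /** Returns the lower-case hex digest of the given bytes, e.g. for "SHA-256" or "SHA-512". */
    static String digestHex(byte[] data, String algorithm) {
        MessageDigest md;
        try {
            md = MessageDigest.getInstance(algorithm);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalArgumentException("Unsupported digest: " + algorithm, e);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest(data)) {
            // Mask to 0..255 so negative bytes still print as two hex digits.
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }

    /** Reads the whole file into memory and digests it; fine for release-sized archives. */
    static String fileDigestHex(Path file, String algorithm) {
        try {
            return digestHex(Files.readAllBytes(file), algorithm);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Compares the computed digest with the value copied from the checksum file. */
    static boolean matches(Path file, String algorithm, String expectedHex) {
        return fileDigestHex(file, algorithm).equalsIgnoreCase(expectedHex.trim());
    }
}
```

A call such as `ChecksumSketch.matches(path, "SHA-512", expected)` should return `true` only when the download is byte-for-byte identical to the file the checksum was computed from.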
++ Apache POI became a top level project in June 2007 and POI 3.0 artifacts were re-released. Prior to that + date POI was a sub-project of + Apache Jakarta. +
+Apache POI contains support for reading a few variants of encrypted office files:
+Some "write-protected" files are encrypted with the built-in password "VelvetSweatshop"; POI can read those files too.
+| Encryption | +HSSF | +HSLF | +HWPF | +
|---|---|---|---|
| XOR obfuscation *) | +Yes (Writing since 3.16) | +N/A | +No | +
| 40-bit RC4 encryption | +Yes (Writing since 3.16) | +N/A | +Yes (since 3.17) | +
| Office Binary Document RC4 CryptoAPI Encryption | +Yes (Since 3.16) | +Yes | +Yes (since 3.17) | +
| + | XSSF | +XSLF | +XWPF | +
| Office Binary Document RC4 Encryption **) | +Yes | +Yes | +Yes | +
| ECMA-376 Standard Encryption | +Yes | +Yes | +Yes | +
| ECMA-376 Agile Encryption | +Yes | +Yes | +Yes | +
| ECMA-376 XML Signature | +Yes | +Yes | +Yes | +
*) the xor encryption is flawed and works only for very small files - see #59857. +
+ +**) the MS-OFFCRYPTO + documentation only mentions the RC4 (without CryptoAPI) encryption as a "in place" encryption, but + apparently there's also a container based method with that key generation logic. +
+As mentioned above, use + + Biff8EncryptionKey.setCurrentUserPassword(String password) + to specify the password.
+ +XML-based formats are stored in OLE-package stream "EncryptedPackage". Use org.apache.poi.poifs.crypt.Decryptor + to decode the file:
+ +If you want to read a file encrypted with the built-in password, use Decryptor.DEFAULT_PASSWORD.
+Encrypting a file is similar to the above decryption process. Basically you'll need to choose between + binaryRC4, standard and agile encryption; + the cryptoAPI mode is used internally and its direct use would result in an incomplete file. + Apart from the CipherMode, the EncryptionInfo class provides further parameters to specify the cipher and + hashing algorithm to be used.
+An Office document can be digitally signed with an XML Signature + to protect it from unauthorized modifications, i.e. modifications without having the original certificate. + The current implementation is based on the + eID Applet which + is dual-licensed under + Apache License 2.0 and LGPL v3.0. + Instead of using the internal JDK API + this version is based on Apache Santuario.
+The classes have been tested against the following libraries, which need to be included additionally to the + default dependencies:
+Depending on the configuration + and the activated facets + various XAdES levels are supported - the support for higher levels (XAdES-T+) + depends on supporting services and although the code is adopted, the integration is not well tested ... please support us on + integration (testing) with timestamp and revocation (OCSP) services. +
+Further test examples can be found in the corresponding test class.
+ +If you want to use a hash algorithm with 64 bytes (currently only applies to SHA512),
+ a base64 "feature" in xmlsec
+ leads to line breaks in the digest values, which won't be accepted by Office. To work around this, you
+ need to set the following system property:
+ -Dorg.apache.xml.security.ignoreLineBreaks=true
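Where passing a JVM flag is inconvenient, the same property can presumably be set from code. The sketch below is an assumption-laden illustration: it relies on the library reading the property when its classes are first initialized, so the call would have to run before any signing code touches xmlsec. The class name `XmlSecLineBreaks` is hypothetical.

```java
// Hypothetical helper, not part of Apache POI or Apache Santuario.
final class XmlSecLineBreaks {

    /** Property name taken from the text above. */
    static final String PROPERTY = "org.apache.xml.security.ignoreLineBreaks";

    /**
     * Sets the property programmatically. Assumption: this must happen before
     * the first xmlsec class is loaded, otherwise it may have no effect.
     */
    static void ignoreLineBreaks() {
        System.setProperty(PROPERTY, "true");
    }
}
```

Calling `XmlSecLineBreaks.ignoreLineBreaks()` as the first statement of `main` is the simplest placement. If in doubt, prefer the `-D` flag, which is applied before any class is loaded.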
When saving an OOXML document, POI creates missing relations on the fly. Therefore calling the signing method before + saving would result in an invalid signature. Instead of trying to fix all save invocations, the user is asked to first save the document + to an intermediate byte array (stream) and sign that stream instead.
+ +For security-conscious environments where data at rest must be stored encrypted, + the creation of plaintext temporary files is a grey area.
+ +The code example, written by PJ Fanning, modifies the behavior of SXSSFWorkbook + to extract an OOXML spreadsheet zipped container and write the contents to disk using AES + encryption.
+ +See SXSSFWorkbookWithCustomZipEntrySource.java + and other files + that are needed for this example.
+Finding the source of an XML signature problem can sometimes be a pain in the ... neck, because + the hashing of the canonicalized form is more or less done in the background.
+ + +One of the tripping hazards is different + linebreaks in Windows/Unix, therefore use the non-indent form of the xmls. Furthermore the + elements/ancestors containing namespace definitions and the used prefix might also differ.
+ +The next thing is to compare successfully signed documents from Office vs. POI's generated signature, + i.e. unzip both files and look for differences. Usually the package relations (*.rels) will be different, + and the sig1.xml, core.xml and [Content_Types].xml due to different order of the references.
+ +The package relationships (*.rels) will be specially handled, i.e. they will be filtered and only + a subset will be processed - see 13.2.4.24 Relationships Transform Algorithm.
+ +POI and Santuario (XmlSec) use Log4J 2.x and + SLF4J respectively for logging.
+ +You almost certainly have an older version of Apache POI + on your classpath. Quite a few runtimes and other packages + will ship older versions of Apache POI, so this is an easy problem + to hit without your realising. Some will ship just one old jar, + some may ship a full set of old POI jars.
+The best way to identify the offending earlier jar files is + with a few lines of Java. These will load a Core POI class, an + OOXML class and a Scratchpad class, and report where they all came + from.
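One way such a check could look is sketched below. It is not necessarily the snippet the project itself publishes: the class `PoiJarLocator` is hypothetical, and the three POI class names are chosen on the assumption that they have kept the same package across releases, so that an old jar would still be the one found. Classes are looked up by name, so the sketch compiles and runs even when POI is missing from the classpath.

```java
import java.net.URL;
import java.security.CodeSource;

// Hypothetical helper, not part of Apache POI.
final class PoiJarLocator {

    /** Reports the jar or directory a class was loaded from, without initializing the class. */
    static String locate(String className) {
        try {
            Class<?> type = Class.forName(className, false, PoiJarLocator.class.getClassLoader());
            CodeSource source = type.getProtectionDomain().getCodeSource();
            URL location = (source == null) ? null : source.getLocation();
            return (location == null) ? "unknown location" : location.toString();
        } catch (ClassNotFoundException | LinkageError e) {
            return "not found on classpath";
        }
    }

    public static void main(String[] args) {
        // One class per jar: core, OOXML and scratchpad.
        System.out.println("POI Core came from " + locate("org.apache.poi.poifs.filesystem.POIFSFileSystem"));
        System.out.println("POI OOXML came from " + locate("org.apache.poi.xssf.usermodel.XSSFWorkbook"));
        System.out.println("POI Scratchpad came from " + locate("org.apache.poi.hwpf.HWPFDocument"));
    }
}
```

Run it with exactly the classpath of the failing application (for example from inside the same container or application server); a location that points at an unexpected or older jar identifies the culprit.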
+You almost certainly have an older version earlier on your + classpath. See the prior answer.
+To use the new OOXML file formats, POI requires a jar containing + the file format XSDs, as compiled by + XMLBeans. These + XSDs, once compiled into Java classes, live in the + org.openxmlformats.schemas namespace.
+There are two jar files available, as described in + the components overview section. + The full jar of all of the schemas is poi-ooxml-full-XXX.jar (previously known as ooxml-schemas) + (lower versions for older releases, see table below), + and it is currently around 16mb. The smaller poi-ooxml-lite (previously known as poi-ooxml-schemas) + jar is only about 6mb. This latter jar file only contains the + typically used parts though.
+Many users choose to use the smaller poi-ooxml-lite jar to save + space. However, the poi-ooxml-lite jar only contains the XSDs and + classes that are typically used, as identified by the unit tests. + Every so often, you may try to use part of the file format which + isn't included in the minimal poi-ooxml-lite jar. In this case, + you should switch to the full poi-ooxml-full jar. Longer term, + you may also wish to submit a new unit test which uses the extra + parts of the XSDs, so that a future poi-ooxml-lite jar will + include them.
+There are a number of ways to get the full poi-ooxml-full jar. + If you are a Maven user, see the + components overview section + for the artifact details to have Maven download it for you. + If you download the source release of POI, and/or check out the + source code from subversion, + then you can run the ant task "compile-ooxml-xsds" to have the + OOXML schemas downloaded and compiled for you (This will also + give you the XMLBeans generated source code, in case you wish to + look at this). Finally, you can download the jar by hand from the + POI + Maven Repository.
+Note that historically, different versions of poi-ooxml-full / ooxml-schemas were + used
+ +| Version of ooxml-schemas | +Version of POI | +Comment | +
|---|---|---|
| ooxml-schemas-1.0.jar | +POI 3.5 and 3.6 | ++ |
| ooxml-schemas-1.1.jar | +POI 3.7 to POI 3.13 | +Generics support added, can be used with POI 3.5 and POI 3.6 as well | +
| ooxml-schemas-1.2.jar | +- | +Not released | +
| ooxml-schemas-1.3.jar | +POI 3.14 and newer | +Visio XML format support added, can be used with POI 3.7 - POI 3.13 as well | +
| ooxml-schemas-1.4.jar | +POI 4.*.* | +Provide schema for AlternateContent, can be used with previous versions of POI as well | +
| poi-ooxml-full jar | +POI 5.0.0 and newer | +Upgrade to ECMA-376 5th edition - which is not downward compatible | +
You've probably enabled logging. Logging is intended only for + autopsy style debugging. Having it enabled will reduce performance + by a factor of at least 100. Logging is helpful for understanding + why POI can't read some file or developing POI itself. Important + errors are thrown as exceptions, which means you probably don't need + logging.
+The SS eventmodel package is an API for reading Excel files without loading the whole spreadsheet into memory. It does + require more knowledge on the part of the user, but reduces memory consumption by more than + tenfold. It is based on the AWT event model in combination with SAX. If you need read-only + access, this is the best way to do it.
+Star Office 5.1 writes some records using the older BIFF standard. This causes some problems + with POI which supports only BIFF8.
+It's possible your spreadsheet contains a feature that is not currently supported by POI. + If you encounter this then please create the simplest file that demonstrates the trouble and submit it to + Bugzilla.
+Excel stores dates as numbers, so the only way to determine if a cell is + actually stored as a date is to look at the formatting. There is a helper method + in HSSFDateUtil that checks for this. + Thanks to Jason Hoffman for providing the solution.
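To see why the formatting is the only clue, it helps to look at what the stored number means. The sketch below is not POI's helper: it is a JDK-only illustration, with a hypothetical class name, of how a serial number in the 1900 date system maps to a calendar date. It assumes the workbook does not use the 1904 date system, and it rejects serials below 61 because Excel's fictitious 29 February 1900 shifts those values by one day.

```java
import java.time.LocalDate;

// Hypothetical illustration, not part of Apache POI.
final class ExcelSerialDates {

    // Effective day zero of the 1900 date system for serials from 1 March 1900 onwards.
    private static final LocalDate EPOCH = LocalDate.of(1899, 12, 30);

    /** Converts a serial number to a date; the fractional part (time of day) is dropped. */
    static LocalDate toLocalDate(double serial) {
        if (serial < 61) {
            throw new IllegalArgumentException("Serials before 1 March 1900 are not handled: " + serial);
        }
        return EPOCH.plusDays((long) Math.floor(serial));
    }
}
```

A cell holding 44197 is therefore either the plain number 44197 or 1 January 2021, and nothing in the value itself says which. Only the cell's number format does, which is what the helper method inspects.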
++ The problem usually manifests itself as the junk characters being shown on + screen. The problem persists even though you have set the correct mime type. +
++ The short answer is, don't depend on IE to display a binary file type properly if you stream it via a + servlet. Every minor version of IE has different bugs on this issue. +
++ The problem in most versions of IE is that it does not use the mime type on + the HTTP response to determine the file type; rather it uses the file extension + on the request. Thus you might want to add a + .xls to your request + string. For example + http://yourserver.com/myServelet.xls?param1=xx. This is + easily accomplished through URL mapping in any servlet container. Sometimes + a request like + http://yourserver.com/myServelet?param1=xx&dummy=file.xls is also + known to work. +
++ To guarantee opening the file properly in Excel from IE, write out your file to a + temporary file under your web root from your servlet. Then send an http response + to the browser to do a client side redirection to your temp file. (Note that using a + server side redirect using RequestDispatcher will not be effective in this case) +
++ Note also that when you request a document that is opened with an + external handler, IE sometimes makes two requests to the webserver. So if your + generating process is heavy, it makes sense to write out to a temporary file, so that multiple + requests happen for a static file. +
++ None of this is particular to Excel. The same problem arises when you try to + generate any binary file dynamically to an IE client. For example, if you generate + pdf files using + FOP, you will come across many of the same issues. +
+ ++ Yes. You first need to get a DataFormat object from the workbook and call getFormat with the desired format. Some examples are here. +
++ Yes. This is a built-in format for Excel that you can get from the DataFormat object using the format string "@". Also, the string "text" will alias this format. +
+Add blank cells around where the cells normally would have been and set the borders individually for each cell. + We will probably enhance HSSF in the future to make this process easier.
+You just create the styles OUTSIDE of the loop in which you create cells.
+GOOD:
+BAD:
+This one comes up quite a lot, but often the reason isn't what + you might initially think. So, the first thing to check is - what's + the source of the problem? Your file? Your code? Your environment? + Or Apache POI?
+(If you're here, you probably think it's Apache POI. However, it + often isn't! A moderate laptop, with a decent but not excessive heap + size, from a standing start, can normally read or write a file with + 100 columns and 100,000 rows in under a couple of seconds, including + the time to start the JVM).
+Apache POI ships with a few programs and a few example programs, + which can be used to do some basic performance checks. For testing + file generation, the class to use is in the examples package, + SSPerformanceTest + (viewvc). + Run SSPerformanceTest with arguments of the writing type (HSSF, XSSF + or SXSSF), the number of rows, the number of columns, and if the file + should be saved. If you can't run that with 50,000 rows and 50 columns + in HSSF and SXSSF in under 3 seconds, and XSSF in under 20 seconds + (and ideally all 3 in less than that!), then the problem is with + your environment.
+Next, use the example program + ToCSV + (viewvc) + to try reading the file in with HSSF or XSSF. Related is + XLSX2CSV + (viewvc), + which uses SAX parsing for .xlsx. Run this against both your problem file, + and a simple one generated by SSPerformanceTest of the same size. If this is + slow, then there could be an Apache POI problem with how the file is being + processed (POI makes some assumptions that might not always be right on all + files). If these tests are fast, then performance problems likely are in your + code.
+The OOXML support in Apache POI is built on top of the file format + XML Schemas, as compiled into Java using + XMLBeans. Currently, + the compilation is done with XMLBeans 5.x, for maximum compatibility + with installations.
+All of the org.openxmlformats.schemas.spreadsheetml.x2006 CT... + classes are auto-generated by XMLBeans. The resulting generated Java goes + in the poi-ooxml-full-*-sources jar, and the compiled version into the + poi-ooxml-full jar.
+The full poi-ooxml-full jar is distributed with Apache POI, + along with the cut-down poi-ooxml-lite jar containing just + the common parts. The sources of poi-ooxml-full also cover the lite version; + they are available from Maven Central - ask your favourite Maven + mirror for the poi-ooxml-full-*-sources jar. Alternatively, if you download + the POI source distribution (or check out from SVN) and build, Ant will + automatically compile it for you to generate the source and binary poi-ooxml-full jars.
+The first thing to try is running the + Binary File Format Validator + from Microsoft against the file, which will report if the file + complies with the specification. If your input file doesn't, then this + may well explain why POI isn't able to process it correctly. You + should probably in this case speak to whoever is generating the file, + and have them fix it there. If your POI generated file is identified + as having an issue, and you're on the + latest codebase, report a new + POI bug and include the details of the validation failure.
+Another thing to try, especially if the file is valid but POI isn't + behaving as expected, are the POI Dev Tools for the component you're + using. For example, HSSF has org.apache.poi.hssf.dev.BiffViewer + which will allow you to view the file as POI does. This will often + allow you to check that things are being read as you expect, and + narrow in on problem records and structures.
+There's not currently a simple validator tool as there is for the + OLE2 based (binary) file formats, but checking the basics of a file + is generally much easier.
+Files such as .xlsx, .docx and .pptx are actually a zip file of XML + files, with a special structure. Your first step in diagnosing the + issues with the input or output file will likely be to unzip the + file, and look at the XML of it. Newer versions of Office will + normally tell you which area of the file is problematic, so + narrow in on there. Looking at the XML, does it look correct?
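+As a first look inside such a file, the parts of the package can be listed without any extra tooling. The following is a minimal sketch using only the JDK's java.util.zip classes; the class name OoxmlPartLister and its method are made up for this illustration and are not part of Apache POI.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

/** Hypothetical helper: lists the parts inside an OOXML package (a zip file). */
final class OoxmlPartLister {

    /** Returns one "name (size bytes)" line per zip entry in the given file. */
    static List<String> listParts(Path ooxmlFile) throws IOException {
        List<String> parts = new ArrayList<>();
        try (ZipFile zip = new ZipFile(ooxmlFile.toFile())) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                parts.add(entry.getName() + " (" + entry.getSize() + " bytes)");
            }
        }
        return parts;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path of the .xlsx / .docx / .pptx file to inspect
        for (String part : listParts(Paths.get(args[0]))) {
            System.out.println(part);
        }
    }
}
```

+Comparing this listing for a problem file against the listing for a known-good file of the same type is often enough to see which part deserves a closer look.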
+When reporting bugs, ideally include the whole file, but if you're + unable to then include the snippet of XML for the problem area, and + reference the OOXML standard for what it should contain.
+Applies to versions <= 3.17 (Java 6):
+This error indicates that the class XMLEventFactory does not provide + functionality which POI is depending upon. There can be a number of + different reasons for this:
+No. This is not supported.
+All POI jars in use must come from the same version. A combination + such as poi-3.11.jar and poi-ooxml-3.9.jar is not + supported, and will fail to work in unpredictable ways.
+If you're not sure which POI jars you're using at runtime, and/or + you suspect it might not be the one you intended, see + this FAQ entry for details on + diagnosing it. If you aren't sure what POI jars you need, see the + Components Overview + for details.
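+One generic way to check which jar a class really came from at runtime is to ask the JVM for the class's code source. The sketch below is an assumption-laden illustration rather than the method from the FAQ entry: JarLocator is a hypothetical helper, and the class names in main are examples that may differ between POI versions. Classes are looked up by name, so the code compiles and runs even without POI on the classpath.

```java
import java.net.URL;
import java.security.CodeSource;

/** Hypothetical helper: reports where a class was loaded from. */
final class JarLocator {

    /** Describes the jar or directory the named class was loaded from. */
    static String locate(String className) {
        try {
            // 'false' avoids running static initializers of the inspected class
            Class<?> type = Class.forName(className, false, JarLocator.class.getClassLoader());
            CodeSource source = type.getProtectionDomain().getCodeSource();
            if (source == null || source.getLocation() == null) {
                return className + " -> (no code source, e.g. a JDK class)";
            }
            URL location = source.getLocation();
            return className + " -> " + location;
        } catch (ClassNotFoundException | LinkageError e) {
            return className + " -> not on the classpath";
        }
    }

    public static void main(String[] args) {
        // Example class names (assumptions); adjust to the components you use.
        System.out.println(locate("org.apache.poi.POIDocument"));
        System.out.println(locate("org.apache.poi.ooxml.POIXMLDocument"));
    }
}
```

+If the reported locations point at jars with different version numbers, that is the mixed-version situation described above.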
+In short: Handling different document-objects in different threads will + work. Accessing the same document in multiple threads will not work.
+This means the workbook/document/slideshow objects are not checked for + thread safety, but any globally held object like global caches or other + data structures are guarded against multi threaded access accordingly.
+There have been + discussions + about accessing different Workbook-sheets + in different threads concurrently. While this may work to some degree, it may lead + to very hard to track errors as multi-threading issues typically only + manifest after long runtime when many threads are active and the system + is under high load, i.e. in production use! Also it might break in future + versions of Apache POI as we do not specifically test using the library + this way.
+Across most of the UserModel classes ( +POIDocument +and +POIXMLDocument), + you can open the document from a read-only File, a read-write File + or an InputStream. You can always write out to an OutputStream, + and increasingly also to a File. +
+Opening your document from a File is suggested wherever possible. + This will always be quicker and use less memory than using an InputStream, + as the latter has to buffer things in memory.
+When writing, you can use an OutputStream to write to a new file, or + overwrite an existing one (provided it isn't already open!). On slow links / disks, + wrapping with a BufferedOutputStream is suggested. To write like this, use +write(OutputStream). +
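+The buffering suggestion can be sketched with JDK classes alone. In the example below, StreamWritable is a hypothetical stand-in for any document object that exposes write(OutputStream); it is not a POI type, and BufferedSave is likewise just an illustrative name.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper: saves a document through a buffered stream. */
final class BufferedSave {

    /** Stand-in for any document type that offers write(OutputStream). */
    interface StreamWritable {
        void write(OutputStream out) throws IOException;
    }

    /** Creates or overwrites the target file, writing through a buffer. */
    static void save(StreamWritable document, Path target) throws IOException {
        try (OutputStream out = new BufferedOutputStream(Files.newOutputStream(target))) {
            document.write(out);
        } // closing the buffered stream flushes it and closes the file
    }
}
```

+With a document object whose write(OutputStream) method matches this shape, a call along the lines of BufferedSave.save(workbook::write, target) should work; the target must not be the file the document was opened from.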
+To write to the currently open file (an in-place write / replace), you need to + have opened your document from a File, not an InputStream. In + addition, you need to have opened from the File in read-write mode, not + read-only mode. To write to the currently open file, on formats that support it + (not all do), use +write(). +
+You can also write out to a new File. This is available no matter how + you opened the document, and will create/replace a new file. It is faster and lower + memory than writing to an OutputStream. However, you can't use this to + replace the currently open file, only files not currently open. To write to a + new / different file, use +write(File) +
+More information is also available in the +HSSF and XSSF documentation, + which largely applies to the other formats too. +
+Note that currently (POI 3.15 beta 3), not all of the write methods are available + for the OOXML formats yet. +
+Starting with POI 3.16 there's a workaround for OSGi's context classloader handling.
+ OSGi replaces the thread's current context classloader with an implementation that has a
+ limited class view. This leads to IllegalStateExceptions, as XMLBeans can't find
+ the XML schema definitions in this reduced view. The workaround is to initialize
+ the classloader delegate of POIXMLTypeLoader, which defaults to the current
+ thread context classloader. The initialization should take place before any other
+ OOXML related calls. The class in the example could be any class which is
+ part of the poi-ooxml-schema or ooxml-schema jars:
+ POIXMLTypeLoader.setClassLoader(CTTable.class.getClassLoader());
+
+ POI is successfully tested with many different versions of Java. It is + recommended that you use Java versions that have Long Term Support (Java 11, 17 and 21). +
+Including the existing binaries as normal jar-files + should work when using recent versions of Apache POI. You may see + some warnings about illegal reflective access, but it should work fine + despite those. We are working on getting the code changed so we avoid + discouraged accesses in the future. +
+NOTE: Apache POI tries to support the Java module system but it is more complicated + because Apache POI is still supporting Java 8 and the module system + cannot be fully supported while maintaining such support. +
++ FYI, jaxb in current versions also causes some warnings about reflective access, + we cannot fix those until jaxb >= 2.4.0 is available, see + https://stackoverflow.com/a/50251510/411846 for details, you can set a system + property "com.sun.xml.bind.v2.bytecode.ClassTailor.noOptimize" to avoid this warning. +
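+Setting that property from code could look like the sketch below. The property name is taken from the text above; the value "true" and the helper class name are assumptions, and the property presumably has to be set before any JAXB classes are first used. Passing it on the command line as a -D option is the equivalent without code changes.

```java
/** Hypothetical helper: sets the JAXB property mentioned above. */
final class JaxbWarningWorkaround {

    /** Property name as given in the FAQ text. */
    static final String NO_OPTIMIZE =
            "com.sun.xml.bind.v2.bytecode.ClassTailor.noOptimize";

    /** Call early, e.g. at the start of main(); the value "true" is an assumption. */
    static void apply() {
        System.setProperty(NO_OPTIMIZE, "true");
    }
}
```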
++ For compiling Apache POI, you should use at least version 4.1.0 when it becomes available + or a recent trunk checkout until then. +
++ If you are building POI yourself from source files, use an up to date version of Gradle. + If you use Ant, again check the Ant version supports the version of Java you are using. +
+Apache POI does not actively support Java 9 or Java 10 any longer as those versions were + obsoleted by Oracle already. See the previous FAQ entry for information about support for + Java LTS versions. +
+The IBM Java runtime uses a JIT compiler which sometimes misbehaves. ;) + Especially when rendering slideshows it throws errors which don't occur when debugging the code. + E.g. an ArrayIndexOutOfBoundsException is thrown in TexturePaintContext when the image contains + textures - see #62999 for more + details on how to detect JIT errors.
+To prevent the JIT errors, the affected methods need to be excluded from JIT compiling.
+ Currently (tested with IBM JDK 1.8.0_144 and _191) the following should be added to the VM parameters:
+
Apache POI uses Java ThreadLocals + in order to cache some data when Apache POI is used in a multi-threading environment (see also the FAQ about thread-safety above!) +
+WebServers like Tomcat use thread-pooling to re-use threads to avoid the cost of frequent thread-startup and shutdown. + In order to guard against memory-leaks, Tomcat performs checks on allocated memory in ThreadLocals and reports them as warnings. +
+In order to get rid of these warnings, Apache POI, starting with version 5.2.4, provides a utility ThreadLocalUtils which can + be used to clear all objects held in thread-local objects before returning the thread back to the global pool. +
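+The general pattern is to clear thread-local state in a finally block before a pooled thread is handed back. The sketch below shows that pattern with JDK classes only: CACHE is a hypothetical stand-in for a library-internal thread-local, and the CACHE.remove() line marks the place where the cleanup call of the POI utility would go. The utility's exact method is not spelled out here because it is not given in the text above.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical illustration of cleaning thread-locals on pooled threads. */
final class PooledThreadCleanup {

    /** Stand-in for a per-thread cache held by a library. */
    static final ThreadLocal<StringBuilder> CACHE = new ThreadLocal<>();

    /** Wraps the work so thread-local state is always cleared afterwards. */
    static Runnable withCleanup(Runnable work) {
        return () -> {
            try {
                work.run();
            } finally {
                // with POI 5.2.4+, the library's cleanup call would go here
                CACHE.remove();
            }
        };
    }

    public static void main(String[] args) throws Exception {
        // a single-thread pool guarantees both tasks run on the same thread
        ExecutorService pool = Executors.newFixedThreadPool(1);
        try {
            pool.submit(withCleanup(() -> CACHE.set(new StringBuilder("cached")))).get();
            Future<Boolean> clean = pool.submit(() -> CACHE.get() == null);
            System.out.println("thread-local cleared: " + clean.get());
        } finally {
            pool.shutdown();
        }
    }
}
```

+In a servlet container the same wrapping would typically live in a filter or at the end of request handling, so that every request path goes through the cleanup.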
+Apache POI is an open source project developed by a very small group of volunteers. +
+Currently no-one is paid to work on new features or bug-fixes. +
+So it is considered fairly rude to "demand" things. Asking for something "ASAP" in particular is frowned + upon and may even reduce the likelihood that your issue is picked up and worked on. +
+If you would like to increase chances that your problem is tackled, you can do a number of things + as follows, sorted by the amount of effort which may be required from you: +
+There are two angles to reproducibility: building reproducible jars for Apache POI itself and making Apache POI + produce byte-for-byte identical files when it is used to create documents. +
+Please create a bug entry if you find things which break reproducibility, both for building and output files.
+ Please provide exact steps how to reproduce your issue!
+
See https://reproducible-builds.org/ for general information about why reproducible builds + and output may be important. +
++ Before subscribing or participating in any of the mailing + lists, we suggest you read and understand the following + guidelines: +
+ ++ Medium Traffic + View, + Participate and Subscribe to the Dev List +
++ This is the list where participating developers of the POI + project meet and discuss issues, code changes/additions, etc. + Subscribers to this list also get notices of each and every + code change, build results, testing notices, etc. + Do not send mail to this list with usage questions or + configuration problems. Use the POI User List or community sites + such as Stack Overflow, instead. +
++ Alternate options: + Subscribe + Unsubscribe + Old Archive + + Nabble + MarkMail +
++ Low Traffic + View, + Participate and Subscribe to the User List +
++ This list is for users of POI to ask questions, share knowledge, + and discuss issues. POI developers are also expected to be + lurking on this list to offer support to users of POI. +
++ Alternate options: + Subscribe + Unsubscribe + Old Archive + + Nabble + MarkMail +
++ Very Low Traffic + View, + Participate and Subscribe to the General List +
++ This list exists for general discussions on POI, not specific to + code or problems with code. Used for discussion of general matters + relating to all of the POI project, such as the website and + changes in procedures. +
++ Alternate options: + Subscribe + Unsubscribe + Old Archive +
++ There are many POI users in the Stack Overflow community who have asked + and answered questions that may be similar to the problem you are facing. + Search for the apache-poi + tag on Stack Overflow. +
+Regardless of which community you seek help from, remember to be courteous. + Short, working code examples, an explanation of observed and expected behavior, + the version of POI you are using, and genuine troubleshooting and research effort + on your part go a long way towards getting a helpful answer. +
+Please read through the FAQ, + Quick Guide, + How To or + Cookbook, and + Examples + of the POI module that you are trying to use before consulting help. You may also find your + question has already been answered on the POI dev + or user mailing lists, + bugzilla, +
+
+ When parsing OOXML format files like xlsx, docx and pptx, a specially crafted file could
+ be used to provide multiple entries with the same name in the zip-compressed file-format.
+
+ Products reading the affected file could then read different data, because different products
+ select different entries among those sharing the duplicate name.
+ This issue affects Apache POI component poi-ooxml before 5.4.0. Starting with 5.4.0 poi-ooxml performs
+ a check that throws an exception if zip entries with duplicate file names are found in the input file.
+ Users are recommended to upgrade to version poi-ooxml 5.4.0 or later, which fixes the issue.
+ Please refer to our security guidelines
+ for recommendations about how to use the POI libraries securely.
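+ For applications that want an additional pre-check of their own, duplicate entry names can be detected with the JDK alone. The following is a minimal sketch and not the check implemented in poi-ooxml: DuplicateZipEntryCheck is a hypothetical helper that reads the zip sequentially and reports every name seen more than once.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

/** Hypothetical helper: finds zip entry names that occur more than once. */
final class DuplicateZipEntryCheck {

    /** Returns the names that repeat when the zip is read entry by entry. */
    static Set<String> findDuplicateNames(InputStream zipData) throws IOException {
        Set<String> seen = new HashSet<>();
        Set<String> duplicates = new LinkedHashSet<>();
        try (ZipInputStream zip = new ZipInputStream(zipData)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                // add() returns false when the name was already present
                if (!seen.add(entry.getName())) {
                    duplicates.add(entry.getName());
                }
            }
        }
        return duplicates;
    }
}
```

+ A non-empty result is a reason to reject the file before handing it to any parser. The sketch only looks at the entries as they appear in the stream, so it complements rather than replaces upgrading to a fixed version.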
+
+ References: +
+The Apache POI team is pleased to announce the release of 5.4.1. + Several dependencies were updated to their latest versions to pick up security fixes and other improvements.
+A summary of changes is available in the + Release Notes. + A full list of changes is available in the change log. + People interested should also follow the dev list to track progress.
+See the downloads page for more details.
+POI requires Java 8 or newer since version 4.0.1.
+While testing a potential Apache POI 5.4.0 release, we discovered a serious bug in + log4j-api 2.24.1. This leads to NullPointerExceptions when you use a version of log4j-core that is not of + the exact same version (2.24.1). We recommend that users avoid log4j 2.24.1 and use the latest + 2.24.x version, where this issue is fixed.
+XMLBeans release 5.2.2 had the problematic log4j-api 2.24.1 dependency and thus + can lead to such issues if used in some other context. In the meantime a version 5.3.0 + of XmlBeans was released which avoids this issue.
+Please direct any queries to the Log4j Team. The main issue is + Issue 3143.
+Description:
+ A shortcoming in the HMEF package of poi-scratchpad (Apache POI) allows an attacker to cause an Out of Memory exception.
+ This package is used to read TNEF files (Microsoft Outlook and Microsoft Exchange Server).
+ If an application uses poi-scratchpad to parse TNEF files and the application allows untrusted users to supply them, then a carefully crafted file can cause an Out of Memory exception.
Mitigation:
+ Affected users are advised to update to poi-scratchpad 5.2.1 or above
+ which fixes this vulnerability. It is recommended that you use the same versions of all POI jars.
The Apache POI PMC has evaluated the security vulnerabilities reported + for Apache Log4j.
+POI 5.1.0 and XMLBeans 5.0.2 only have dependencies on log4j-api 2.14.1. + The security vulnerabilities are not in log4j-api - they are in log4j-core.
+If any POI or XMLBeans user uses log4j-core to control their logging of their application, + we strongly recommend that they upgrade all their log4j dependencies to the latest + version (currently v2.20.0) - including log4j-api.
+Description:
+ When parsing XML files using XMLBeans 2.6.0 or below, the underlying parser
+ created by XMLBeans could be susceptible to XML External Entity (XXE) attacks.
This issue was fixed a few years ago but on review, we decided we should have a CVE + to raise awareness of the issue.
+ +Mitigation:
+ Affected users are advised to update to Apache XMLBeans 3.0.0 or above
+ which fixes this vulnerability. XMLBeans 4.0.0 or above is preferable.
References: + XML external entity attack +
+Description:
+ When using the tool XSSFExportToXml to convert user-provided Microsoft
+ Excel documents, a specially crafted document can allow an attacker to
+ read files from the local filesystem or from internal network resources
+ via XML External Entity (XXE) Processing.
Mitigation:
+ Apache POI 4.1.0 and before: users who do not use the tool XSSFExportToXml
+ are not affected. Affected users are advised to update to Apache POI 4.1.1
+ which fixes this vulnerability.
Credit: + This issue was discovered by Artem Smotrakov from SAP
+ +References: + XML external entity attack +
+The Apache POI team is pleased to announce the release of XMLBeans 3.1.0. + Featured are a handful of bug fixes.
+The Apache POI project has unretired the XMLBeans codebase and is maintaining it as a sub-project, + due to its importance in the poi-ooxml codebase.
+A summary of changes is available in the + Release Notes. + People interested should also follow the POI dev list to track progress.
+The XMLBeans JIRA project has been reopened; feel free to open issues there.
+POI 4.1.0 uses XMLBeans 3.1.0.
+XMLBeans requires Java 6 or newer since version 3.0.2.
+We did some work to verify that compilation with Java 11 is working and + that all unit-tests pass. +
+See the details in the FAQ entry.
++ The Apache POI Project's mission is to create and maintain Java APIs for manipulating various file formats + based upon the Office Open XML standards (OOXML) and Microsoft's OLE 2 Compound Document format (OLE2). + In short, you can read and write MS Excel files using Java. + In addition, you can read and write MS Word and MS PowerPoint files using Java. Apache POI is your Java Excel + solution (for Excel 97-2008). We have a complete API for porting other OOXML and OLE2 formats and welcome others to participate. +
++ OLE2 files include most Microsoft Office files such as XLS, DOC, and PPT as well as MFC serialization API based file formats. + The project provides APIs for the OLE2 Filesystem (POIFS) and + OLE2 Document Properties (HPSF). +
++ Office OpenXML Format is the new standards based XML file format found in Microsoft Office 2007 and 2008. + This includes XLSX, DOCX and PPTX. The project provides a low level API to support the Open Packaging Conventions + using openxml4j. +
++ For each MS Office application there exists a component module that attempts to provide a common high level Java api to both OLE2 and OOXML + document formats. This is most developed for Excel workbooks (SS=HSSF+XSSF). + Work is progressing for Word documents (WP=HWPF+XWPF) and + PowerPoint presentations (SL=HSLF+XSLF). +
++ The project has some support for Outlook (HSMF). Microsoft opened the specifications + to this format in October 2007. We would welcome contributions. +
++ There are also projects for + Visio (HDGF and XDGF), + TNEF (HMEF), + and Publisher (HPBF). +
++ As a general policy we collaborate as much as possible with other projects to + provide this functionality. Examples include: Cocoon for + which there are serializers for HSSF; + Open Office.org with whom we collaborate in documenting the + XLS format; and Tika / + Lucene, + for which we provide format interpreters. When practical, we donate + components directly to those projects for POI-enabling them. +
++ A major use of the Apache POI api is for Text Extraction applications + such as web spiders, index builders, and content management systems. +
++ So why should you use POIFS, HSSF or XSSF? +
++ You'd use POIFS if you had a document written in OLE 2 Compound Document Format, probably written using + MFC, that you needed to read in Java. Alternatively, you'd use POIFS to write OLE 2 Compound Document Format + if you needed to inter-operate with software running on the Windows platform. We are not just bragging when + we say that POIFS is the most complete and correct implementation of this file format to date! +
++ You'd use HSSF if you needed to read or write an Excel file using Java (XLS). You'd use + XSSF if you need to read or write an OOXML Excel file using Java (XLSX). The combined + SS interface allows you to easily read and write all kinds of Excel files (XLS and XLSX) + using Java. Additionally there is a specialized SXSSF implementation which allows you to write + very large Excel (XLSX) files in a memory optimized way. +
++ The Apache POI Project provides several component modules some of which may not be of interest to you. + Use the information on our Components page to determine which + jar files to include in your classpath. +
++ So you'd like to contribute to the project? Great! We need enthusiastic, + hard-working, talented folks to help us on the project, no matter your + background. So if you're motivated, ready, and have the time: Download the + source from the + Subversion Repository, + build the code, join the + mailing lists, and we'll be happy to + help you get started on the project! +
++ Please read our Contribution Guidelines. + When your contribution is ready submit a patch to our + Bug Database. +
++ Apache POI™ releases are available under the Apache License, Version 2.0. + See the NOTICE file contained in each release artifact for applicable copyright attribution notices. Release artifacts are available + from the Download page. +
++All material on this website is Copyright © 2002-2025, The Apache +Software Foundation. +
++Apache POI, POI, Apache, the Apache feather logo, and the Apache POI +project logo are trademarks of The Apache Software Foundation. +
++Sun, Sun Microsystems, Solaris, Java, JavaServer Web Development Kit, +and JavaServer Pages are trademarks or registered trademarks of Sun +Microsystems, Inc. UNIX is a registered trademark in the United States +and other countries, exclusively licensed through 'The Open Group'. +Microsoft, Windows, WindowsNT, Excel, Word, PowerPoint, Visio, Publisher, Outlook, +and Win32 are registered trademarks of Microsoft Corporation. +Linux is a registered trademark of Linus Torvalds. +All other product names mentioned herein and throughout the entire +web site are trademarks of their respective owners. +
++ This distribution includes cryptographic software. The country in + which you currently reside may have restrictions on the import, + possession, use, and/or re-export to another country, of + encryption software. BEFORE using any encryption software, please + check your country's laws, regulations and policies concerning the + import, possession, or use, and re-export of encryption software, to + see if this is permitted. See + http://www.wassenaar.org/ + for more information. +
+ ++ The U.S. Government Department of Commerce, Bureau of Industry and + Security (BIS), has classified this software as Export Commodity + Control Number (ECCN) 5D002.C.1, which includes information security + software using or performing cryptographic functions with asymmetric + algorithms. The form and manner of this Apache Software Foundation + distribution makes it eligible for export under the License Exception + ENC Technology Software Unrestricted (TSU) exception (see the BIS + Export Administration Regulations, Section 740.13) for both object + code and source code. +
+ ++ The cryptographic software used is from java.security and + javax.crypto and is used when processing encrypted and + protected documents. +
++ These are articles/etc. posted about POI around the web. If you + see POI in the news or mentioned at least somewhat prominently + on a site (not your homepage that you put the word POI on in + order to get us to link you and by the way here is a picture of + your wife and kids) then send a patch to the list. In general + equal time will be given so please feel free to send inflammatory + defamation as well as favorable, technical and factual. Really + stupid things won't be mentioned (sorry). +
++ If you can read one of these languages, send mail to the list + telling us what language it is and we'll categorize it! +
++ This page lists other projects that you might find interesting when working with documents of various types. Suggestions for additional links are welcome, however please note that we only list open source projects here. Commercial applications can provide case studies if they want to show their support for POI. +
++ Apache Tika + is a toolkit which detects and extracts metadata and text from over a thousand different file types. +
++ Apache Drill + is a toolkit that allows the use of SQL querying on numerous file and data formats. The POI support is in + the excel-format-plugin. +
++ Apache Hop + is a data orchestration and data engineering platform. The POI support is in + the excelinput transform + and the excelwriter transform. + +
++ Apache DolphinScheduler + is a distributed and easy-to-extend visual workflow scheduler system. The POI support is in + the alert email component. + +
++ There is a Worksheet + plugin for JSPWiki which allows you to display contents of Excel + files as a table in JSPWiki. +
++ Apache Linkis (incubating) is a computation middleware layer. + The linkis-storage component has an Excel read capability built using Apache Poi. +
++ Apache Seatunnel (incubating) is a high-performance, distributed, massive data integration framework. + The seatunnel-connector-spark-email component uses Apache Poi. +
++ Apache ODF Toolkit (incubating) is a set of Java modules that allow programmatic creation, scanning and manipulation of OpenDocument Format (ISO/IEC 26300 == ODF) documents. + See also new website. +
++ Apache Corinthia (incubating) is a toolkit/application written in C++ for converting between and editing common office file formats, with an initial focus on word processing. +
++ Jackcess is a pure Java library for reading from and writing to MS Access databases available under Apache License 2.0. +
++ poi-mail-merge is a small tool to automate mail-merges, i.e. replacing strings in a template Microsoft Word file multiple times with data from a list of replacements + provided as Excel/CSV data. Available under the BSD 2-Clause License. +
+Merged into POI as of version 3.14
++ poi-visio is a Java library that loads Visio OOXML (vsdx) files and creates an in-memory data structure that allows full access to the contents of the document. + There is built-in support for easily traversing the content of the document in a structured way, and can render pages to simplified PNG files, or other backends supported by Java AWT. + Currently, the library only operates in read-only mode, but its design does not exclude being able to modify existing documents or creating new documents. + Available under the Apache License, Version 2.0. +
++ poi-visio-graph is a Java library that loads Visio OOXML (vsdx) files using the poi-visio library and creates an in-memory graph structure from the objects present on the page. + It utilizes user-specified connection points and also performs analysis to infer logical visual connection points between the objects on each page. + One possible use of this library is to create a network diagram from a Visio document. + Available under the Apache License, Version 2.0. +
++ NPOI is a .NET version of Apache POI available under Apache License 2.0. +
++ Vaadin Spreadsheet is a UI component add-on for Vaadin 7 which provides means to view and edit Excel spreadsheets in Vaadin applications. + Available under the Commercial Vaadin Add-on License version 3 (CVALv3). +
++ Excel module for Apache Isis is an add-on for Apache Isis and provides a domain service so that a collection of (view model) + objects can be exported to an Excel spreadsheet, or recreated by importing from Excel. + Available under the Apache License, Version 2.0. +
++ Excel Streaming Reader uses the POI Streaming API to provide Row/Cell like read-access to large Excel spreadsheets. + Available under the Apache License, Version 2.0. +
++ Forked Version that supports the latest POI versions. + Has support for a number of extra features, including Strict OOXML files. + Also, available under the Apache License, Version 2.0. +
++ fastexcel is a benchmarked library for reading and writing Excel files. + Available under the Apache License, Version 2.0. +
++ poi-shared-strings is a memory efficient Shared Strings Table and Comments Table implementation for POI streaming. + Available under the Apache License, Version 2.0. +
++ The Wordinator abstracts the general problem of mapping from XML (or any similar structured content--with XSLT 3 you could just as easily process JSON content or some other format) to word processing data through a relatively simple XML structure, the Simple Word Processing Markup Language (SWPX), which is basically OOXML simplified way down. + Available under the Apache License, Version 2.0. +
++ POI-TL is a Word template engine that generates new documents based on a Word template and data. + Available under the Apache License, Version 2.0. +
++ XDocReport is a Java API to merge XML documents created with MS Office (docx), OpenOffice (odt) or + LibreOffice (odt) with a Java model to generate a report and, if needed, convert it to another format (PDF, XHTML...). + XDocReport code is licensed under the MIT license but the samples are licensed under the LGPL license. +
++ Frosted Sheets is a Groovy library which provides decorators for Apache POI spreadsheets, making it easier to work with spreadsheets + in Groovy. + Frosted Sheets is licensed under the Apache License, Version 2.0. +
++ iEXL is a commercial product which allows you to generate Excel spreadsheets on AS/400, iSeries, i5 or IBM i on Power systems. + It uses Apache POI internally. +
++ jotlmsg is a simple API (on top of POI) to easily generate Microsoft Outlook message files (.msg). +
++ HadoopOffice allows you to read and write Office documents while using the Hadoop ecosystem. + Available under the Apache License, Version 2.0. +
++ SPOIWO allows you to read and write Office documents using Scala friendly APIs. + Available under the MIT License. +
++ Spark Excel allows you to read and write Excel documents into/from Spark Dataframes. + Available under the Apache License, Version 2.0. +
++ ExcelUtil is a Java wrapper using Apache POI to read and write Excel files in declarative fashion. + Available under the Apache License, Version 2.0. +
++ dev-excel is a Java wrapper using Apache POI to read and write Excel files. + Available under the MIT License. +
+This page provides some guidance about how Apache POI can be used in security-sensitive areas.
+Information about security issues is included in the Project News.
+Apache POI will try to fix security-related bugs with priority.
+ +Please follow the general Apache Security Guidelines + for proper handling.
+ +But please note that by the nature of processing external files, you should design your application + in a way which limits impact of malicious documents as much as possible. The higher your security-related + requirements are, the more you likely need to invest in your application to contain effects. +
+If you are processing documents from an untrusted source, you should add a number of safeguards to + your application to contain any unexpected side effects.
+ +Apache POI cannot fully protect against some documents causing impact on the current process, therefore + we suggest the following additional layers of security.
+For a number of years now, Apache POI has provided basic + text extraction for all the project supported file formats. As + well as the (plain) text, these extractors provide access to + the metadata associated with a given file, such as title and + author.
+For more advanced text extraction needs, including Rich Text + extraction (such as formatting and styling), along with XML and + HTML output, Apache POI works closely with + Apache Tika to deliver + POI-powered Tika Parsers for all the project supported file formats.
+If you are after turn-key text extraction, including the latest + support, styles etc, you are strongly advised to make use of + Apache Tika, which builds + on top of POI to provide Text and Metadata extraction. If you wish + to have something very simple and stand-alone, or you wish to make + heavy modifications, then the POI provided text extractors documented + below might be a better fit for your needs.
+All of the POI text extractors extend from + org.apache.poi.extractor.POITextExtractor. This provides a common + method across all extractors, getText(). For many cases, the text + returned will be all you need. However, many extractors do provide + more targeted text extraction methods, so you may wish to use + these in some cases.
+All POIFS / OLE 2 based text extractors also extend from + org.apache.poi.extractor.POIOLE2TextExtractor. This additionally + provides common methods to get at the HPSF + document metadata.
+All OOXML based text extractors also extend from + org.apache.poi.ooxml.extractor.POIXMLTextExtractor. This additionally + provides common methods to get at the OOXML metadata.
+POI provides a common class to select the appropriate text extractor + for you, based on the supplied document's contents. + ExtractorFactory provides a + similar function to WorkbookFactory. You simply pass it an + InputStream, a File, a POIFSFileSystem or an OOXML Package. It + figures out the correct text extractor for you, and returns it.
+For complete detection and text extractor auto-selection, users + are strongly encouraged to investigate + Apache Tika.
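+To illustrate what selecting an extractor "based on the supplied document's contents" can mean, the sketch below tells the two container formats apart by their leading bytes. It is a simplified stand-in using only the JDK, not the ExtractorFactory implementation; the class ContainerSniffer and its enum are hypothetical names.

```java
import java.util.Arrays;

/** Hypothetical helper (not a POI class): content-based container detection. */
final class ContainerSniffer {
    enum Container { OLE2, ZIP, UNKNOWN }

    // Signature at the start of an OLE 2 compound document (.xls, .doc, .ppt, ...)
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // Local file header signature at the start of a ZIP archive, which is
    // the container used by OOXML files (.xlsx, .docx, .pptx)
    private static final byte[] ZIP_MAGIC = { 'P', 'K', 0x03, 0x04 };

    private ContainerSniffer() {}

    /** Classifies a file by its first bytes; the extension is never consulted. */
    static Container sniff(byte[] header) {
        if (startsWith(header, OLE2_MAGIC)) {
            return Container.OLE2;
        }
        if (startsWith(header, ZIP_MAGIC)) {
            return Container.ZIP;
        }
        return Container.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
            && Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }
}
```

+A ZIP signature alone does not prove that a file is OOXML, and an OLE 2 signature does not say which application wrote it, so a real factory has to look inside the container as well.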
+For .xls files, there is + org.apache.poi.hssf.extractor.ExcelExtractor, which will + return text, optionally with formulas instead of their contents. + Similarly, for .xlsx files there is + org.apache.poi.xssf.extractor.XSSFExcelExtractor, which + provides the same functionality.
+For those working in constrained memory footprints, there are + two more Excel text extractors available. For .xls files, there is + org.apache.poi.hssf.extractor.EventBasedExcelExtractor, + which is based on the streaming EventUserModel code and will generally + deliver a lower memory footprint for extraction. However, it will + have problems correctly outputting more complex formulas, as it + works with records as they pass, and so doesn't have access to all + parts of complex and shared formulas. For .xlsx files the equivalent is + org.apache.poi.xssf.extractor.XSSFEventBasedExcelExtractor, + which is based on the XSSF SAX Event codebase.
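+The memory saving of the event-based approach comes from handling content as it streams past instead of building a full in-memory model first. The sketch below shows the idea with the JDK's SAX parser on a fragment shaped like an .xlsx shared strings part. It is a simplified illustration, not the XSSFEventBasedExcelExtractor code, and the class StreamingTextSketch is a hypothetical name.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper (not a POI class): streaming text collection via SAX. */
final class StreamingTextSketch {
    private StreamingTextSketch() {}

    /** Collects the content of every <t> element, one entry per line. */
    static String extractText(String xml) {
        StringBuilder out = new StringBuilder();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)),
                new DefaultHandler() {
                    private boolean inText; // true while inside a <t> element

                    @Override
                    public void startElement(String uri, String localName,
                                             String qName, Attributes atts) {
                        if ("t".equals(localName)) {
                            inText = true;
                        }
                    }

                    @Override
                    public void characters(char[] ch, int start, int length) {
                        if (inText) {
                            out.append(ch, start, length);
                        }
                    }

                    @Override
                    public void endElement(String uri, String localName, String qName) {
                        if ("t".equals(localName)) {
                            inText = false;
                            out.append('\n');
                        }
                    }
                });
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
        return out.toString();
    }
}
```

+Only the element currently being visited is held by the handler, which is why this style copes with large sheets. The same property explains the formula limitation noted above: anything that needs data from elsewhere in the file is not available when the event arrives.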
+For .doc files from Word 97 - Word 2003, in scratchpad there is + org.apache.poi.hwpf.extractor.WordExtractor, which will + return text for your document.
+You can also extract simple textual content from + older Word 6 and Word 95 files, using the scratchpad class + org.apache.poi.hwpf.extractor.Word6Extractor.
+For .docx files, the relevant class is + org.apache.poi.xwpf.extractor.XWPFWordExtractor.
+For .ppt and .pptx files, there is a common extractor, + org.apache.poi.sl.extractor.SlideShowExtractor, which + will return text for your slideshow, optionally restricted to just + slides text or notes text. For .ppt you need to add the poi-scratchpad.jar, + and for .pptx the poi-ooxml.jar and its dependencies are needed.
+For .pub files, in scratchpad there is + org.apache.poi.hpbf.extractor.PublisherExtractor, which + will return text for your file.
+For .vsd files, in scratchpad there is + org.apache.poi.hdgf.extractor.VisioTextExtractor, which + will return text for your file.
+Extractors already exist for Excel, Word, PowerPoint and Visio; + if one of these objects is embedded into a worksheet, the ExtractorFactory class can be used to recover an extractor for it. +
+| Type | +Bug | +Module | +Description | +
|---|