【solr源码解析】Reindexing

内容纲要

https://solr.apache.org/guide/8_11/reindexing.html

There are several types of changes to Solr configuration that require you to reindex your data, particularly changes to your schema.

These changes include editing properties of fields or field types; adding fields, or copy field rules; upgrading Solr; and changing certain system configuration properties.

It’s important to be aware that failing to reindex can have both obvious and subtle consequences for Solr or for users finding what they are looking for.

"Reindex" in this context means first delete the existing index and repeat the process you used to ingest the entire corpus from the system-of-record. It is strongly recommended that Solr users have a consistent, repeatable process for indexing so that the indexes can be recreated as the need arises.

Re-ingesting all the documents in your corpus without first insuring that all documents and Lucene segments have been deleted is not sufficient, see the section Reindexing Strategies.

Reindexing is recommended during major upgrades, so in addition to covering what types of configuration changes should trigger a reindex, this section will also cover strategies for reindexing.

Changes that Require Reindex

Schema Changes

With very few exceptions, changes to a collection’s schema require reindexing. This is because many of the available options are only applied during the indexing process. Solr has no way to implement the desired change without reindexing the data.

To understand the general reason why reindexing is ever required, it’s helpful to understand the relationship between Solr’s schema and the underlying Lucene index. Lucene does not use a schema, schemas a Solr-only concept. When you change Solr’s schema, the Lucene index is not modified in any way.

This means that there are many types of schema changes that cannot be reflected in the index simply by modifying Solr’s schema. This is different from most database models where schemas are used. When indexing, Solr’s schema acts like a rulebook for indexing documents by telling Lucene how to interpret the data being sent. Once the documents are in Lucene, Solr’s schema has no control over the underlying data structure.

In addition, changing the schema version property is equivalent to changing field type properties. This type of change is usually only made during or because of a major upgrade.

Changing Field and Field Type Properties

When you change your schema by adding fields, removing fields, or changing the field or field type definitions you generally do so with the intent that those changes alter how documents are searched. The full effects of those changes are not reflected in the corpus as a whole until all documents are reindexed.

Changes to any field/field type property described in Field Type Properties must be reindexed in order for the change to be reflected in all documents.

Changing field properties that affect indexing without reindexing is not recommended. This should only be attempted with a thorough understanding of the consequences. Negative impacts on the user may not be immediately apparent.

Changing Field Analysis

Beyond specific field-level properties, analysis chains are also configured on field types, and are applied at index and query time.

If separate analysis chains are defined for query and indexing events for a field and you change only the query-time analysis chain, reindexing is not necessary.

Any change to the index-time analysis chain requires reindexing in almost all cases.

Solrconfig Changes

Identifying changes to solrconfig.xml that alter how data is ingested and thus require reindexing is less straightforward. The general rule is "anything that changes what gets stored in the index requires reindexing". Here are several known examples.

The parameter luceneMatchVersion in solrconfig.xml controls the compatibility of Solr with Lucene. Since this parameter can change the rules for analysis behind the scenes, it’s always recommended to reindex when changing it. Usually this is only changed in conjunction with a major upgrade.

If you make a change to Solr’s Update Request Processors, it’s generally because you want to change something about how update requests (documents) are processed (indexed). In this case, we recommend that you reindex your documents to implement the changes you’ve made just as if you had changed the schema.

Similarly, if you change the codecFactory parameter in solrconfig.xml, it is again strongly recommended that you plan to reindex your documents to avoid unintended behavior.

Upgrades

When upgrading between major versions (for example, from a 7.x release to 8.x), a best practice is to always reindex your data. The reason for this is that subtle changes may occur in default field type definitions or the underlying code.

Lucene works hard to insure one major version back-compatability, thus Solr 8x functions with indexes created with Solr 7x. However, given that this guarantee does not apply to Solr X-2 (Solr 6x in this example) we still recommend completely reindexing when moving from Solr X-1 to Solr X.

If you have not changed your schema as part of an upgrade from one minor release to another (such as, from 7.x to a later 7.x release), you can often skip reindexing your documents. However, when upgrading to a major release, you should plan to reindex your documents.
You must always reindex your corpus when upgrading an index produced with a Solr version more than X-1 old. For instance, if you’re upgrading to Solr 8x, an index ever used by Solr 6x must be deleted and re-ingested as outlined below. A marker is written identifying the version of Lucene used to ingest the first document. That marker is preserved in the index forever unless the index is entirely deleted. If Lucene finds a marker more than X-1 major versions old, it will refuse to open the index.

Reindexing Strategies

There are a few approaches available to perform the reindex.

The strategies described below ensure that the Lucene index is completely dropped so you can recreate it to accommodate your changes. They allow you to recreate the Lucene index without having Lucene segments lingering with stale data.

A Lucene index is a lossy abstraction designed for fast search. Once a document is added to the index, the original data cannot be assumed to be available. Therefore it is not possible for Lucene to "fix up" existing documents to reflect changes to the schema, they must be indexed again.

There are a number of technical reasons that make re-ingesting all documents correctly without deleting the entire corpus first difficult and error-prone to code and maintain.

Therefore, since all documents have to be re-ingested to insure the abstraction faithfully reflects the new schema for all documents, we recommend deleting all documents after insuring that there are no old Lucene segments or reindexing to a new collection.

Delete All Documents

The best approach is to first delete everything from the index, and then index your data again. You can delete all documents with a "delete-by-query", such as this:

curl -X POST -H 'Content-Type: application/json' --data-binary '{"delete":{"query":"*:*" }}' http://localhost:8983/solr/my_collection/update

It’s important to verify that all documents have been deleted, as that ensures the Lucene index segments have been deleted as well.

To verify that there are no segments in your index, look in the data/index directory and confirm it has no segments files. Since the data directory can be customized, see the section Specifying a Location for Index Data with the dataDir Parameter for the location of your index files.

Note you will need to verify the indexes have been removed in every shard and every replica on every node of a cluster. It is not sufficient to only query for the number of documents because you may have no documents but still have index segments.

Once the indexes have been cleared, you can start reindexing by re-running the original index process.

A variation on this approch is to delete and recreate your collection using the updated schema, then reindex if you can afford to have your collection offline for the duration of the reindexing process.

Index to Another Collection

Another approach is to use index to a new collection and use Solr’s collection alias feature to seamlessly point the application to a new collection without downtime.

This option is only available for Solr installations running in SolrCloud mode.

With this approach, you will index your documents into a new collection that uses your changes and, once indexing and testing are complete, create an alias that points your front-end at the new collection. From that point, new queries and updates will be routed to the new collection seamlessly.

Once the alias is in place and you are satisfied you no longer need the old data, you can delete the old collection with the Collections API DELETE command.

One advantage of this option is that if you you can switch back to the old collection if you discover problems our testing did not uncover. Of course this option can require more resources until the old collection can be deleted.

Changes that Do Not Require Reindex

The types of changes that do not require or strongly indicate reindexing are changes that do not impact the index.

Creating or modifying request handlers, search components, and other elements of solrconfig.xml don’t require reindexing.

Cluster and core management actions, such as adding nodes, replicas, or new cores, or splitting shards, also don’t require reindexing.

发表回复