Pitfalls of Syncing MongoDB with MeiliSearch

Using meilisync

Thanks to the strong support from yzqzss.

tl;dr: I forked and significantly modified the program; see the fork.

Broken Frontend#

First, I tried using his Admin Console, https://github.com/long2ice/meilisync-admin

There are only AMD64 images, no ARM ones. Fine, who doesn't have an x86_64 machine somewhere? So I ran it on a machine that didn't host the database. At the time I hadn't really grasped that this meant pulling data from machine A onto machine B and then stuffing it into machine C; in any case, I was being foolish.

Download the image, run it... wait, why does a database sync admin also need MySQL and Redis URLs? I didn't understand, but I configured them anyway.

Then...

  1. No initial account (ref #7)

    The solution was to write the email and password into the database by hand, which also meant creating a bcrypt hash manually (a sketch of generating that hash follows after this list).

  2. Creating a MongoDB data source fails with Unknown option user (ref #11)

    The reason is that different databases need different parameter names in the backend configuration file; the web page always sent user, as PostgreSQL expects, and the backend stuffed it straight through without translating it, which blew up the sync program.

    Solved by modifying and replaying the request.

    I originally wrote a fix, but found that the parameter really is user for PostgreSQL, so whatever, I felt I couldn't untangle it properly. I also doubt anyone would still want to use this after reading the above... right?

  3. After setting everything up, it still wouldn't run... the backend threw the following error (heavily truncated):

    2025-04-11 21:38:42.156 | INFO     | uvicorn.protocols.http.httptools_impl:send:496 - 10.0.1.1:64000 - "POST /api/sync HTTP/1.1" 500
    ERROR:    Exception in ASGI application
    Traceback (most recent call last):
      File "/meilisync_admin/meilisync_admin/models.py", line 64, in meili_client
        self.meilisearch.api_url,
        ^^^^^^^^^^^^^^^^^^^^^^^^
    AttributeError: 'QuerySet' object has no attribute 'api_url'
    

    Uh... it seems like it's not just a simple configuration file issue...
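
For issue 1, generating the bcrypt hash is the easy part; here is a minimal sketch using the bcrypt package (where exactly the email and hash end up depends on meilisync-admin's own schema, so treat the column layout as an assumption, and the password is a placeholder):

# Generate a bcrypt hash to paste into the admin database by hand.
# Requires `pip install bcrypt`.
import bcrypt

password = b"change-me"
hashed = bcrypt.hashpw(password, bcrypt.gensalt())
print(hashed.decode())  # write this string into the password column manually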

So I switched to the CLI to rule out the possibility that the demonic Admin Console was the culprit, thinking this was the beginning of the end; it turned out to be only the end of the beginning.

Delayed Docker#

Found the actual sync program: https://github.com/long2ice/meilisync

I still tried Docker first; after all, it was the "Recommended" method. Using the compose file from his README:

version: "3"
services:
  meilisync:
    image: long2ice/meilisync
    volumes:
      - ./config.yml:/meilisync/config.yml
    restart: always

The pulled image had issues.

I hit TypeError: 'async for' requires an object with __aiter__ method, got list (#94), and also TypeError: 'async for' requires an object with __aiter__ method, got coroutine (#76).

Hmm, so I needed to use the dev version.

Oh right, as mentioned above, the MongoDB user field is called username, which differs from the template; who knows how I happened to land on username back then.

After several back-and-forths with the configuration file, I didn't want to deal with Docker anymore, so I switched to local CLI.

Local Python#

During the local CLI phase, everything seemed to be going in a good direction.

The few problems: although there is a pip install meilisync[mongo] extra for MongoDB, installing just that is not enough; running any command bluntly tells you what's still missing. In practice you have to pip install meilisync[all] to get everything.

There is also a small zsh quirk: square brackets are glob characters in zsh, so the extra has to be quoted.

$ pip install meilisync[mongo]
zsh: no matches found: meilisync[mongo]
$ pip install 'meilisync[mongo]'

Finally configured the config, everything seemed to be developing positively... or not?

Exploding Progress#

Referring to a reply in #17: when MongoDB is used as the data source, progress.json may not be generated automatically, which leads to a pile of TypeError: meilisync.progress.file.File.set() argument after ** must be a mapping, not NoneType.

The solution is to first touch progress.json and then write content like the following into it:

{"resume_token": {"_data": "8267FBA647000000022B042C0100296E5A10046F963A9EB7AB4D14B8CF191E8E5E8D67463C6F7065726174696F6E54797065003C696E736572740046646F63756D656E744B65790046645F6964006467FBA6470D168B18625CC73E000004"}}

Heavy Logs#

After testing and finding no major issues, I turned off debug in the configuration file, ran it under nohup, and went off to do other things, until a disk-space alert dragged me back to the shell. Syncing a database to a new location obviously eats space, and I was prepared for that. But I really didn't expect this disk to fill up first, growing even faster than the Meilisearch machine's: the harmless little syncer had dumped 6 GB of logs in my face.

That's impossible; I clearly wrote debug: false in the configuration file...

I checked the huge log and found that it recorded the content of every single synced record in plain text...

It was actually caused by the default plugin instance.

In the configuration file, there was the following content:

debug: false
plugins:
  - meilisync.plugin.Plugin

This plugin writes debug log lines for every event it handles, regardless of the debug value set in the configuration file.

There are three solutions:

  1. Do not reference this plugin

  2. Modify the plugin content (see the sketch below)

  3. Change the global log level:

    Meilisync uses loguru. According to its documentation you can set the level, and according to its environment-variable documentation you can set LOGURU_LEVEL, which takes the values shown in the table below:

    | Level name | Severity value | Logger method     |
    | ---------- | -------------- | ----------------- |
    | TRACE      | 5              | logger.trace()    |
    | DEBUG      | 10             | logger.debug()    |
    | INFO       | 20             | logger.info()     |
    | SUCCESS    | 25             | logger.success()  |
    | WARNING    | 30             | logger.warning()  |
    | ERROR      | 40             | logger.error()    |
    | CRITICAL   | 50             | logger.critical() |

    Then set the environment variable:

    On Unix:

    export LOGURU_LEVEL=INFO
    

    On Windows:

    PowerShell

    $env:LOGURU_LEVEL="INFO"
    

    CMD

    set LOGURU_LEVEL=INFO
    

Looking back, this saved a ton of space...
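
If you go the plugin route (option 1 or 2), here is a minimal sketch of a quiet replacement. It assumes the plugin interface shown in the meilisync README, i.e. async pre_event/post_event hooks that receive the event and must return it; you would reference it in the config by its dotted path instead of meilisync.plugin.Plugin:

# quiet_plugin.py: pass events through without logging anything per record.


class QuietPlugin:
    is_global = False  # one instance per sync, as in the upstream example

    async def pre_event(self, event):
        # No logger.debug() call here, so nothing hits the log for every record.
        return event

    async def post_event(self, event):
        return event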

Somewhere in all this debugging, I migrated the service from machine B to machine C, where Meilisearch lives. This later proved to improve speed significantly.

Index Concerns#

My data has its own id field, but it caused various issues in actual use, so I stuck with _id as the primary key.

Hands-on Script Modification#

Corrected Types#

Everything seemed to run normally for a while, and then the program simply died. After several rounds of checking, it turned out to be TypeError: Object of type ObjectId is not JSON serializable. At that point, the progress was at roughly 1,140,000 records.

There was a GitHub issue about it, #16, marked as "fixed." I checked the local code and it did contain the fix, but the problem apparently resurfaced in #102, and this time there was no response.

The hardest part wasn't fixing the code but reproducing the error. Since the progress resets after a crash, every run started from the beginning, and it took 20 minutes to reach the failing record, so several hours went into this... Also, the CPU was pegged the whole time... I should be grateful that the small provider's machine didn't trip a breaker...

By the way, I used sentry.io during this period. It's odd that the author specifically left a sentry.io hook in a sync tool, but it turned out to be genuinely useful. Maybe the author knew there would be bugs all over the place?

Local Modifications#

So I implemented detection and repair. At first I tried to do it in that plugin, but I couldn't make sense of its contents right away, so I decided to hack the source directly and added extra type checks. I didn't want to pip install over and over, so I edited the files in site-packages in place, which was quick and effective.

Not long after fixing ObjectId, I watched it run cleanly for over half an hour with the progress steadily moving forward. I was just about to go to sleep when another error hit, this time Object of type datetime is not JSON serializable, similar to #31. Same kind of check added. This one showed up around the 5,270,000 mark, roughly a third of the way through. After the fix, the sync continued and reached the halfway point, so I went to sleep with peace of mind.
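
This is not the exact patch I stuffed into site-packages, but the idea behind both fixes boils down to coercing the types that json.dumps chokes on before the document is handed to Meilisearch; a rough sketch:

# Recursively convert ObjectId and datetime values into JSON-safe strings.
from datetime import date, datetime

from bson import ObjectId  # ships with pymongo


def sanitize(value):
    if isinstance(value, ObjectId):
        return str(value)
    if isinstance(value, (datetime, date)):
        return value.isoformat()
    if isinstance(value, dict):
        return {key: sanitize(item) for key, item in value.items()}
    if isinstance(value, list):
        return [sanitize(item) for item in value]
    return value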

Then I woke up to a thunderstorm: at about two-thirds of the way it started throwing Client error '408 Request Timeout' for url 'http://127.0.0.1:7700/tasks/xxxx'. There was really no way around it; there were too many documents to index, the indexer couldn't keep up, and the backlog grew and grew until everything blew up. The strange thing is that this had supposedly been fixed too, in #13, yet it still exploded. When I checked how far behind it was, I found that every hour of syncing accumulated about half an hour of backlog... I should be grateful that the "Too many open files" problem never showed up...

Indexing was slow... wait, come to think of it, it shouldn't be building the index this eagerly in the first place!

Delayed Indexing#

I decided to try delaying it and discovered a scary fact: when creating an index, meilisync does not specify any field settings at all, so every field in the document ends up displayed and searchable, which burns a lot of resources for nothing. We absolutely don't need to index everything from the start. Instead, when syncing data from a remote source, no index settings should be applied until all of the source content has been inserted successfully, and the config file should let you specify which fields get indexed and how (searchable, sortable, filterable, none).
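
For reference, the per-field control described above maps to a single settings update on the Meilisearch index once the data is in; a sketch against the REST settings route (URL, key, index name, and attribute lists are placeholders that mirror my config further down):

# Apply searchable/filterable/sortable attributes after the initial sync.
import httpx

MEILI_URL = "http://127.0.0.1:7700"
API_KEY = "REDACTED"
INDEX = "REDACTED"

settings = {
    "searchableAttributes": ["name", "title", "content"],
    "filterableAttributes": ["id", "fid", "img", "ext", "now", "parent", "type", "userid"],
    "sortableAttributes": ["id", "ext", "now", "parent"],
}

resp = httpx.patch(
    f"{MEILI_URL}/indexes/{INDEX}/settings",
    json=settings,
    headers={"Authorization": f"Bearer {API_KEY}"},
)
resp.raise_for_status()
print(resp.json())  # Meilisearch queues this as a task; indexing runs asynchronously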

So I wrote it. The current logic is that no field indexes are built while the initial sync from the remote source is running, which improved the sync speed by ten thousand times.

Then I wrote a function that sets up the index after the sync by modifying the settings. Everything seemed perfectly normal until I woke up and found there were still no indexes.

This shouldn't be the case. After checking, I found that although I submitted the task to modify the index, it exploded halfway through:

Index `nmbxd`: internal: MDB_TXN_FULL: Transaction has too many dirty pages - transaction too big.

Finally, it wasn't a meilisync issue!

Memory Optimization#

On closer inspection, it turned out it hadn't been running for hours at all: it ran for a few minutes before the transaction got too big, then retried with smaller and smaller batches until there was nothing left to try.

Training!

As @yzqzss reminded me, the best implementation is to set the attributes first and then sync, which follows best practices, but that might bring back the 408. Solving the 408 properly would mean implementing a queue, and I was too lazy to write more code.
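
For what it's worth, that queue idea boils down to throttling: before pushing another batch, wait until Meilisearch's backlog of pending tasks has drained. A sketch against the tasks route (URL, key, and threshold are placeholders; check the endpoint's parameters against your Meilisearch version):

# Block until the number of enqueued/processing tasks drops below a threshold.
import time

import httpx

MEILI_URL = "http://127.0.0.1:7700"
API_KEY = "REDACTED"
MAX_BACKLOG = 20


def wait_for_backlog() -> None:
    while True:
        resp = httpx.get(
            f"{MEILI_URL}/tasks",
            params={"statuses": "enqueued,processing", "limit": 1},
            headers={"Authorization": f"Bearer {API_KEY}"},
        )
        resp.raise_for_status()
        if resp.json()["total"] <= MAX_BACKLOG:
            return
        time.sleep(5)  # give the indexer time to catch up before the next batch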

Later, I found a flag that reduces memory usage during indexing, --experimental-reduce-indexing-memory-usage (see https://github.com/meilisearch/meilisearch/issues/3603), and it really worked wonders; it succeeded in one go.

As for updates, running meilisync refresh will suffice.

After that, I wrote a systemd timer to schedule syncing, as follows.

# /etc/systemd/system/meilisync.timer
[Unit]
Description=Run meilisync refresh nmbxd weekly on Monday at 5 AM

[Timer]
# Run every Monday at 5:00 AM local time
OnCalendar=Mon *-*-* 05:00:00
Persistent=false

[Install]
WantedBy=timers.target
# /etc/systemd/system/meilisync.service
[Unit]
Description=Meilisync Refresh nmbxd
#After=network.target

[Service]
Type=oneshot
WorkingDirectory=/path/to/meilisync/config/
Environment='LOGURU_LEVEL=DEBUG' 
ExecStart=/etc/meilisync/meilisync/bin/meilisync refresh
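
With both units in place, a systemctl daemon-reload followed by systemctl enable --now meilisync.timer schedules the weekly refresh.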

Configuration file /path/to/meilisync/config/config.yml:

debug: false
progress:
  type: file
source:
  type: mongo
  host: REDACTED
  port: REDACTED
  username: 'REDACTED'
  password: 'REDACTED'
  database: REDACTED
meilisearch:
  api_url: http://127.0.0.1:REDACTED
  api_key: REDACTED
  insert_size: 10000
  insert_interval: 10
sync:
  - table: REDACTED
    index: REDACTED
    full: true
    pk: _id
    attributes:
      id: [filterable, sortable]
      fid: [filterable]
      img: [filterable]
      ext: [filterable, sortable]
      now: [filterable, sortable]
      name: [searchable]
      title: [searchable]
      content: [searchable]
      parent: [filterable, sortable]
      type: [filterable]
      userid: [filterable]
sentry:
  dsn: ''
  environment: 'production'