
In Progress
Paid on delivery
We run a Proxmox VE 7.4 cluster with approximately 10 production nodes hosting client VMs.

The Problem: Two nodes have been intermittently disconnecting from and reconnecting to the cluster (flapping). Each time a node rejoins, it resets the global cluster MTU to 1397 (instead of 1500), which triggers a cascade of failures:
- cpg_send_message failures
- Token timeouts (Token has not been received in ~30,000ms)
- pmxcfs leaving the CPG group
- Full cluster becoming unresponsive (UI freezes, VMs unreachable from management plane)

We have isolated the two problematic nodes as a temporary fix, but we need a permanent solution.

Root cause (suspected): Network instability on the bond0 interface causing a network loop, confirmed by:
vmbr0: received packet on bond0 with own address as source address

What we need:
- A Proxmox/Corosync configuration that prevents one flapping node from destabilizing the entire cluster
- Recommendations on [login to view URL] tuning (token, MTU handling, etc.)
- Best practices for bond0 configuration to prevent network loops
- Ideally a monitoring/alerting script that detects MTU changes and auto-isolates the offending node

Important notes:
- No direct server access will be provided
- Full logs will be shared (journalctl, dmesg, corosync logs)
- We can run any diagnostic commands you specify and share output
- We can test configuration changes during a maintenance window

Skills needed: Proxmox VE, Corosync, Linux networking, bonding/bridging, cluster administration

Budget: Open to offers from experienced Proxmox admins only.

Key observation (important for diagnosis): When we manually stop and mask corosync on the affected nodes:

    systemctl stop corosync
    systemctl mask corosync

the entire cluster recovers immediately: the UI returns, all other nodes go green, and cpg_send_message errors stop completely. This confirms the issue is isolated to these two nodes and their network behavior, not a general cluster misconfiguration. The moment they are removed from the cluster ring, everything stabilizes.

The #1 goal: We need a solution where, if any single node becomes unstable or starts flapping, it is automatically isolated and goes down WITHOUT affecting the rest of the cluster. Currently, one bad node can bring down all 10 nodes. This is unacceptable in a production environment.
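The MTU-change detection requested above could start from something like the following minimal sketch. It assumes that corosync/knet reports MTU changes in journal lines containing the text "Global data MTU changed to: <n>"; that pattern, the function name `find_mtu_changes`, and the sample lines are assumptions for illustration and are not taken from the poster's logs. The `baseline` argument should be whatever value the cluster reports while healthy.

```python
import re
from typing import Iterable, List, Tuple

# Assumed log pattern (hypothetical; verify against the real corosync journal).
MTU_CHANGE = re.compile(r"Global data MTU changed to:\s*(\d+)")

# Isolation commands quoted in the post, kept as data and NOT executed here.
ISOLATE_COMMANDS = [
    ["systemctl", "stop", "corosync"],
    ["systemctl", "mask", "corosync"],
]


def find_mtu_changes(lines: Iterable[str], baseline: int) -> List[Tuple[int, int]]:
    """Return (line_index, mtu) for every reported MTU that differs from baseline."""
    hits = []
    for index, line in enumerate(lines):
        match = MTU_CHANGE.search(line)
        if match and int(match.group(1)) != baseline:
            hits.append((index, int(match.group(1))))
    return hits


if __name__ == "__main__":
    # Hypothetical sample lines, e.g. saved output of `journalctl -u corosync`.
    sample = [
        "corosync[1234]:   [KNET  ] pmtud: Global data MTU changed to: 1397",
        "corosync[1234]:   [TOTEM ] A new membership was formed.",
    ]
    for index, mtu in find_mtu_changes(sample, baseline=1500):
        print(f"line {index}: MTU changed to {mtu}; candidate for isolation")
```

The sketch only reports drift. Whether a hit should trigger the isolation commands automatically is a policy decision that would need testing in the maintenance window.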
Project ID: 40418430
47 proposals
Remote project
Active 11 secs ago

The 'vmbr0: received packet on bond0 with own address as source address' line plus the MTU collapse to 1397 on every rejoin points to a layer-2 loop on the bond, and corosync is just the loudest victim. 1397 is knet's default fallback when it can't trust path MTU: knet gives up and renegotiates downward, which trips token loss because the configured size no longer fits, then pmxcfs leaves CPG and the UI freezes. I run Proxmox at home (PVE 8, ZFS, knet ring on a dedicated NIC) and have walked through this exact failure on bond mode 0. Round-robin against a non-LAG switch is the usual culprit, sometimes a duplicate cable into the same access VLAN without STP on the bridge.

M1 ($125, ~3 days): review the journalctl/dmesg/corosync logs you share, confirm root cause vs corosync timing, and hand back a hardened [login to view URL] (knet transport, token=10000, retransmits, pinned knet_mtu), a bond0/bridge change plan (mode review, miimon, STP, MTU lock), and the exact diagnostic commands you run with expected output, so we agree on the read before touching anything.

M2 ($125, ~4 days): a systemd watchdog that detects MTU drift, token loss, and membership flap and auto-fences the offender via corosync-cfgtool -k before pmxcfs leaves CPG, applied during your maintenance window with rollback, plus post-change validation and a short README.

One question before I dig into logs: are the two flapping nodes on the same switch and bond config as the stable eight, or do they differ on ports, cable runs, or firmware? That shifts whether M1 lands as a corosync fix or a network-side fix.
$250 USD in 7 days
1.6
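The membership-flap detection mentioned in the bid above can be sketched as a sliding-window counter. This is only an illustration under stated assumptions: the class name `FlapDetector`, the thresholds, and the idea of feeding it one event per join/leave are hypothetical and are not the bidder's actual watchdog.

```python
from collections import deque
from typing import Deque, Dict


class FlapDetector:
    """Flags a node that changes membership state too often within a time window."""

    def __init__(self, max_changes: int = 3, window_seconds: float = 300.0) -> None:
        # Both thresholds are illustrative defaults, not recommended values.
        self.max_changes = max_changes
        self.window_seconds = window_seconds
        self._events: Dict[str, Deque[float]] = {}

    def record_change(self, node: str, timestamp: float) -> bool:
        """Record one join/leave event; return True if the node looks like it is flapping."""
        events = self._events.setdefault(node, deque())
        events.append(timestamp)
        # Drop events that have fallen out of the sliding window.
        while events and timestamp - events[0] > self.window_seconds:
            events.popleft()
        return len(events) > self.max_changes


if __name__ == "__main__":
    detector = FlapDetector(max_changes=2, window_seconds=60.0)
    for moment in (0.0, 10.0, 20.0):
        flapping = detector.record_change("node-a", moment)
        print(moment, flapping)
```

A `True` result is where an isolation step would hook in, for example the `corosync-cfgtool -k` call named in the bid or the `systemctl stop corosync` commands from the post. The sketch deliberately performs no action itself.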
47 freelancers are bidding on average $145 USD for this job

I understand that stable cluster performance is crucial for your Proxmox VE 7.4 environment, and I am confident in my ability to provide a lasting solution to the intermittent flapping and network instability you're experiencing with your two nodes. With over a decade of experience as a Network, Cybersecurity, VoIP and System Engineer, I have dealt with similar complex issues in my career and successfully resolved them. In regards to your project, I possess an extensive understanding of Proxmox VE, Corosync, Linux networking, bonding/bridging, and cluster administration which are all key skills required here. Moreover, my proficiency with troubleshooting complex network issues has equipped me with the ability to pinpoint specific problems such as the MTU reset and network loop you mentioned.
$160 USD in 2 days
7.3

My name is Muhammad Abdullah, an experienced IT professional with a breadth of skills that make me a perfect fit for your project. During my eight years in the field, I have specialized in Linux system administration and virtualization, particularly using tools like Proxmox VE and Corosync. I've successfully managed and optimized various types of clusters, consistently ensuring their stability and performance.
$140 USD in 1 day
6.6

** THIS IS NOT AN AUTOMATIC BID ** Hello, the errors you have described point in four or five directions. The first step would be tuning your corosync configuration: based on what you have mentioned, it holds the ring for about 30,000 ms, whereas it should have a 5,000 ms timeout so that the ring is re-formed automatically. The second step would be the MTU lock, since the MTU reset is happening because of auto-negotiation with the rejoining node's interface MTU. Thirdly, the bond0 network loop appears to be an L2 loop. Finally, for auto-isolation I can write you a .sh script that lets a node isolate itself when parameters mismatch. All of this needs to be discussed, so I would appreciate it if you could drop me a message. I will also need you to run some diagnostic commands. The cost of this project will be fixed at 119.99 USD, and it will take about 2-3 hours to discuss and complete. Thank you!
$119.99 USD in 1 day
5.4
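When comparing the timeout figures quoted in the bids (5,000 ms here, token=10000 earlier), it may help to remember that the configured `token` value is not necessarily the timeout corosync actually uses. As I understand the corosync.conf(5) man page, when the nodelist has at least three nodes the real token timeout is `token + (number_of_nodes - 2) * token_coefficient`, with `token_coefficient` defaulting to 650 ms. The helper below is a small arithmetic sketch of that formula; the function name is hypothetical and the default should be checked against the installed corosync version.

```python
def effective_token_timeout_ms(
    token_ms: int, node_count: int, token_coefficient_ms: int = 650
) -> int:
    """Effective totem token timeout per the corosync.conf(5) formula (assumed default 650 ms)."""
    if node_count < 3:
        # The coefficient only applies with three or more nodes in the nodelist.
        return token_ms
    return token_ms + (node_count - 2) * token_coefficient_ms


if __name__ == "__main__":
    # 10 nodes, as in the cluster described in the post.
    print(effective_token_timeout_ms(5000, 10))   # 10200
    print(effective_token_timeout_ms(10000, 10))  # 15200
```

On a 10-node cluster, then, a configured token of 5,000 ms would behave as roughly 10.2 seconds and 10,000 ms as roughly 15.2 seconds, which is worth keeping in mind when weighing the tuning proposals.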

Hello Dear! Greetings from Toriqul Global Solutions! We are a reliable and experienced web design & development company led by Engineer Toriqul Islam (B.Sc. in CSE, RUET), with over 10 years of proven industry experience delivering quality digital solutions. At Toriqul Global Solutions, we build modern, user-friendly, high-performance websites focused on simplicity, elegance, and functionality to boost engagement and business growth. I have some questions.

Technologies We Use (Custom Website Development Using Full Stack Development): HTML5, CSS3, Bootstrap4, jQuery, JavaScript, AngularJS, React JS, Node.js, WordPress, PHP, Ruby on Rails, MYSQL, Laravel, .NET, CodeIgniter, React Native, SQL / MySQL, Mobile app development, Python, MongoDB and more skills.

What will you get?
- Responsive design on All Devices
- Reusable components
- Clean code
- Timely delivery, tested
- Clear communication

We would be honored to discuss your project requirements and help bring your ideas to life. Thank you for your time and consideration. Warm Regards, Toriqul Global Solutions
$70 USD in 3 days
4.8

Hi, I have gone through your project description and understand you’re looking to stabilize your Proxmox VE cluster and prevent unstable nodes from affecting the rest of the production environment. I have worked on Proxmox clusters, Corosync tuning, Linux networking, bonding/bridging, and HA infrastructure troubleshooting. I’ve handled issues related to node flapping, MTU inconsistencies, token timeouts, and cluster recovery in production environments. For this, I would analyze the logs from the affected nodes, review the bond0 and bridge configuration, tune the corosync settings, and implement safeguards so unstable nodes isolate automatically without impacting the remaining cluster. I can also help with monitoring scripts for MTU changes and node health detection. Best regards, Juan
$140 USD in 1 day
3.6

Dear Client, I’m an experienced full-stack developer with over 10 years of experience in web and mobile application development, specializing in building scalable, responsive, and high-performance solutions for diverse business needs. I understand you are looking for a reliable developer to build or improve your project, including web or mobile applications similar to CRM, dashboards, or APIs, and I have worked on similar solutions successfully. My skills in React, Vue, Laravel, PHP, Python, REST APIs, and database design ensure efficient and high-quality delivery. Feel free to share more details or ask questions. I’m ready to refine my approach to match your exact requirements. Looking forward to working with you. Best regards, Md Ruhul Ajom
$80 USD in 3 days
2.6

Hey, I just finished reading the job description and I see you are looking for someone experienced in Troubleshooting, Virtualization, System Admin, Ubuntu, Linux, Automation and Network Administration. This is something I can do. Please review my profile to confirm that I have great experience working with these tech stacks.

I have a few questions:
1. Are these all the requirements? If not, please share more detailed requirements.
2. Do you currently have anything done for the job, or does it have to be done from scratch?
3. What is the timeline to get this done?

Why Choose Me?
1. I have done more than 250 major projects.
2. I have not received a single piece of bad feedback in the last 5-6 years.
3. You will find 5-star feedback on my last 100+ major projects, which shows my clients are happy with my work.

Timings: 9am - 9pm Eastern Time (I work as a full-time freelancer)

I will share my recent work with you in the private chat due to privacy concerns. Please start the chat to discuss it further.

Regards, Adil.
$30 USD in 3 days
2.3

A 10-node Proxmox cluster this size usually breaks down at the corosync layer, storage I/O, or VLAN misconfiguration. I would pull the cluster logs, check quorum settings, and trace the issue to a specific node or config. I can start today and have a diagnosis within 24 hours, with fixes applied the same day in most cases. These estimates are based on the description as written. Final scope depends on what the logs show. Want to jump on a quick call?
$150 USD in 5 days
1.0

Hello! I’ve dealt with similar Proxmox cluster issues, where node instability caused significant disruptions. In one case, I implemented a solution that improved cluster resilience, ensuring that one unstable node wouldn’t impact the others. I can share the implementation details in chat. For your situation, I would recommend tuning the Corosync configuration, specifically focusing on token timeouts and MTU settings, while also creating a monitoring script that can isolate problematic nodes automatically. Have you considered exploring different bonding modes for your network interfaces to avoid loops? If you’re open, I can share my similar build and we can see if it fits your requirements. Looking forward to your thoughts!
$30 USD in 7 days
0.6

Hello, With over 10 years of professional experience in Systems Administration, I am well-versed and thoroughly experienced in the management and maintenance of complex server infrastructures, such as the one detailed in your project. I have an advanced understanding of Proxmox VE, Corosync, Linux networking, bonding/bridging and cluster administration - skills that align closely with your key project requirements. To address your current issue with node instability and the toggling MTU settings, I will utilize my deep understanding of Corosync's token handling and network configuration. This will enable me to design a streamlined, stable cluster architecture that isolates any unpredictable nodes without compromising the integrity and performance of the entire system. Additionally, I can offer insights into best practices for bond0 configuration to prevent network loops - an issue you've rightly identified as a suspected root cause. My approach is to build infrastructure solutions that adapt to their environments rather than forcing environments to accommodate infrastructure changes. Considering this, I'll recommend well-tested Corosync and Linux networking configurations tailored specifically for your needs. Thanks!
$30 USD in 3 days
0.0

We recently completed a Proxmox cluster optimization that improved stability by isolating unstable nodes automatically and preventing cluster-wide failures. I am new to Freelancer but have experience contributing to large scale projects alongside teams at Google and Amazon where I handled cluster reliability and network configuration challenges. Your project requires a clean, efficient, and scalable Corosync and network bonding configuration that isolates flapping nodes without impacting the entire cluster. The solution should be user friendly, prevent network loops, and include monitoring to detect MTU changes with automated node isolation. I focus on simplicity and structure to build long term, reliable solutions without unnecessary complexity, ensuring systems work correctly from day one. I am ready to start designing a robust setup aligned with your operational needs. If this aligns with your project, feel free to reach out to discuss scope and pricing. Regards Patrick
$200 USD in 1 day
0.0

The recurring MTU reset and subsequent cascading failures point to a Corosync configuration vulnerability exacerbated by the bond0 instability. I’ve previously resolved similar Proxmox cluster instability issues by focusing on Corosync parameter tuning and robust bonding configurations to prevent network loops, specifically addressing token handling and MTU consistency across nodes. Given the provided confirmation that stopping corosync resolves the issue, I’ll prioritize analyzing the logs and identifying the specific Corosync parameters and bond0 settings contributing to the flapping behavior.
$126 USD in 7 days
0.0

Hi there, I noticed you need a professional video editor who can combine AI tools with advanced editing to create cinematic, engaging content, I currently manage end to end video editing workflows where AI enhancement, background replacement, color grading, and high quality social media edits run smoothly, and I would love to do the same for you. I recently helped a content brand upgrade their videos using tools like Runway and Adobe AI, improving visual quality, pacing, and retention while keeping a clean, modern style. Your requirement for interior enhancement, lighting correction, smooth transitions, and platform ready output is exactly how I structure edits, making sure each video feels polished, engaging, and optimized for marketing performance. Best regards, Mobasher Reza
$140 USD in 3 days
0.0

Hi, This looks like a network + Corosync issue, especially with MTU dropping and the loop warning on bond0. I can review your logs and guide you to:
- Fix the bond/bridge config (likely loop or MTU mismatch)
- Stabilize Corosync so one bad node doesn’t affect the cluster
- Add a simple check to auto-isolate flapping nodes

I’ve handled similar Proxmox cluster instability cases, so I know where to look. Happy to start with your logs and give quick actionable fixes. Best regards, Alan
$50 USD in 2 days
0.0

Hi, I’ve worked with Proxmox clusters and this looks like a classic case where one unstable node is taking down the whole cluster due to Corosync + network behavior. The fact that everything stabilizes when you stop corosync on those nodes makes it clear: we need to make sure a bad node isolates itself instead of affecting others. I can help you:
- Review your logs and current config
- Fix corosync settings (token, MTU handling, stability)
- Identify and correct the bond0/network loop issue
- Set up a simple way to detect and isolate unstable nodes automatically

We can do it step by step with your logs and testing during your maintenance window. If you share your corosync config and bond setup, I can start right away.
$120 USD in 2 days
0.0

Hi, I can help stabilize your Proxmox cluster and prevent one flapping node from affecting the entire environment. My approach would focus on tuning Corosync settings, fixing bond0/network instability, and ensuring consistent MTU across the cluster. I would also implement a simple monitoring mechanism to detect node flapping or MTU changes and automatically isolate the affected node before it impacts quorum. Goal is clear: cluster should degrade safely, not fail globally. I’ll work via logs and guided commands since no direct access is available, and test changes during maintenance windows. Best, Haroon
$140 USD in 7 days
0.0

Riyadh, Saudi Arabia
Payment method verified
Member since Jun 5, 2018